Cooling System Failure / Systems Context

Operations Overview

Seekers Spirits is a craft distillery in Phnom Penh, Cambodia. At the time of this incident, production ran on two stills: a 300L legacy still and a 2000L still installed in early 2023. The 300L still had a stable, established operating pattern. The 2000L still handled larger runs, but its cooling system depended on a reservoir originally sized for the smaller equipment. I owned quality control end-to-end, from tasting through to final release decisions.

Still Configuration

The 2000L still used a main condenser and a pre-cooler, both connected to a shared coolant reservoir. That reservoir had been designed for the 300L still and was reused without modification. Ambient temperatures in Cambodia typically range from 28 to 35°C, which reduces cooling efficiency. During longer runs, heat entered the system faster than it could be removed.

This created a gradual rise in coolant temperature during long runs, with no mechanism to detect or record it.

Observability Gaps

Two observability gaps defined both the failure and the investigation.

GAP 1
Display-only sensors with no retention
The 2000L still displayed liquid output temperature in real time, but nothing was written to storage. During Batch 50, the dashboard was not checked because visible condensation was taken as a sign that everything was working. The only recorded temperature was 61.8°C, noted after the run had finished, when the distillate felt unusually hot. By then, the batch was complete and no in-run readings had been retained.

GAP 2
Coolant reservoir not measured during production
Coolant reservoir temperature was never measured during production. The 1 to 2°C per hour drift that caused the failure was only identified later through post-incident testing. During Batch 50, there was no reservoir reading to monitor, no defined threshold, and no intervention point. The data did not exist.

Together, these gaps meant there was no production-time record to analyse. The failure could not be observed as it developed and had to be reconstructed after the fact.
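As an illustration of what retention could look like, the sketch below appends each temperature reading to a CSV file so that a run leaves a record behind. It is a minimal sketch under stated assumptions, not the system in use at the distillery: the function name append_reading, the column names, the sensor labels, and the file path are all hypothetical, and how a reading would be obtained from the still's sensor is not covered.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the retained record.
FIELDS = ["timestamp_utc", "batch_id", "sensor", "temp_c"]


def append_reading(log_path, batch_id, sensor, temp_c, timestamp=None):
    """Append one temperature reading to a CSV log.

    A header row is written only when the file is new or empty, so the
    same file can be reused across the whole run.
    """
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    ts = timestamp or datetime.now(timezone.utc)
    row = {
        "timestamp_utc": ts.isoformat(),
        "batch_id": str(batch_id),
        "sensor": sensor,
        "temp_c": f"{temp_c:.1f}",
    }
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Called once per reading, for example append_reading("batch_50.csv", 50, "output_liquid", 61.8), this would have preserved the single value that was noted by hand, along with every reading before it. An append-only file is chosen here because it needs no database and survives a restart mid-run.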

In practice, detection depended on someone checking the system during the run. During Batch 50, that did not happen.
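Detection that does not rely on a person checking could be sketched as a periodic check over logged reservoir readings. The example below estimates the drift rate as a least-squares slope and flags either a high reading or a fast drift. Everything in it is an assumption for illustration: the function names, the reading format of (hours elapsed, temperature in °C), and both limits, since no threshold had been defined at the time of the incident.

```python
def drift_c_per_hour(readings):
    """Estimate temperature drift in degrees C per hour.

    readings is a list of (hours_elapsed, temp_c) pairs. The result is the
    least-squares slope of temperature against elapsed time.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to estimate drift")
    mean_t = sum(t for t, _ in readings) / n
    mean_c = sum(c for _, c in readings) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in readings)
    den = sum((t - mean_t) ** 2 for t, _ in readings)
    if den == 0:
        raise ValueError("readings must span more than one point in time")
    return num / den


def check_reservoir(readings, max_temp_c, max_drift_c_per_hour):
    """Return a list of alert messages; an empty list means no alert.

    Both limits are hypothetical parameters and would need to be set
    from testing on the actual equipment.
    """
    alerts = []
    if not readings:
        return alerts
    latest = readings[-1][1]
    if latest >= max_temp_c:
        alerts.append(
            f"reservoir at {latest:.1f} C, at or above limit {max_temp_c:.1f} C"
        )
    if len(readings) >= 2:
        rate = drift_c_per_hour(readings)
        if rate >= max_drift_c_per_hour:
            alerts.append(
                f"drift {rate:.1f} C/h, at or above limit "
                f"{max_drift_c_per_hour:.1f} C/h"
            )
    return alerts
```

With readings taken at a fixed interval, a drift in the 1 to 2°C per hour range found in post-incident testing would raise the drift alert well before the absolute limit is reached, which is the intervention point that was missing during Batch 50.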