Cachix, the Nix binary cache as a service, was down twice:
- On Aug 22nd from 16:55 until 18:55 UTC (120 minutes)
- On Aug 23rd from 20:01 until 20:09 UTC (8 minutes)
On the 22nd there was no action from my side; the service recovered by itself. I did have monitoring configured and I received email alerts, but I did not notice them.
I spent most of the 23rd gathering data and evidence on what went wrong. Just before monitoring stopped receiving data at 16:58 UTC, white-box system monitoring showed:
- Outgoing bandwidth skyrocketed to 23MB/s
- Resident memory went through the roof to ~90%
On the 23rd I saw immediately that the service was down and rebooted the machine.
I spent a significant amount of time trying to determine whether a specific request caused this. It seems likely that it was just an overload, although I have not proved this theory.
a) The server side is implemented in GHC Haskell, so I have enabled an optimization setting that the GHC wiki on Performance says is indistinguishable from -O1. Even so, in the last week I've seen an approximately 10% reduction in resident memory and, most importantly, fewer memory spikes. Again, no hard evidence; time will tell.
b) Most importantly, production now runs with GHCRTS='-M2G', which limits the overall heap to 2G of memory, so we no longer depend on the Linux OOM killer to handle out-of-memory situations. It is not entirely clear to me why the machine was unresponsive for two hours, since the OOM killer should have kicked in, but during that period not a single monitoring datapoint was sent.
c) I have configured EKG to send GC stats to Datadog, so if it happens again, that should provide better insight into what is going on with memory consumption.
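To make items b) and c) more concrete, here is a minimal sketch of how a GHC program can read the runtime's own GC counters and react when the `-M` heap limit is hit. It is not the Cachix code and does not use EKG or Datadog; it only relies on `GHC.Stats` and `Control.Exception` from `base`. The names `runServer`, `reportGcStats`, `formatGcLine` and `bytesToMiB` are hypothetical, and it assumes a GHC recent enough to raise `HeapOverflow` when `-M` is exceeded. Which thread receives that exception is not something the sketch depends on beyond handling it in the main thread.

```haskell
import Control.Exception (AsyncException (HeapOverflow), catch, throwIO)
import Data.Word (Word64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)

-- | Convert a byte count to whole mebibytes for readable log lines.
bytesToMiB :: Word64 -> Word64
bytesToMiB b = b `div` (1024 * 1024)

-- | Render a few GC counters (peak live data, peak memory in use,
-- number of major collections) as a single log line.
formatGcLine :: Word64 -> Word64 -> Word64 -> String
formatGcLine maxLive maxMem majors =
  "max_live=" ++ show (bytesToMiB maxLive)
    ++ "MiB max_mem_in_use=" ++ show (bytesToMiB maxMem)
    ++ "MiB major_gcs=" ++ show majors

-- | Log the current GC counters to stderr.
-- The counters are only collected when the program runs with +RTS -T.
reportGcStats :: IO ()
reportGcStats = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then hPutStrLn stderr "RTS stats are disabled; run with +RTS -T"
    else do
      s <- getRTSStats
      hPutStrLn stderr $
        formatGcLine
          (max_live_bytes s)
          (max_mem_in_use_bytes s)
          (fromIntegral (major_gcs s))

-- | Hypothetical stand-in for the real server loop.
runServer :: IO ()
runServer = putStrLn "serving requests (placeholder)"

-- | Log and exit on heap overflow; re-throw every other async exception.
onAsync :: AsyncException -> IO ()
onAsync HeapOverflow = do
  hPutStrLn stderr "heap limit (-M) exceeded, shutting down"
  reportGcStats
  exitFailure
onAsync other = throwIO other

main :: IO ()
main = do
  runServer `catch` onAsync
  reportGcStats
```

Assuming the file is called `Main.hs`, it can be built with `ghc -rtsopts Main.hs` and run with `GHCRTS='-M2G -T' ./Main`. The `-rtsopts` link flag is needed for the runtime to accept `-M` from the environment, and `-T` turns on the statistics that `getRTSStats` reads.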
Countermeasures to be taken
1) Use a service like PagerDuty to be alerted immediately on the phone
2) Upgrade the Datadog agent to version 6, which allows more precise per-process monitoring
So far I am quite happy with how Haskell works in production. I have taken the Well-Typed training on GHC performance, and if this turns out to be a space leak, I am confident that I will find it.
The only thing that saddens me, coming from Python, is that GHC has poor profiling options for long-running programs. Compiling with GHC's profiling options slows the program down significantly. There is unmerged work on making the GHC eventlog useful for such cases, but the state of this work is unclear.
So there it is, the first operational issue with Cachix. Despite this issue, I am happy to have made choices that both allow me to respond quickly to the needs of the Nix community and let me further improve and stabilize the code with confidence as the product matures.
Speaking of maturing the product, I will share another announcement soon!