Web interface down
Incident Report for Riju
Postmortem
Posted Jul 27, 2021 - 03:24 UTC

Resolved
Looks like we are back up.
Posted Jul 27, 2021 - 02:30 UTC
Update
Getting pretty close to the fresh node being ready. Downloading all those Docker images takes a long time, which is why it's important to always have at least one healthy node during any maintenance operation. The unhealthy node has been successfully quarantined for later investigation. If this fresh node doesn't work, we will try spinning up a third node based on an AMI from before https://github.com/raxod502/riju/pull/86 was merged.
Posted Jul 27, 2021 - 02:29 UTC
Update
Detaching the unhealthy node from the autoscaling group and taking it offline without terminating it, so that we can investigate what happened to it later. For some reason, commands like 'docker exec' and 'docker kill' are hanging indefinitely on that node, and there was other unexplained behavior while I was trying to diagnose the problem.
Posted Jul 27, 2021 - 02:19 UTC
Update
Seems like something went wrong with a new node that was added to the autoscaling group just after the previous one was terminated (lucky timing). Waiting for another fresh one to spin up in case that fixes the problem.
Posted Jul 27, 2021 - 02:16 UTC
Investigating
Web interface is down due to an unhealthy node.
Posted Jul 27, 2021 - 02:10 UTC
This incident affected: Web interface.