Web interface down

Incident Report for Riju

Postmortem

https://github.com/raxod502/riju/issues/89

Posted Jul 27, 2021 - 03:24 UTC

Resolved

Looks like we are back up.

Posted Jul 27, 2021 - 02:30 UTC

Update

The fresh node is almost ready. Downloading all those Docker images takes a long time, which is why it's important to always have at least one healthy node during any maintenance operation. The unhealthy node has been quarantined for later investigation. If the fresh node doesn't work, I will try spinning up a third node from an AMI built before https://github.com/raxod502/riju/pull/86 was merged.

Posted Jul 27, 2021 - 02:29 UTC

Update

Detaching the unhealthy node from the autoscaling group and taking it offline without terminating it, so that we can investigate what happened to it later. Somehow, commands like 'docker exec' and 'docker kill' are hanging indefinitely, and I saw other unexplained behavior while trying to diagnose the problem.

Posted Jul 27, 2021 - 02:19 UTC

Update

It seems something went wrong with a new node that was added to the autoscaling group just after the previous one was terminated (lucky timing). Waiting for another fresh node to spin up in case that fixes the problem.

Posted Jul 27, 2021 - 02:16 UTC

Investigating

The web interface is down due to an unhealthy node.

Posted Jul 27, 2021 - 02:10 UTC

This incident affected: Web interface.