Web interface is down
Incident Report for Riju
Resolved
I've terminated the current instance and will let the ASG bring up another one. Unfortunately, at this time I simply do not have the tools to address the problem in a way that I can guarantee will stick. As such, I'm going to prioritize my sleep over Riju's uptime and let things sit overnight. UptimeRobot will continue to update the status page according to detected uptime, but I'm closing the incident.

I think the best next step would be to impose dynamic CPU and memory limits that scale with the number of running containers. This would allow us to guarantee that the system never runs out of resources, which will hopefully make it impossible to get into whatever weird state these instances have been getting into.
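Roughly what I have in mind, as a sketch only (the container label, resource budget, and the alpine test image below are placeholders, not actual Riju configuration): divide a fixed budget by the number of running user containers and pass the result to 'docker run'.

    total_mem_mb=3072
    total_cpus=2
    n=$(docker ps -q --filter "label=riju.user-session" | wc -l)   # label is assumed
    n=$(( n < 1 ? 1 : n ))
    mem_per=$(( total_mem_mb / (n + 1) ))
    cpus_per=$(awk -v c="$total_cpus" -v n="$n" 'BEGIN { printf "%.2f", c / (n + 1) }')
    docker run --rm --memory="${mem_per}m" --cpus="$cpus_per" alpine:3 sh -c 'echo ok'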
Posted Jul 29, 2021 - 05:55 UTC
Investigating
Can confirm, very much not up. I think it's the same root cause as before.
Posted Jul 29, 2021 - 05:29 UTC
Monitoring
Found an open bug report about 'docker exec' hanging, which should basically never happen: https://github.com/docker/for-linux/issues/543.

Update: interestingly enough, if I wait long enough, 'docker exec' spits out 'failed to resize tty, using default size' and then continues to hang. But now control-C doesn't work either. I killed it from another tty.

sudo dmesg -T | egrep -i 'killed process' returns no results, so it's probably not the OOM killer at fault.

Now I can in fact kill the container with 'docker kill', so that's something. And then 'systemctl restart riju' brings things back up. Idk man.
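For reference, the recovery sequence that worked this time was roughly the following (the container name is a placeholder; 'riju' is the systemd unit mentioned above):

    docker kill riju-app                                   # container name is a placeholder
    sudo systemctl restart riju                            # bring the service back up
    curl -sS --max-time 10 localhost:80 >/dev/null && echo serving   # sanity check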
Posted Jul 29, 2021 - 05:18 UTC
Investigating
ps looks pretty normal: we are running the Docker container itself plus some normal-looking processes inside it. 'docker ps' looks legit as well, except that we are not actually running any user containers, despite what ps claims (it shows some invocations of docker-exec.py). 'docker exec' on the app container is hanging.
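A quick way to confirm the hang without wedging another terminal (container name is a placeholder) is to wrap the exec in a timeout; an exit status of 124 means it never returned:

    timeout 10 docker exec riju-app true; echo "exit status: $?"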
Posted Jul 29, 2021 - 05:14 UTC
Update
I don't understand what's going on. I have a live SSH connection into the server; we're only using 600 MB of RAM, and CPU load is a few percent. By all accounts everything should be fine, but for some reason we're not serving traffic (curl localhost:80 hangs).
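For the record, the checks here are roughly the following (all standard tools, nothing Riju-specific):

    free -m                               # memory usage
    uptime                                # load average
    ss -tlnp | grep ':80'                 # confirm something is still listening on port 80
    curl -sS --max-time 10 localhost:80   # hangs / times out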
Posted Jul 29, 2021 - 05:10 UTC
Update
Seems like we've got degraded performance; I'm seeing CPU spikes.
Posted Jul 29, 2021 - 05:05 UTC
Update
We appear to be up.
Posted Jul 29, 2021 - 04:55 UTC
Monitoring
Will wait about 25 minutes and check back to see if we're back up.
Posted Jul 29, 2021 - 04:31 UTC
Update
We're getting no metrics, and nothing in the console indicates that triggering a reboot is doing anything. In the future we really need to do two things:
- get more logs, as already planned, so that next time we can look at activity before the failure
- do some intentional load testing and see if we can reproduce the failure condition on a staging server (see the sketch below)
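Even something this simple could be a starting point for the load test (the staging hostname is a placeholder):

    for i in $(seq 1 200); do
      curl -sS --max-time 10 -o /dev/null -w "%{http_code}\n" "http://staging.example.invalid/" &
    done
    wait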

Anyway, for now I think the best I can do is terminate the instance and let the ASG bring up another one.
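Terminating through the ASG API means the group replaces the instance on its own; roughly (the instance ID is a placeholder):

    aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id i-0123456789abcdef0 \
      --no-should-decrement-desired-capacity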
Posted Jul 29, 2021 - 04:30 UTC
Update
I'm not convinced that the reboot actually took. Rebooting again.
Posted Jul 29, 2021 - 04:27 UTC
Update
Memory usage has been high and SSH access appears to be down. I've restarted the server to see if that helps things.
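With SSH unresponsive, one way to force the restart is through the EC2 API, roughly (the instance ID is a placeholder):

    aws ec2 reboot-instances --instance-ids i-0123456789abcdef0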
Posted Jul 29, 2021 - 04:25 UTC
Investigating
Got paged for the web interface being down.
Posted Jul 29, 2021 - 04:16 UTC
This incident affected: Web interface.