Getting pretty close to the fresh node being ready (downloading all those Docker images takes a long time, which is why it's important to always have at least one healthy node during any maintenance operation). The unhealthy node has been successfully quarantined for later investigation. If this fresh node doesn't work, I will try spinning up a third node based on an AMI from before https://github.com/raxod502/riju/pull/86 was merged.
Posted Jul 27, 2021 - 02:29 UTC
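The "always keep at least one healthy node" rule above could be checked before any node is pulled from rotation. The following is only a rough sketch, not the tooling actually used here: it assumes a dict shaped like the response of the AWS Auto Scaling `DescribeAutoScalingGroups` call, and the function names, group name, and instance IDs are all hypothetical. Note that the autoscaling group's own health status may not catch application-level problems such as a hung Docker daemon.

```python
def healthy_in_service(asg_response, group_name):
    """Return IDs of instances that are both InService and Healthy.

    `asg_response` is assumed to have the shape returned by
    DescribeAutoScalingGroups (AutoScalingGroups -> Instances).
    """
    for group in asg_response.get("AutoScalingGroups", []):
        if group.get("AutoScalingGroupName") == group_name:
            return [
                inst["InstanceId"]
                for inst in group.get("Instances", [])
                if inst.get("LifecycleState") == "InService"
                and inst.get("HealthStatus") == "Healthy"
            ]
    return []


def safe_to_remove(asg_response, group_name, instance_id, minimum=1):
    """True if removing `instance_id` still leaves `minimum` healthy nodes."""
    remaining = [
        i for i in healthy_in_service(asg_response, group_name)
        if i != instance_id
    ]
    return len(remaining) >= minimum


# Hypothetical example data: one healthy node, one unhealthy node.
example = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "example-group",
            "Instances": [
                {"InstanceId": "i-aaa", "LifecycleState": "InService",
                 "HealthStatus": "Healthy"},
                {"InstanceId": "i-bbb", "LifecycleState": "InService",
                 "HealthStatus": "Unhealthy"},
            ],
        }
    ]
}

print(safe_to_remove(example, "example-group", "i-bbb"))  # unhealthy node
print(safe_to_remove(example, "example-group", "i-aaa"))  # last healthy node
```

With this example data, removing the unhealthy node is allowed, while removing the only healthy node is refused.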
Update
Detaching the unhealthy node from the autoscaling group and taking it offline without terminating it, so that we can investigate what happened to it later. For some reason, system commands like 'docker exec' and 'docker kill' are hanging indefinitely on that node, and other odd behavior showed up while I was trying to diagnose the problem.
Posted Jul 27, 2021 - 02:19 UTC
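For reference, detaching a node from an autoscaling group without terminating it might look roughly like the sketch below. This is an illustration under assumptions, not a record of the commands actually run: it uses the boto3 `detach_instances` call, and the helper names, instance ID, group name, and region are hypothetical. Leaving `ShouldDecrementDesiredCapacity` as `False` keeps the desired capacity unchanged, so the group launches a replacement while the detached instance keeps running.

```python
def build_detach_request(instance_id, group_name, replace=True):
    """Build keyword arguments for the Auto Scaling DetachInstances call.

    With replace=True the desired capacity is left unchanged, so the
    group starts a fresh instance to take the detached one's place.
    """
    return {
        "InstanceIds": [instance_id],
        "AutoScalingGroupName": group_name,
        "ShouldDecrementDesiredCapacity": not replace,
    }


def detach_for_investigation(instance_id, group_name, region):
    """Detach the instance; it is not terminated by this call."""
    import boto3  # third-party; imported lazily so the helper above works without it

    client = boto3.client("autoscaling", region_name=region)
    return client.detach_instances(
        **build_detach_request(instance_id, group_name)
    )


# Hypothetical identifiers, shown only to illustrate the request shape.
request = build_detach_request("i-0123456789abcdef0", "example-group")
print(request)
```

Actually calling `detach_for_investigation` needs AWS credentials and a real instance ID; taking the node offline for users (for example, removing it from a load balancer) is a separate step not shown here.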
Update
It seems something went wrong with a new node added to the autoscaling group just after the previous one was terminated (lucky timing). Waiting for another fresh one to spin up in case that fixes the problem.