From time to time, an instance can freeze or crash and become unresponsive. When this happens, you will be unable to access the instance over SSH: the connection will time out, or you might see "No route to host" messages.
An instance can crash or freeze up like this for several reasons; usually it’s due to resource problems on the instance, with system load increasing due to memory over-usage or possibly an application bug that might be stressing the instance in some other way. Occasionally, the problem may be due to issues with the underlying host server that runs the instance, or maybe a kernel bug.
On your Dashboard, the status icon for the instance in question might still be green. These icons are designed to show whether the last Chef run was successful; they are not aware of a system’s up/down status.
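Because the Dashboard icon does not reflect whether the instance is up, a quick reachability check from your own machine can help confirm the symptoms described above. The following is a minimal sketch in Python, not an Engine Yard tool: the function name `check_ssh_reachable` and its defaults are illustrative assumptions. It only attempts a TCP connection to the SSH port, so a "reachable" result does not prove that the SSH daemon itself is responsive.

```python
import errno
import socket


def check_ssh_reachable(host, port=22, timeout=5.0):
    """Attempt a TCP connection to the SSH port and describe the outcome.

    Hypothetical helper for illustration; it does not log in or run commands.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # No answer within the timeout, typical of a frozen instance.
        return "timed out"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            return "connection refused"
        return f"error: {exc}"
```

For example, `check_ssh_reachable("app.example.com")` (a placeholder hostname) returns one of the short status strings above, which you can note alongside the instance ID before terminating.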
Following up on failures
Customers often ask us to follow up on occurrences of frozen instances to see if we can find out what caused the problem. Unfortunately, when an instance is terminated, its log files disappear along with it, so we can rarely gain any insight into the reasons for a crash. However, we are sometimes able to find out from Amazon whether there was an issue with the host server running that instance, so it is a good idea to make a note of the instance ID from the Dashboard before you initiate a termination.
While we often cannot find the reason for a crash like this, it’s a good idea to make sure that email alerts are enabled for all your environments in the Dashboard. The alerts make you aware of any resource problems (running out of RAM, high I/O wait) or system overloading in the run-up to the instance failing, and having them to hand can sometimes help with figuring out what caused the problem.
Recovering from a frozen instance
Note: When rebuilding the instance or environment as per the instructions below, do not worry that the snapshot process will not work. Snapshots are handled by the machines that host the EBS disks, not the instances, so a snapshot can still run against an instance even if it has crashed.
After you have lost the ability to connect to the instance over SSH, the only way to recover is to terminate it and rebuild from snapshots. The way to do this differs, depending on the role of the problematic instance:
- Single-instance (or solo) environment: terminate and rebuild the whole environment (see the ELT note below).
- Application slave: terminate the degraded instance from the Dashboard, and then add a new instance (see the ITR note below).
- Application master: promote an app slave to be app master.
- Database replica: terminate the degraded instance from the Dashboard, and then add a new instance (see the ITR note below).
- Database master: promote a database replica to be database master.
- Utility instance: terminate the degraded instance from the Dashboard, and then add a new instance using the most recent snapshot (see the ITR note below).
Notes:
- "Environment Level Terminate" (ELT) and "Instance Terminate reference" (ITR) are different actions. ELT is the "Terminate" button located at the top of the environment's page, above its name. ITR is the "Terminate" link next to each instance ID.
- In some cases (and depending on your Support level), Engine Yard might initiate an instance takeover on your behalf. In that case, Engine Yard Support will contact you.
- The environment's settings for takeover preference and failed app master behavior can also affect the way app takeovers occur.
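The role-by-role guidance above can be summarized as a simple lookup, which may be handy in an internal runbook or script. This is only a sketch: the role keys (`solo`, `app_slave`, and so on) and the function name are hypothetical labels chosen for illustration, not identifiers from the Engine Yard Dashboard or API.

```python
# Recovery steps keyed by instance role, taken from the list above.
# The keys are illustrative labels, not official role names.
RECOVERY_ACTIONS = {
    "solo": "Terminate and rebuild the whole environment (ELT).",
    "app_slave": "Terminate the instance (ITR), then add a new instance.",
    "app_master": "Promote an app slave to be app master.",
    "db_replica": "Terminate the instance (ITR), then add a new instance.",
    "db_master": "Promote a database replica to be database master.",
    "util": "Terminate the instance (ITR), then add a new instance "
            "using the most recent snapshot.",
}


def recovery_action(role):
    """Return the documented recovery step for a frozen instance's role."""
    try:
        return RECOVERY_ACTIONS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role!r}") from None
```

Calling `recovery_action("db_master")` returns the promotion step, while an unrecognized role raises `ValueError` rather than guessing at an action.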
About custom Chef recipes
Because instances or environments need to be terminated and rebuilt, it is important that any customizations that you have made to your environment have been carried out using custom Chef recipes and not just done manually. When Chef recipes have been used, a rebuild should be painless and can get you running again quickly, without any intervention required. Make sure that you are familiar with our custom Chef docs and that you are using recipes for your custom configurations.