From time to time, an instance can freeze or crash and become unresponsive. When this happens, you will be unable to access the instance over SSH (the connection will time out, or you might see "No route to host" messages).
An instance can crash or freeze up like this for several reasons; usually it’s due to resource problems on the instance, with system load increasing due to memory over-usage or possibly an application bug that might be stressing the instance in some other way. Occasionally, the problem may be due to issues with the underlying host server that runs the instance, or maybe a kernel bug.
When viewing your Dashboard, you will notice that the status icon for the instance in question might still be green - this is because these icons are designed to show you whether the last Chef run was successful; they are not aware of a system’s up/down status.
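Because the Dashboard icon does not reflect up/down status, a quick TCP probe of the SSH port can help confirm the symptoms described above. The following is a minimal sketch, not an Engine Yard tool: the function names `describe_failure` and `probe_ssh` are hypothetical, and it assumes SSH listens on port 22.

```python
import errno
import socket


def describe_failure(exc):
    """Map a connection error to a short, human-readable symptom."""
    if isinstance(exc, socket.timeout):
        return "timed out"
    if isinstance(exc, ConnectionRefusedError):
        return "connection refused"
    if getattr(exc, "errno", None) == errno.EHOSTUNREACH:
        return "no route to host"
    return "other error: %s" % exc


def probe_ssh(host, port=22, timeout=5.0):
    """Attempt a TCP connection to the SSH port and report the outcome.

    Assumption: port 22 is the SSH port; adjust if yours differs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except OSError as exc:
        return describe_failure(exc)
```

Calling `probe_ssh("your-instance-hostname")` returns one of the strings above. A result of "timed out" or "no route to host" matches the frozen-instance symptoms, while "reachable" suggests the problem lies elsewhere.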
Following up on failures
Customers often ask us to follow up on occurrences of frozen instances to see if we can find out what caused the problem. Unfortunately, when an instance is terminated, its log files disappear along with it, so we can rarely gain any insight into the reasons for a crash. However, we are sometimes able to find out from Amazon whether there was an issue with the host server running that instance, so it is a good idea to make a note of the instance ID from the Dashboard before you initiate a termination.
While we often cannot find out the reason for a crash like this, it is a good idea to make sure that you have email alerts enabled for all your environments in the Dashboard. This should make you aware of any resource problems (running out of RAM, high I/O wait) or system overloading in the run-up to the instance failing. Having these alerts to hand can sometimes help with figuring out what caused the problem.
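As an illustration of the kind of resource problem mentioned above, the sketch below checks how much memory is still available from the text of `/proc/meminfo`. It is a hypothetical example, not how Engine Yard's alerts are implemented: the function names and the 10% threshold are assumptions, and it relies on the `MemAvailable` field, which only newer Linux kernels report.

```python
def memory_available_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal.

    Expects the contents of /proc/meminfo, where each line looks like
    "MemTotal:       8000000 kB".
    """
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])  # first field is the number
    return 100.0 * values["MemAvailable"] / values["MemTotal"]


def is_memory_low(meminfo_text, threshold_percent=10.0):
    """True when available memory falls below the (assumed) threshold."""
    return memory_available_percent(meminfo_text) < threshold_percent
```

On a Linux instance you could pass in the result of `open("/proc/meminfo").read()` and log the outcome somewhere off the instance, since local log files are lost on termination.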
Recovering from a frozen instance
After you have lost the ability to connect to the instance over SSH, there are three possible recovery options:
- Ask Engine Yard Support to issue a reboot of the instance at AWS. This reboots the existing instance on the current hardware and can be effective if the issue is caused by the instance itself and not the host hardware.
- Use the Restart button for the instance on the environment page of the dashboard to restart the instance on new hardware - see "EBS backed root volume instances" below.
- Terminate the instance and rebuild it from snapshots - see "Host backed root volume instances" below.
EBS backed root volume instances
Most instances that run the Engine Yard stable-v2 stack or above and are 3rd Generation (C3/M3) or newer have EBS backed root volumes, meaning that the instance is not tied to a specific host. This allows the instance to be restarted on new hardware, inheriting all EBS volumes and thus all configuration, and avoids the need to rebuild the instance.
Restarts are triggered with the Restart button listed next to each instance on the environment page of the dashboard. Restarting an instance takes it offline for up to 10 minutes. If the instance does not have an EIP attached, a restart will change its public hostname/IP. If the instance runs under a VPC, its private hostname/IP address is kept across a restart; under EC2-Classic it is not. If the Restart button has no apparent effect, the instance is probably a host backed root volume instance, and you should follow the instructions in the "Host backed root volume instances" section below.
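The addressing rules in the paragraph above can be summarised in a few lines. This is only an illustrative sketch; the function name `addresses_after_restart` and its parameters are hypothetical and not part of any Engine Yard API.

```python
def addresses_after_restart(has_eip, in_vpc):
    """Summarise which addresses change when an instance is restarted.

    has_eip: True if an EIP is attached to the instance.
    in_vpc:  True if the instance runs under a VPC (False for EC2-Classic).
    """
    return {
        # Without an EIP, the public hostname/IP changes on restart.
        "public_address_changes": not has_eip,
        # Only VPC instances keep their private hostname/IP.
        "private_address_changes": not in_vpc,
    }
```

For example, `addresses_after_restart(has_eip=False, in_vpc=True)` reports that the public address changes while the private address is kept, which is the case where DNS records pointing at the public hostname need updating.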
Consider the role of the instance before issuing a Restart, and take the following into account:
- For a single-instance (or solo) environment: If no EIP is attached then the public hostname will change, so any DNS records pointing to it will need updating.
- For an application slave: Usually application slaves have no EIP attached, so the public IP will change on restart, as will the private IP if the instance does not run under a VPC. Please monitor the environment page of the dashboard to ensure Chef runs (automatically) after the instance restart, then check the instance logs to ensure it is receiving traffic, and run an Apply should it not.
- For an application master: If the load balancer in use is the default HAProxy then you should promote an app slave to be app master. If the environment makes use of an AWS xLB then the master can be restarted in the same way as a slave, so long as Application Master Takeovers are disabled first.
- For a database replica: Replication should restart automatically, but please watch for replication alerts in order to be safe.
- For a database master: A database master should only be restarted as a last resort, when it is unresponsive and the site is down, because of the risk of data loss and replication issues. We advise contacting Support before performing a restart. Most database master instances do not have an EIP attached, so the public IP will change on restart, as will the private IP if the instance does not run under a VPC. The change in IP address(es) and physical host will require all database connections to be recreated, so we recommend a deploy of your application to restart the app server workers and background jobs (dependent on deploy hooks). More information can be found in this document.
- For a utility instance: Usually utility instances have no EIP attached, so the public IP will change on restart, as will the private IP if the instance does not run under a VPC. Please monitor the environment page of the dashboard to ensure Chef runs (automatically) after the instance restart, then check that any application configuration files referring to the utility instance have updated, and run an Apply should they have not.
Host backed root volume instances
Instances on legacy Engine Yard stack versions and older AWS instance generations have root volumes that use the storage on the host hardware at AWS. This means they are tied to that specific host and cannot be restarted on new hardware; instead, they must be replaced.
Note: When rebuilding the instance or environment as per the instructions below, do not worry that the snapshot process will not work. Snapshots are handled by the machines that host the EBS disks, not the instances, so a snapshot can still run against an instance even if it has crashed.
If the unresponsive instance is:
- A single-instance (or solo) environment, then terminate and rebuild the whole environment (see ELT in the notes below).
- An application slave, then terminate the specific degraded instance using the Terminate link next to the instance on the dashboard, and then add a new instance (see ITR in the notes below).
- An application master, then promote an app slave to be app master.
- A database replica, then terminate the specific degraded instance using the Terminate link next to the instance on the dashboard, and then add a new instance (see ITR in the notes below).
- A database master, then promote a database replica to be database master.
- A utility instance, then terminate the specific degraded instance using the Terminate link next to the instance on the dashboard, and then add a new instance, using the most recent snapshot (see ITR in the notes below).
Notes:
- Be sure you know the difference between the "Environment Level Terminate" (ELT) and the "Instance Terminate reference" (ITR). The ELT is the "Terminate" button located at the top of the environment's page, above its name. The ITR is the "Terminate" link next to each instance ID.
- In some cases (and depending on your Support level), Engine Yard might initiate an instance takeover on your behalf. In that case, Engine Yard Support will contact you.
- The environment's settings for takeover preference and failed app master behavior can also affect the way app takeovers occur.
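The role-by-role guidance above for host backed root volume instances can be read as a simple lookup table. The sketch below is purely illustrative: the role keys (such as `app_slave`) and the function `recovery_action` are hypothetical labels chosen for this example, not identifiers used by the dashboard.

```python
# Recovery action per instance role, as described in the list above.
# ELT = Environment Level Terminate, ITR = per-instance Terminate link.
RECOVERY_ACTIONS = {
    "solo": "terminate and rebuild the whole environment (ELT)",
    "app_slave": "terminate the instance (ITR), then add a new instance",
    "app_master": "promote an app slave to be app master",
    "db_replica": "terminate the instance (ITR), then add a new instance",
    "db_master": "promote a database replica to be database master",
    "utility": "terminate the instance (ITR), then add a new instance "
               "using the most recent snapshot",
}


def recovery_action(role):
    """Return the recommended recovery action for an unresponsive instance."""
    try:
        return RECOVERY_ACTIONS[role]
    except KeyError:
        raise ValueError("unknown role: %r" % (role,))
```

Note that the master roles are never terminated directly here; the action is always to promote a slave or replica.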
About custom Chef recipes
Because instances or environments need to be terminated and rebuilt, it is important that any customizations that you have made to your environment have been carried out using custom Chef recipes and not just done manually. When Chef recipes have been used, a rebuild should be painless and can get you running again quickly, without any intervention required. Make sure that you are familiar with our custom Chef docs and that you are using recipes for your custom configurations.
If you need help, submit a ticket with Engine Yard Support.