Replace Degraded Instances

Read this page if you have received a notification from Engine Yard that one of your instances is degraded.

What is a degraded instance?

An instance is degraded when the host hardware that the instance is running on is detected as failing. Amazon then plans to shut down the host to carry out maintenance. When this maintenance happens, all instances on the host are terminated.

Engine Yard receives notifications of degraded instances from Amazon and forwards them to you so that you can take action.

When you get one of these notifications, it might state the time that the underlying hardware will be taken offline. If there is no time specified, replace the degraded instance as soon as you can. In some cases the hardware may fail sooner than expected and before the planned maintenance can be completed.

If you don’t replace the degraded instance in time, it might become frozen. For information about frozen instances, see Fix frozen or crashed instances.

 

To replace a degraded instance

EBS backed root volume instances

Most instances that run in environments on the Engine Yard stable-v2 stack or later, and that are 3rd Generation (C3/M3) or newer, have EBS backed root volumes, so the instance is not tied to a specific host. Such an instance can be restarted on new hardware. It keeps all of its EBS volumes, and therefore all of its configuration, so it does not need to be rebuilt.

To restart an instance, click the Restart button listed next to the instance on the environment page of the dashboard. A restart takes the instance offline for up to 10 minutes. If the instance does not have an EIP attached, a restart changes its public hostname/IP. If the instance runs under a VPC, its private hostname/IP address is kept across a restart; if it runs under EC2-Classic, the private hostname/IP address changes. If the Restart button has no apparent effect on the instance, the instance is probably a host backed root volume instance, and you should follow the instructions under Host backed root volume instances below.
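The address rules above can be written down as a small check. This is a minimal sketch for illustration only: the function and field names are hypothetical and are not part of any Engine Yard or AWS API. It assumes that an attached EIP keeps the public address stable, which the text above implies.

```python
from dataclasses import dataclass


@dataclass
class RestartEffect:
    """Which addresses of an EBS backed instance change on restart."""
    public_address_changes: bool
    private_address_changes: bool


def addresses_after_restart(has_eip: bool, in_vpc: bool) -> RestartEffect:
    # Hypothetical helper: encodes only the rules stated on this page.
    return RestartEffect(
        # Without an EIP, a restart alters the public hostname/IP.
        public_address_changes=not has_eip,
        # Under a VPC the private hostname/IP is kept; under EC2-Classic it is not.
        private_address_changes=not in_vpc,
    )


effect = addresses_after_restart(has_eip=False, in_vpc=True)
if effect.public_address_changes:
    print("Update any DNS records that point to the public hostname.")
if effect.private_address_changes:
    print("Check configuration that refers to the private address.")
```

Running the check before a restart is one way to decide whether DNS records or configuration files will need attention afterwards.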

Because a restart causes downtime and can change the hostname/IP of the instance, consider the role of the instance before taking any action:

If the degraded instance is:

  • A single-instance (or solo) environment, then a restart can be used. The application is offline and unreachable during the restart, so no maintenance page can be displayed; it is recommended that you restart during off-hours. If no EIP is attached, the public hostname changes, so any DNS records that point to it need updating.
  • An application slave, then a restart can be used. The instance is pulled from a load balancing pool when it is detected to be failing. If you want to avoid any chance of requests being directed to the instance as it shuts down, contact Support about the best method for the load balancer in use. Application slaves usually have no EIP attached, so the public IP will change on restart, as will the private IP if the instance does not run under a VPC. Monitor the environment page of the dashboard to make sure that Chef runs (automatically) after the restart, then check the instance logs to confirm that the instance is receiving traffic. If it is not, run an Apply.
  • An application master, then promote an app slave to be app master if the load balancer in use is the default HAProxy. If the environment uses an AWS xLB, the master can be restarted in the same way as a slave, provided that Application Master Takeovers are disabled first.
  • A database replica, then a restart can be used. Replication should restart automatically, but please watch for replication alerts in order to be safe.
  • A database master, then promote a database replica to be database master.
  • A utility instance, then a restart can be used. Utility instances usually have no EIP attached, so the public IP will change on restart, as will the private IP if the instance does not run under a VPC. Monitor the environment page of the dashboard to make sure that Chef runs (automatically) after the restart, then check that any application configuration files that refer to the utility instance have been updated. If they have not, run an Apply.

Host backed root volume instances

Instances on legacy Engine Yard stack versions and older AWS instance generations have root volumes that use the storage on the host hardware at AWS. They are therefore tied to that specific host and cannot be restarted on new hardware; they must be replaced instead.

If the degraded instance is:

  • A single-instance (or solo) environment, then terminate and rebuild the whole environment (see ELT in the notes below).
  • An application slave, then terminate the specific degraded instance by clicking the Terminate icon to the right of the instance on the dashboard, and then add a new instance (see ITR in the notes below).
  • An application master, then promote an app slave to be app master.
  • A database replica, then terminate the specific degraded instance by clicking the Terminate icon to the right of the instance on the dashboard, and then add a new instance (see ITR in the notes below).
  • A database master, then promote a database replica to be database master.
  • A utility instance, then terminate the specific degraded instance by clicking the Terminate icon to the right of the instance on the dashboard, and then add a new instance, using the most recent snapshot (see ITR in the notes below).
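
The two lists on this page (EBS backed and host backed) can be summarized as a single lookup from root volume type and instance role to the recommended action. The sketch below is illustrative only: the table, the key names ("ebs", "host", "app_slave" and so on) and the function are hypothetical, and the action text is condensed from the bullets above rather than taken from any Engine Yard tool.

```python
# Hypothetical lookup table condensed from the lists on this page.
# Keys are (root volume type, instance role); values are the recommended action.
RECOMMENDED_ACTION = {
    ("ebs", "solo"): "restart (off-hours)",
    ("ebs", "app_slave"): "restart",
    ("ebs", "app_master"): (
        "promote an app slave (HAProxy), or restart with "
        "Application Master Takeovers disabled (AWS xLB)"
    ),
    ("ebs", "db_replica"): "restart",
    ("ebs", "db_master"): "promote a database replica",
    ("ebs", "utility"): "restart",
    ("host", "solo"): "terminate and rebuild the environment",
    ("host", "app_slave"): "terminate the instance and add a new one",
    ("host", "app_master"): "promote an app slave",
    ("host", "db_replica"): "terminate the instance and add a new one",
    ("host", "db_master"): "promote a database replica",
    ("host", "utility"): (
        "terminate the instance and add a new one "
        "using the most recent snapshot"
    ),
}


def recommended_action(root_volume: str, role: str) -> str:
    """Return the action this page recommends for a degraded instance."""
    try:
        return RECOMMENDED_ACTION[(root_volume, role)]
    except KeyError:
        # Unknown combinations are rejected rather than guessed at.
        raise ValueError(
            f"unknown combination: {root_volume!r}, {role!r}"
        ) from None


print(recommended_action("host", "db_master"))
```

A table like this makes one point easy to see: master instances are generally handled by promotion rather than by restarting or terminating them directly, with the AWS xLB case as the exception.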

Notes:

  • Do not confuse the "Environment Level Terminate" (ELT) with the "Instance Terminate reference" (ITR). The ELT is the "Terminate" button located at the top of the environment's page, above its name. The ITR is the "Terminate" link next to each instance ID.
  • In some cases (and depending on your Support level), Engine Yard might initiate an instance takeover on your behalf. In that case, Engine Yard Support will contact you.
  • The environment's settings for takeover preference and failed app master behavior can also affect the way app takeovers occur.

About custom Chef recipes

Because instances or environments may need to be terminated and rebuilt, it is important that any customizations to your environment are made with custom Chef recipes and not applied manually. When Chef recipes are used, a rebuild should be painless and can get you running again quickly, without any manual intervention. Make sure that you are familiar with our custom Chef docs and that you are using recipes for your custom configurations.

More information

For more information about:

  • Frozen or crashed instances, see Fix Frozen or Crashed Instances
  • Rebuilding an environment, see Rebuild an Environment
  • Promoting an app slave, see Promote an Application Slave
  • App master takeover, see Application Master Takeover
  • Promoting a db replica, see Promote a Database Replica

If you need help, submit a ticket with Engine Yard Support.
