This document is new but should be mostly complete. If you encounter something it does not cover, please contact support so that we are aware and can assist you further.
Engine Yard receives notifications of degraded instances from Amazon and forwards them to you so that you can take action.
When you get one of these notifications, it may state when the underlying hardware will be taken offline. If no time is specified, replace the degraded instance as soon as you can. In some cases the hardware may fail sooner than expected, before the planned maintenance takes place.
Most instances in environments on the Engine Yard stable-v2 stack or later that are also 3rd-generation (C3/M3) or newer instance types have EBS-backed root volumes, meaning the instance is not tied to a specific host. These instances can be restarted on new hardware, keeping all of their EBS volumes and therefore all of their configuration, so the instance does not need to be rebuilt.
Instances on a legacy Engine Yard stack version or an older AWS instance generation have root volumes that use storage on the host hardware at AWS. They are therefore tied to that specific host and cannot be restarted on new hardware; they must be replaced instead.
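If you have AWS API access to the account, you can check which case applies by asking EC2 for the instance's root device type: an `ebs` root can be restarted, while an `instance-store` root must be replaced. The following is a minimal sketch, assuming a configured AWS CLI for the environment's account and region; the instance ID is a placeholder and the helper names are hypothetical, not part of any Engine Yard tooling.

```shell
#!/usr/bin/env bash
# Sketch: decide between Restart and Replace from the instance's root device type.
# Assumes the AWS CLI is configured for the right account/region (hypothetical setup).

# Print the root device type ("ebs" or "instance-store") for an instance ID.
root_device_type() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].RootDeviceType' \
    --output text
}

# Map a root device type to the action described in this article.
action_for_root_device() {
  case "$1" in
    ebs)            echo "restart" ;;   # EBS root: can move to new hardware
    instance-store) echo "replace" ;;   # host-local root: must be replaced
    *)              echo "unknown"; return 1 ;;
  esac
}
```

For example, `action_for_root_device "$(root_device_type i-0123456789abcdef0)"` prints `restart` or `replace`.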
Things to take into account
- An instance restart can take up to 10 minutes. When restarting a single-instance (solo) environment, the application is offline and unreachable for that time and no maintenance message can be displayed, so we recommend doing this off-hours.
- When an instance is restarted, its public and/or private IP can change. An environment-wide Chef run should start automatically when the instance comes back up in order to update all records; if you find any misconfiguration afterwards, an Apply will correct it.
- If the instance does not have an Elastic IP (EIP) attached, its public hostname and IP change. The environment needs an 'Apply' so that all instances are aware of the change, and a 'Deploy' if the instance provides services to others. Any DNS records pointing to the old public IP address need updating.
- If the instance runs in EC2-Classic rather than a VPC, its private hostname and IP change. Again, the environment needs an 'Apply' so that all instances are aware of the change, and a 'Deploy' if the instance provides services to others.
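Because the public and private addresses may change, it can help to note them before the restart and compare them afterwards. The sketch below is a minimal example, assuming it runs on the instance itself and that the EC2 instance metadata service answers plain (IMDSv1) requests; if your instances enforce IMDSv2, these requests also need a session token, which this sketch does not fetch. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: record an instance's IPs before a restart and compare them afterwards.
# Assumes IMDSv1 access to the EC2 metadata service from the instance itself.

METADATA_URL="http://169.254.169.254/latest/meta-data"

# Print "public private"; "none" stands in for an address that is not available.
current_ips() {
  local public private
  public=$(curl -sf --max-time 2 "$METADATA_URL/public-ipv4")
  private=$(curl -sf --max-time 2 "$METADATA_URL/local-ipv4")
  echo "${public:-none} ${private:-none}"
}

# Compare two "public private" pairs and print one line per changed address.
compare_ips() {
  local old_pub old_priv new_pub new_priv
  read -r old_pub old_priv <<< "$1"
  read -r new_pub new_priv <<< "$2"
  [ "$old_pub" != "$new_pub" ] && echo "public IP changed: $old_pub -> $new_pub"
  [ "$old_priv" != "$new_priv" ] && echo "private IP changed: $old_priv -> $new_priv"
  return 0
}
```

For example, run `current_ips > ~/ips.before` before restarting and `compare_ips "$(cat ~/ips.before)" "$(current_ips)"` afterwards. A changed public IP means DNS records need updating; any reported change means an Apply (and possibly a Deploy) is due.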
Restarting instances by role
Different factors should be considered for different instance roles, and it is also dependent on the reason for the Restart.
- For information regarding restarting currently running instances to avoid future maintenance or retirement please see this document.
- For information regarding restarted instances that are already in a failed state please see this document.
Known Concerns
MySQL DB Masters with Replicas
The replica references its master's hostname as part of its slave status information. When the master's IP changes, the replica cannot reconnect on its own. The simplest fix is to replace the replica database.
If you would rather keep the existing replica, you can do so by stopping replication and then issuing a change master statement that references the appropriate coordinates. To do this:
- Stop replication by running:
mysql -u root -e 'stop slave'
- Grab the current status information by running:
mysql -u root -e 'show slave status\G' | egrep 'Exec_Master_Log_Pos|Relay_Master_Log_File'
- Issue a change master statement, replacing the #{...} placeholders with those values and the new db_master hostname:
mysql -u root -e "change master to master_host='#{new_master_hostname}', master_log_file='#{Relay_Master_Log_File}', master_log_pos=#{Exec_Master_Log_Pos}; start slave;"
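The three steps above can be combined so the coordinates are read from `SHOW SLAVE STATUS\G` rather than copied by hand. This is a minimal sketch, assuming passwordless `mysql -u root` access on the replica as in the commands above; `build_change_master` and `repoint_replica` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: re-point a MySQL replica at a restarted master.
# Assumes passwordless `mysql -u root` access, as in the manual steps.

# Read `SHOW SLAVE STATUS\G` output on stdin and print the CHANGE MASTER
# statement for the given new master hostname.
build_change_master() {
  local new_master="$1" log_file="" log_pos="" key value
  while IFS=':' read -r key value; do
    key="${key// /}"      # strip the padding around the field name
    value="${value# }"    # strip the single space after the colon
    case "$key" in
      Relay_Master_Log_File) log_file="$value" ;;
      Exec_Master_Log_Pos)   log_pos="$value" ;;
    esac
  done
  if [ -z "$log_file" ] || [ -z "$log_pos" ]; then
    echo "could not read replication coordinates" >&2
    return 1
  fi
  printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s; START SLAVE;\n" \
    "$new_master" "$log_file" "$log_pos"
}

# Stop replication, capture the coordinates, and point the replica at the new host.
repoint_replica() {
  local new_master="$1" sql
  mysql -u root -e 'STOP SLAVE' || return 1
  sql=$(mysql -u root -e 'SHOW SLAVE STATUS\G' | build_change_master "$new_master") || return 1
  mysql -u root -e "$sql"
}
```

Run `repoint_replica <new_master_hostname>` on the replica, then check `SHOW SLAVE STATUS\G` to confirm that `Slave_IO_Running` and `Slave_SQL_Running` are both `Yes`.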
Postgres DBs
If the instance is a PostgreSQL database on an older stack that does not mount /tmp as a tmpfs, PostgreSQL may not start on its own because of socket file conflicts. If you encounter this, restart the Postgres server process with:
sudo -i /etc/init.d/postgresql-$(postgres -V | egrep -o '[0-9]{1,}\.[0-9]{1,}') restart
The restart will fail with an error message naming a socket file. Removing that socket file allows the server to start.
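PostgreSQL names its Unix socket and lock file after the port, so with the compiled-in defaults they are /tmp/.s.PGSQL.5432 and /tmp/.s.PGSQL.5432.lock. The sketch below lists whichever of those exist and removes them only when no server process is running. It assumes the default /tmp socket directory and port 5432; override SOCKET_DIR and PGPORT if your configuration differs, and check that the paths match the file named in the error message.

```shell
#!/usr/bin/env bash
# Sketch: find and clear leftover PostgreSQL socket/lock files after a restart.
# Assumes PostgreSQL's compiled-in defaults (socket dir /tmp, port 5432).

SOCKET_DIR="${SOCKET_DIR:-/tmp}"
PGPORT="${PGPORT:-5432}"

# Print the socket and lock file paths that exist for the configured port.
stale_socket_files() {
  local f
  for f in "$SOCKET_DIR/.s.PGSQL.$PGPORT" "$SOCKET_DIR/.s.PGSQL.$PGPORT.lock"; do
    [ -e "$f" ] && echo "$f"
  done
  return 0
}

# Remove the leftover files, but only if no server process is running.
# Older releases may list the server process as "postmaster".
remove_stale_socket_files() {
  if pgrep -x postgres >/dev/null || pgrep -x postmaster >/dev/null; then
    echo "a PostgreSQL server process is running; not removing anything" >&2
    return 1
  fi
  stale_socket_files | while read -r f; do
    echo "removing $f"
    sudo rm -f -- "$f"
  done
}
```

After `remove_stale_socket_files` succeeds, run the restart command again.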
Postgres DB Masters with Replicas
Postgres replicas reference their master's hostname in their recovery.conf. When the master's IP changes, the running server's configuration is no longer correct. Chef automatically updates the configuration file that Postgres reads, so once the Chef run completes, restarting Postgres on the replica re-establishes replication:
sudo -i /etc/init.d/postgresql-$(postgres -V | egrep -o '[0-9]{1,}\.[0-9]{1,}') restart
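Since the restart only helps once Chef has written the new master's hostname, you may want to confirm that first. The sketch below reads the `host=` value from `primary_conninfo` in recovery.conf and restarts the replica only if it matches the expected master. RECOVERY_CONF is a placeholder for the recovery.conf in your replica's data directory, and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: restart a Postgres replica only after recovery.conf names the new master.
# RECOVERY_CONF is a hypothetical path; set it to your replica's recovery.conf.

RECOVERY_CONF="${RECOVERY_CONF:-/path/to/data/recovery.conf}"

# Print the host= value from the primary_conninfo line of a recovery.conf file.
primary_host() {
  sed -n "s/^[[:space:]]*primary_conninfo[[:space:]]*=.*host=\([^[:space:]']*\).*/\1/p" "$1"
}

# Print the major.minor version from `postgres -V`, as the restart command does.
pg_version() {
  postgres -V | grep -Eo '[0-9]+\.[0-9]+' | head -n 1
}

# Restart only if recovery.conf already points at the expected master host.
restart_if_updated() {
  local expected="$1" actual
  actual=$(primary_host "$RECOVERY_CONF")
  if [ "$actual" != "$expected" ]; then
    echo "recovery.conf points at '$actual', not '$expected'; wait for Chef to finish" >&2
    return 1
  fi
  sudo -i "/etc/init.d/postgresql-$(pg_version)" restart
}
```

For example, `restart_if_updated <new_master_hostname>` on the replica refuses to restart until the Chef run has updated the file.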
Mixed Legacy/Non-Legacy Restarts
We have seen at least one case where, after restarting an environment that mixed custom-VPC and non-VPC hosts, the non-VPC hosts came back without ClassicLink enabled. You can correct this through the AWS console by re-linking the non-VPC host(s) to the correct VPC. If you need assistance with this, please open a ticket with our support team and escalate to our VNOC engineer in IRC.
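If you have AWS API access to the account, the same re-linking can be done with the AWS CLI instead of the console. This is a sketch only, assuming a configured CLI; the instance, VPC, and security group IDs are placeholders, and linking requires at least one security group from the target VPC. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check and restore ClassicLink for a non-VPC instance via the AWS CLI.
# Instance, VPC, and security group IDs are placeholders you supply.

# Print the VPC ID the instance is ClassicLinked to ("None" if it is not linked).
linked_vpc() {
  aws ec2 describe-classic-link-instances \
    --instance-ids "$1" \
    --query 'Instances[0].VpcId' \
    --output text
}

# Link the instance to the VPC unless it is already linked to it.
ensure_classic_link() {
  local instance_id="$1" vpc_id="$2" group_id="$3" current
  current=$(linked_vpc "$instance_id")
  if [ "$current" = "$vpc_id" ]; then
    echo "$instance_id already linked to $vpc_id"
    return 0
  fi
  aws ec2 attach-classic-link-vpc \
    --instance-id "$instance_id" \
    --vpc-id "$vpc_id" \
    --groups "$group_id"
}
```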
How to restart an instance
You will find a Restart link next to each instance when viewing the environment.