Database Backup Process Failures Reference

Overview

Engine Yard Support monitors database backup activity to ensure that all tasks complete, so that data can always be restored to the most up-to-date state if the customer's instance fails. Backups are typically scheduled to run at fixed intervals defined by business needs (e.g. hourly, bi-hourly, twice a day) and are controlled by cron jobs at the operating system level.

Database backups write their data export files to the /mnt volume, which must therefore:

  • have plenty of disk space available
  • be cleaned up frequently, to remove unnecessary and outdated files
  • be copied to an external location for additional protection against loss


This reference explains backup process failures in the following situations:

  1. No space left on device
  2. Simultaneous backup executions are not allowed
  3. Database structures changed during backup


No space left on device

Alert created: Tue Mar 13 2018 13:37:03 GMT+0200
Alert type: process-dbbackup
Alert severity: FAILURE
Alert message: 'No space left on device @ io_write - /mnt/tmp/XXXXXX.2018-03-13T00-44-02.dump (Errno::ENOSPC) 
Details at /var/log/eybackup.log.'

Checking disk usage shows the current allocation and confirms that the /mnt volume has no space available:

root@Sandbox - ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G  6.0G  7.9G  44% /
tmpfs           798M  360K  798M   1% /run
dev              10M     0   10M   0% /dev
shm             3.9G  4.0K  3.9G   1% /dev/shm
cgroup_root      10M     0   10M   0% /sys/fs/cgroup
/dev/xvdb        25G   25G     0 100% /mnt ===> VOLUME TOTALLY ALLOCATED
/dev/xvdn        15G  4.2G  9.8G  30% /data
172.31.2.88:/   8.0E  1.0M  8.0E   1% /efs
/dev/xvdm       788G   13G  735G   2% /db

Databases can produce large backup files, so the volume can fill up quickly. Once the available space is exhausted, the backup cannot finish, and neither can any other task that writes to the same volume, such as logs, dumps and core files. The backup process then reports a FAILURE and aborts. To avoid this error, keep enough free disk space on the volume.

Actions:

  • Check frequently for old and temporary files that can be removed from the system. Make sure they are backed up elsewhere first, to prevent any issue caused by their absence.
  • Monitor disk space consumption, either with direct notifications (such as a daily email reporting allocation and availability) or with cron scripts that send an alert whenever a threshold is reached (e.g. less than 5 GB free).
  • Add disk space to the volume, so that important processes are not terminated abruptly.
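
The first two actions can be sketched as a small script suitable for running from cron. This is a minimal illustration, not part of the Engine Yard tooling: the function names, the 5 GiB threshold and the 7-day age limit are assumptions chosen for the example. It only reports candidates and never deletes anything.

```python
import os
import shutil
import time

GIB = 1024 ** 3


def free_space_alert(path, min_free_bytes, usage=None):
    """Return a warning string if free space on `path` is below the threshold, else None.

    `usage` can be injected for testing; by default the real volume is queried.
    """
    if usage is None:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.free < min_free_bytes:
        return "%s: only %.1f GiB free (threshold %.1f GiB)" % (
            path, usage.free / GIB, min_free_bytes / GIB)
    return None


def stale_files(directory, max_age_days, now=None):
    """List regular files under `directory` not modified in the last `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    found = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            full = os.path.join(root, name)
            try:
                if os.path.isfile(full) and os.path.getmtime(full) < cutoff:
                    found.append(full)
            except OSError:
                # The file disappeared or is unreadable; skip it.
                continue
    return sorted(found)
```

For example, free_space_alert("/mnt", 5 * GIB) returns a message when less than 5 GiB is free, and stale_files("/mnt/tmp", 7) lists files untouched for a week. Review the list and confirm the files are backed up elsewhere before removing any of them.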

Simultaneous backup executions are not allowed

Alert created:  2018-03-25T01:02:02+00:00
Alert type: process-dbbackup xxxxx
Alert severity: FAILURE
Alert message:
'Unable to backup xxxxx: already a backup in progress. Use --allow_concurrent to enable concurrent backup runs.
Details at /var/log/eybackup.log.'

Instance role: db_master
Instance name: NA
Public hostname: ec2-xx-xxx-xx-xx.eu-west-2.compute.amazonaws.com

The logs show that a new backup process tried to start before the previous one had finished. Overlapping backups should usually be avoided in order to:

  • Prevent concurrent disk writes from overloading the volume
  • Avoid unnecessary locks at the database level, which can reduce application performance
  • Eliminate ambiguity about which backup version is the correct one to restore after a loss


Actions:

  • Check whether other processes are competing to write to the disk volume, slowing down each backup and making it finish later than expected (e.g. many simultaneous processes writing a large number of entries to different files).
  • Validate that the interval between backups is adequate, so that each backup has enough time to finish before the next one starts and the consistency of the data in the files is not put at risk.
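
The interval check can be sketched as follows. This is a hedged illustration: the function names and the 10-minute safety margin are assumptions, and the run start and end times would have to be collected by hand or extracted from /var/log/eybackup.log, whose format is not shown here.

```python
from datetime import datetime, timedelta


def find_overlaps(runs):
    """Return (earlier, later) pairs where a run started before an earlier run ended.

    `runs` is an iterable of (start, end) datetime pairs, one per backup run.
    Each run is compared with the earlier run that finishes last.
    """
    ordered = sorted(runs)
    if not ordered:
        return []
    overlaps = []
    latest = ordered[0]
    for current in ordered[1:]:
        if current[0] < latest[1]:
            overlaps.append((latest, current))
        if current[1] > latest[1]:
            latest = current
    return overlaps


def interval_is_adequate(durations, interval, margin=timedelta(minutes=10)):
    """True if the longest observed backup plus a safety margin fits in the interval."""
    return max(durations) + margin <= interval
```

For example, if the longest recent backup took 70 minutes and the schedule is hourly, interval_is_adequate([timedelta(minutes=70)], timedelta(hours=1)) returns False, which suggests widening the interval or investigating why the backup is slow.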

Database structures changed during backup

Alert created:  2018-05-02T10:05:51+00:00
Alert type: process-dbbackup XXXXXXXXXX
Alert severity: FAILURE
Alert message:
'app3 backup failed! The error returned was: pid 20491 exit 1: gpg: using subkey 9E586AE6
instead of primary key 3F53ED27\ngpg: using PGP trust model\ngpg:
This key belongs to us\ngpg: reading from `[stdi'

Instance role: db_master
Instance name: NA
Public hostname: ec2-xx-xxx-xxx-xxx.us-east-2.compute.amazonaws.com

This error occurs when the structure (metadata) of database objects changes while the backup is running, for example when an object is being recreated or is having columns added.

Actions:

  • Confirm that the database objects were in fact undergoing planned maintenance, in which case the failure can be safely disregarded. Otherwise, the error may be related to other issues that need to be investigated.
  • Reschedule the backup to run after any maintenance, to avoid similar errors in the future.
