This document describes PostgreSQL specific alerts; for general instance alerts such as load, or disk usage see About Alerts for Your Environment.
- The Connections Check
- Checkpoint Alerts
- Snaplock Warning
- Idle Transactions
About PostgreSQL Alerts
Engine Yard monitors the health of your PostgreSQL database using a combination of our own custom checks and Bucardo’s check_postgres scripts. Collectd or Nagios (depending on your stack and features) consumes the results of these checks and present them to the Engine Yard dashboard as alerts. The alerts we show you follow this format:
[SEVERITY] [environment-name] [originating-process]: [check-name] [severity] [additional information]
- severity -- The exit code of the check. It can have any of these values: OK, WARNING, FAILURE, CRITICAL, UNKNOWN, etc.
- environment-name -- The name of the environment that originated the alert.
- originating-process -- The process that generated the alert.
- check-name -- The name of the check that ran the process.
- additional information -- Extra information reported by the failing check.
Alert(CRITICAL) MyappProduction process-postgresql: POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 16204 seconds ago
This sample alert means that in the MyappProduction application, the postgres_checkpoint check raised an alert on the Postgresql process. The checkpoint check issued a severity of critical. The associated message is that the database has not had a checkpoint for about 4.5 hours.
We specify the severity of the monitoring checks based on the thresholds defined when your database was created. This section discusses the most important checks for PostgreSQL and their meanings.
The Connections check
The connections check verifies that the database process is functioning and connections can be established to it.
When do we warn you? We will test the connection to your database every 60 seconds, we will warn you when a connection to the database fails.
What to do if you see this check? Contact Engine Yard Support immediately because your site may be down.
This check determines how long since the last checkpoint has been run. A checkpoint is a point in the transaction log sequence at which all data files have been updated to reflect the information in the log and flushed to disk. If your system crashes, recovery will start from the last known checkpoint. The checkpoint check helps us confirm two things:
- Your database consistently takes forward the position in which recovery is started.
- In the case of replicas, your standby is keeping up with its master (because the activity the replica sees is what the master has sent it).
When do we warn you? We issue a WARNING severity when checkpoint delays range from 20 to 30 minutes. For delays that exceed 30 minutes, the severity of the alert is CRITICAL.
What to do if you see this check? Contact Engine Yard Support if you see a severity of CRITICAL or FAILURE.
The snaplock check alerts us of inconsistent snapshots. Before taking a database snapshot, we attempt to lock it to prevent writes and ensure a consistent snapshot.
When do we warn you? We will warn you when we have failed to obtain a lock before a database snapshot.
What to do if you see this check? Contact Engine Yard Support if the source of your snapshots shows this alert. For example if you have moved your snapshots to the replica and we cannot lock it before a snapshot, it may mean you have no snapshots that are consistent and usable for recovery.
This alert is an indication that one of the connections to your database is leaving a transaction in an open state for an extended period of time. These locks may cause problems for other application actions that try to change the records that are locked.
- Warn: 15 minutes
- Fail: 30 minutes
The typical cause of this is an end-user with a Postgres session from the psql client or a third party query tool that has been left open with uncommitted changes. This can also be caused by an application process that performs other actions within the transaction scope; such as a long callback in after_save instead of after_commit.
While in progress you can obtain additional information about the session with:
select pid, datname,client_addr, client_port, waiting,
to_char(now()-query_start, 'DD HH24:MI:SS MS') as query_time, query from pg_stat_activity where state = 'idle in transaction' order by query_time desc;
If you log into the host indicated by the alert or by the command above and search for the port with `lsof` you can identify the pid of the process holding the lock. You can then use `ps aux | grep <pid>` to determine the actual server process, in a result like this:
UID PID PPID C STIME TTY TIME CMD
deploy 12751 12750 79 01:12 ? 02:29:24 sidekiq 3.4.2 product_manager [2 of 3 busy]