Backup strategies on Postgres

If you are running any database in production, you should always have backups. Servers crash, disks fail, data gets corrupted, and people make mistakes; many kinds of failure can hit a production instance, which is why backups are so important.

In this post I’m going to discuss the backup options available in PostgreSQL and the ways one can configure backups for a production database. I’ll also discuss which kind of backup is more useful in which situation.

First, let’s look at the various backup options available in Postgres:

  1. Logical backups
  2. Physical backups
  3. Continuous backups

Let’s look at each of these below.

Logical backups

Logical backups are dumps of the SQL statements that describe the state of the database, obtained with a tool like pg_dump. These dump files are plain SQL, so they can be imported into another Postgres instance with psql. It is usually a good idea to take a logical backup before altering a schema or migrating a database.

$ pg_dump -U admin -h db.example.com mydatabase > dump.sql

The file dump.sql is your backup. To restore it into another database, run:

$ psql -U admin -h db2.example.com mydatabase < dump.sql

-U takes the Postgres username; a non-default port can be specified with -p.
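For example, if the server listens on a non-default port (5433 here is just an assumption for illustration), the dump might be taken like this:

$ pg_dump -U admin -h db.example.com -p 5433 mydatabase > dump.sql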

Physical backups

Physical backups are backups taken at the filesystem level. They can be snapshots of the filesystem itself, or simply copies of the Postgres data directory. The entire state of the database instance is stored in this data directory, which is usually /var/lib/postgresql/<VERSION>/data on *nix machines.

You can use rsync to periodically take a backup of this data directory and store it elsewhere.

As an example:

$ tar czf mydb-pgdata-21022017-1803.tar.gz  /var/lib/postgresql/9.6/data
$ rsync --progress --partial --recursive mydb-pgdata-21022017-1803.tar.gz \
    backup@backup-server.example.com:/vol/pgdata_backups/

We first make a compressed (gzipped) tarball of the data directory, and then use rsync to copy it to our backup server.
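To restore from such a backup, a rough sketch (assuming the service is called postgresql and that tar stored the paths relative to /, as GNU tar does by default) would be to stop the server, move the current data directory out of the way, unpack the tarball, and start the server again:

$ sudo systemctl stop postgresql
$ sudo mv /var/lib/postgresql/9.6/data /var/lib/postgresql/9.6/data.old
$ sudo tar xzf mydb-pgdata-21022017-1803.tar.gz -C /    # archive paths are relative, so extract from /
$ sudo systemctl start postgresql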

Continuous backups

Continuous backups are an advanced feature of Postgres. Postgres is designed so that it first writes every transaction to a file on disk, called the write-ahead log (or WAL), and waits for that data to actually reach the disk; only then does it apply the transaction to the actual database objects. This way, even if a Postgres server crashes in the middle of a transaction, it never ends up in an inconsistent state: when it recovers, it can replay the WAL to arrive at the state it was in when it crashed.

If you think about it, this amazing WAL design also gives us an uber-amazing backup system. These WAL files can simply be copied or streamed over to another machine (a minimal configuration sketch follows the list below), where:

  1. they can be kept as a backup, which enables us to recover a Postgres database to any point in time: just ask Postgres to replay the WAL up to whichever point you fancy. This is also called point-in-time recovery (PITR) in the DB world.
  2. they can be replayed on a secondary Postgres instance, and that instance can then be used as a slave, a replicated instance, a backup, or even a secondary read-only instance.
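As a taste of how the copying or streaming of WAL is typically wired up, here is a minimal sketch of the archiving settings in postgresql.conf. The archive directory /mnt/wal_archive is just an assumption for illustration; in practice you would ship the files to a remote or otherwise durable location:

wal_level = replica                            # use 'archive' or 'hot_standby' on 9.5 and older
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'  # %p is the WAL file's path, %f its file name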

Using the WAL, a lot of interesting things can be done: streaming replication, hot standby failover instances, read-only slaves, and a few more. Maybe I will write about these interesting bits in a future post.

Configuring continuous backups is a little more involved and probably deserves its own post, so I will not cover it here; you can look it up on the internet. But I can suggest WAL-E, which is a fantastic tool that makes setting up continuous backups very easy. Once you install and configure WAL-E, it can stream your backup files to S3, Google Cloud Storage, Azure, etc.
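As a rough sketch (the envdir path and data directory are assumptions, and the exact invocation depends on how you set up WAL-E), WAL segments are shipped through archive_command in postgresql.conf, while base backups are pushed explicitly, for example nightly from cron:

archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

$ envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/data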

But which backup to use where?

Now, the natural next question is which backup method to use. Should someone use all the possible methods, or is sticking to one method good enough?

It can be confusing or daunting to look at so many options, so let me nudge you in the right direction. But don’t just take my word for it: do your own research, think through your problem, and come to your own conclusions.

In almost all situations continuous backup makes sense. It gives you a streaming, continuous backup of your production database, and it also gives you point-in-time recovery. PITR can be very valuable because it lets you specify a timestamp and recover to it. Setting up the backup and recovery mechanism for this method is a little more complicated, as there can be a few edge cases, especially around the process you adopt for recovery. It therefore requires a bit of planning and more effort overall, but it probably also pays off the best in the long run.
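To give an idea of what a recovery looks like on Postgres 9.6, here is a minimal sketch: restore a base backup into the data directory, then drop a recovery.conf next to it telling Postgres where to fetch archived WAL files from and how far to replay (the archive path and timestamp are assumptions):

restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2017-02-21 18:03:00'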

Filesystem-level backups are error-prone: there is no guarantee that you get a consistent snapshot. What if the database decides to write new data at the exact moment you are snapshotting the underlying filesystem? There is no way for you to control that. Also, by copying files around ourselves we are not leveraging what the database can do for us. For most cases, then, filesystem backups can be skipped. If you are really paranoid and want redundant backups, sure, go ahead and set up a cronjob to rsync your PGDATA directory somewhere safe.
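If you do go that route, a sketch of such a cronjob could look like the following; /usr/local/bin/pgdata-backup.sh is a hypothetical script wrapping the tar and rsync commands shown earlier:

# run the data-directory backup every night at 03:00
0 3 * * * /usr/local/bin/pgdata-backup.sh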

Logical backups give you a logical snapshot of your database. They are tremendously useful when you are migrating a database: it’s always a good idea to take a logical backup before running a schema migration. They are also useful for debugging; you can compare two different dumps to track down data discrepancies. It’s not a bad idea to add a cronjob that takes a pg_dump periodically, alongside your continuous backup configuration. That way you have a logical snapshot of your database and redundant backups. Win-win situation!
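A sketch of such a cronjob, with assumed paths, user and schedule (note that % has to be escaped as \% inside a crontab entry):

# nightly logical dump at 02:00
0 2 * * * pg_dump -U admin -h db.example.com mydatabase > /var/backups/mydatabase-$(date +\%F).sql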

On a closing note, another important aspect of the backup and recovery process is to always validate that backups are actually happening and actually working, i.e. that the database can be recovered from the existing backups. In the light of the recent GitLab incident it has become even more apparent that systems fail and assumptions don’t always hold. Do not rely on the fact that you set the system up correctly two years ago. As a DBA or admin it is not enough to set up a backup system; one should regularly monitor the backups and also perform periodic drills, taking one of the latest backups and trying to recover from it.
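Such a drill can be as simple as restoring the latest logical dump into a scratch database and sanity-checking a few tables; mydatabase_drill and important_table here are hypothetical names:

$ createdb -U admin -h db2.example.com mydatabase_drill
$ psql -U admin -h db2.example.com mydatabase_drill < dump.sql
$ psql -U admin -h db2.example.com mydatabase_drill -c 'SELECT count(*) FROM important_table;'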

Backup systems are generally long-running systems, and people usually forget about them right after configuring them because they don’t need to touch them for a long time. But eventually the day comes when they have to recover from the backups, and then they can only hope that the backups are safe and somehow magically working, even after software failures, hardware failures, data corruption and whatnot.

Hence, knowing the status of your backups can make a world of difference to the application, because after all, the most important part of any application is the database.

tags:   postgres , backups , sysadmin