Παρασκευή, 23 Αυγούστου 2013

Backing up PostgreSQL to Hadoop


There are number ways to backup a PostgreSQL database, some are standard and some just demonstrate the power of open source and the things you can do if you put in use your creativity and imagination. At OmniTI, we use OmniPITR tool to manage WAL files and running backup on secondary databases instead of primary to reduce load during backup. In this post, I will discuss OmniPITR and Hadoop to accomplish something very neat, storing your backups into HDFS (Hadoop Distributed File System). 
You might be asking Why? HDFS is rock solid reliable, it has extremely low cost per byte and it can get 2Gbit per computer, scalable up to more than a TB per second. it is proven from internet giants for running a big variety of different use-cases.
Let's say that you have a 1TB database running, an uncompressed backup will need 1TB of reliable storage just to keep one copy. HFDS has the great advantage of using cheap hardware and being fault tolerant at the same time.
Imagine adding cheap SATA disks to the company workstations and keep the database backups there. Now, imagine explaining that to the next SAN salesman who will come to sell you storage. - FUN TIMES !

Let me give some details about tools used:

Hadoop:

It's an open-source software for reliable, scalable, distributed computing.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

OmniPITR :

OmniPITR, written and maintained by OmniTI , is a set of scripts to ease setting up WAL replication, and making hot backups from both Master and Slave systems.
This set of scripts has been written to make the operation seamless, secure and as light on resources-usage as possible.

Now, You must be wondering, why am I talking about it now? Depesz has recently added a new feature called dst-pipes (pipe to program), this feature is limited only by the users imagination, Depesz as an example of this feature did gpg backups in one operation! i am going to use the same feature to do something entirely different.

Setup :

Software Versions: PostgreSQL 9.3 (beta2) , Hadoop 1.2.1

For the setup , I have used a VM for Hadoop and PostgreSQL, the reason for me using just one VM is that actually there are no limitations in the topology. Hadoop and Postgres are 2 entirely different ecosystems so you have no limitations, the PostgreSQL might be running on a cascading multi slave environment, While Hadoop is running as I said, on desktops. Other than Postgres and Haoop, you have to git clone OmniPITR . Depesz has put effort on making this self sufficient so no extra perl modules will be needed. The only thing that I have added to a minimal (net install) Debian installation was rsync.

Configuration :

In postgresql.conf, enable archiving and on the archive_command put something like :
archive_command = '/usr/local/omnipitr/bin/omnipitr-archive -dl /mnt/archives/xlog -s /usr/local/omnipitr/state -db /mnt/backup/xlog -l /usr/local/omnipitr/log/omnipitr-^Y-^m-^d.log -v "%p"'

Now, you may have a different archive_command, i really suggest using OmniPITR, it really works well and it has been tested in very large installation. you can do cool stuff like sending the archives to a remote location gzip them, make combinations  of local gzipped copies and remote etc. You could read about OmniPITR features in documentation on github.

Let's make a script:

#!/usr/bin/env bash
hadoop dfs -put - /user/hduser/$1

This script will allow Hadoop to get a file from stdin and put it into a HDFS namespace

Bring up postgreSQL with above archive_command and Hadoop running on the node.

Let's try to backup straight into Hadoop:

/usr/local/omnipitr/omnipitr-backup-master -D $PGDATA -l dl-backup.log  -dp ~/test/hadoop_bck.sh -x /mnt/backup/xlog -v

-dp is the key switch here, it uses the new feature of OmniPITR : 'dst-pipes' it will run the script and will output the backup as input to the script we just made.


To verify the backup is actually there :

 $hadoop dfs -ls /user/hduser

Found 2 items
-rw-r--r--   1 vasilis supergroup  115322880 2013-08-22 23:18 /user/hduser/dblab1-data-2013-08-22.tar
-rw-r--r--   1 vasilis supergroup   33566720 2013-08-22 23:18 /user/hduser/dblab1-xlog-2013-08-22.tar

You might want to refer to Hadoop documentation for how to manipulate these files, organize them etc. but the backup is there :

vasilis@dblab1:~/test$ hadoop dfs -copyToLocal /user/hduser/dblab1-data-2013-08-22.tar

vasilis@dblab1:~/test$ ls -l ~/dblab1-data-2013-08-22.tar
-rw-r--r-- 1 vasilis vasilis 115322880 Aug 23 18:06 /home/vasilis/dblab1-data-2013-08-22.tar

To restore the cluster, you would need something like :

hadoop dfs -cat /user/hduser/dblab1-data-2013-08-22.tar  |tar xv

For our example, we had no need for compression, gzip'ing the output is supported by OmniPITR and you may also do that on the script that pushes the tar to Hdfs

This is just the tip of the iceberg, just a proof of concept , the tools are there and everything is stable.

Use your imagination , save your company some money and be the hero !!!

Thanks for reading !!