Friday, 23 August 2013

Backing up PostgreSQL to Hadoop

There are a number of ways to back up a PostgreSQL database; some are standard and some just demonstrate the power of open source and the things you can do if you put your creativity and imagination to use. At OmniTI, we use the OmniPITR tool to manage WAL files and to run backups on secondary databases instead of the primary, which reduces load during backup. In this post, I will combine OmniPITR and Hadoop to accomplish something very neat: storing your backups in HDFS (Hadoop Distributed File System).
You might be asking: why? HDFS is rock-solid reliable, it has an extremely low cost per byte, and it can get 2Gbit per computer, scalable up to more than a TB per second. It has been proven by internet giants running a wide variety of use cases.
Let's say that you have a 1TB database running; an uncompressed backup will need 1TB of reliable storage just to keep one copy. HDFS has the great advantage of using cheap hardware and being fault tolerant at the same time.
Imagine adding cheap SATA disks to the company workstations and keeping the database backups there. Now imagine explaining that to the next SAN salesman who comes to sell you storage. FUN TIMES!

Let me give some details about tools used:


Hadoop :

It's open-source software for reliable, scalable, distributed computing.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

OmniPITR :

OmniPITR, written and maintained by OmniTI , is a set of scripts to ease setting up WAL replication, and making hot backups from both Master and Slave systems.
This set of scripts has been written to make the operation seamless, secure and as light on resources-usage as possible.

Now, you must be wondering why I am talking about it now. Depesz has recently added a new feature called dst-pipes (pipe to program). This feature is limited only by the user's imagination; as an example, Depesz used it to make gpg-encrypted backups in one operation! I am going to use the same feature to do something entirely different.

Setup :

Software Versions: PostgreSQL 9.3 (beta2) , Hadoop 1.2.1

For the setup, I used a single VM for both Hadoop and PostgreSQL. I could get away with just one VM because there are really no limitations in the topology: Hadoop and Postgres are 2 entirely different ecosystems, so PostgreSQL might be running in a cascading multi-slave environment while Hadoop is running, as I said, on desktops. Other than Postgres and Hadoop, you have to git clone OmniPITR. Depesz has put effort into making it self-sufficient, so no extra Perl modules are needed. The only thing that I added to a minimal (net install) Debian installation was rsync.

Configuration :

In postgresql.conf, enable archiving and on the archive_command put something like :
archive_command = '/usr/local/omnipitr/bin/omnipitr-archive -dl /mnt/archives/xlog -s /usr/local/omnipitr/state -db /mnt/backup/xlog -l /usr/local/omnipitr/log/omnipitr-^Y-^m-^d.log -v "%p"'

Now, you may have a different archive_command, but I really suggest using OmniPITR; it works really well and it has been tested in very large installations. You can do cool stuff like sending the archives to a remote location, gzipping them, keeping combinations of local gzipped copies and remote ones, etc. You can read about OmniPITR's features in the documentation on GitHub.

Let's make a script:

#!/usr/bin/env bash
hadoop dfs -put - /user/hduser/$1

This script lets Hadoop read a file from stdin and put it into an HDFS namespace.

Bring up PostgreSQL with the above archive_command, with Hadoop running on the node.

Let's try to backup straight into Hadoop:

/usr/local/omnipitr/omnipitr-backup-master -D $PGDATA -l dl-backup.log  -dp ~/test/hadoop_bck.sh -x /mnt/backup/xlog -v

-dp is the key switch here. It uses the new OmniPITR feature 'dst-pipes': it runs the script and feeds the backup to it as input.

To verify the backup is actually there :

 $hadoop dfs -ls /user/hduser

Found 2 items
-rw-r--r--   1 vasilis supergroup  115322880 2013-08-22 23:18 /user/hduser/dblab1-data-2013-08-22.tar
-rw-r--r--   1 vasilis supergroup   33566720 2013-08-22 23:18 /user/hduser/dblab1-xlog-2013-08-22.tar

You might want to refer to Hadoop documentation for how to manipulate these files, organize them etc. but the backup is there :

vasilis@dblab1:~/test$ hadoop dfs -copyToLocal /user/hduser/dblab1-data-2013-08-22.tar

vasilis@dblab1:~/test$ ls -l ~/dblab1-data-2013-08-22.tar
-rw-r--r-- 1 vasilis vasilis 115322880 Aug 23 18:06 /home/vasilis/dblab1-data-2013-08-22.tar

To restore the cluster, you would need something like :

hadoop dfs -cat /user/hduser/dblab1-data-2013-08-22.tar | tar xvf -

For our example we had no need for compression. Gzipping the output is supported by OmniPITR, and you may also do that in the script that pushes the tar to HDFS.
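As a rough illustration of doing the compression inside the pipe script, here is a minimal Python sketch. It assumes the same "hadoop dfs -put -" call as the bash script above; the function names, the .gz suffix and the default target directory are my own choices for the example, not part of OmniPITR or Hadoop.

```python
import gzip
import shutil
import subprocess


def push_gzipped(stream, command):
    """Gzip everything read from `stream` and feed it to `command` on stdin."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    # GzipFile writes compressed bytes straight into the child's stdin,
    # so the backup is never stored uncompressed on local disk.
    with gzip.GzipFile(fileobj=proc.stdin, mode="wb") as gz:
        shutil.copyfileobj(stream, gz, 1024 * 1024)
    proc.stdin.close()
    return proc.wait()


def hdfs_put_command(name, target_dir="/user/hduser"):
    # Hypothetical helper: builds the same command the bash script runs,
    # with ".gz" appended to the file name.
    return ["hadoop", "dfs", "-put", "-", "%s/%s.gz" % (target_dir, name)]
```

Used as a dst-pipes script, it would be called along the lines of push_gzipped(sys.stdin.buffer, hdfs_put_command(sys.argv[1])).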

This is just the tip of the iceberg, just a proof of concept, but the tools are there and everything is stable.

Use your imagination , save your company some money and be the hero !!!

Thanks for reading !!

Friday, 19 July 2013

Riak , testing backups or duplicating a cluster

This week I was asked to test a Riak backup for a client. Riak is really easy in terms of administration; a backup means making a tarball of the following:
  • Bitcask data: /var/lib/riak/bitcask
  • LevelDB data: /var/lib/riak/leveldb
  • Ring data: /var/lib/riak/ring
  • Configuration: /etc/riak

That sounds simple enough and, to be honest, it is. But all backups have to be tested, and testing can be as easy as a plain "tar tvfz" or as involved as bringing up a second cluster with the backup data. Some logical questions come to mind here; I will point out a few, but there are probably more.
  1. Does a 2 node cluster mean that you need 2 nodes to restore?
  2. If I bring the cluster up, who says that it won't join my primary cluster?
  3. How can I test that the backup worked?
Answers: 1. Yes and no. 2. I do, and I will show you how. 3. The same way you test your Riak installation: by running queries.

Now, I'm not an expert in Riak, I only touched it 3 days ago, but for this specific scenario I am pretty confident.

Let's say that you have a 2 node cluster, which means 2 machines, so the database lives in 2 different places, right? Right, but you can bring 2 nodes up on one server. To do that you have to compile Riak yourself; otherwise you will have one global configuration file and you won't be able to run 2 nodes on the same host.

To compile Riak you will need the following packages with their dependencies:
g++, make, gcc, erlang, erlang-dev, git

To compile, untar the downloaded source and type "make rel". This gives you a directory called rel, and inside it a directory called riak. You may copy that to riak1, riak2, riak-test, riak-dev etc.; each copy has its own executables, config files and bitcask directories, which means you can bring up as many Riak nodes as your hardware allows.
Which brings us to the next question. Let's assume that you have a 2 node, single host cluster ready for startup. Edit the 2 config files (app.config, vm.args) and fix the IPs and the ports for each node, BUT pay attention to set the -setcookie parameter in vm.args right. By right I mean a name different from production and common between the 2 test nodes. The setcookie value decides whether a node is allowed to connect to another cluster, so pay attention there.
Now untar the backups into the new nodes (/root/riak-1.4.0/rel/riak1/data/bitcask for node1 and /root/riak-1.4.0/rel/riak2/data/bitcask for node2), start Riak, and then stop it again. We have to reip the nodes; I'm not sure about this, but I think the data contains references to the original node name.
To change that you must run: riak-admin reip <old_nodename> <new_nodename>. You started Riak first so that it creates the ring files; reip changes those files along with everything else needed. Now you may start your nodes, join the 2 together and play around (if you don't know how, refer to the Riak site; their documentation is very good).

To test Riak, use curl. In my tests I inserted data into both nodes of my wannabe production cluster and, after the restore, queried for those known documents. The Riak documentation has quite a few examples; if I put them here they would just be copy-paste from there.
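The "query for known documents" check can also be scripted. Below is a small Python sketch that fetches keys over Riak's HTTP interface (/buckets/<bucket>/keys/<key>) and reports the ones that do not match. The base URL, bucket name, keys and expected values are all placeholders that you would take from your own test data; the function names are mine.

```python
import urllib.error
import urllib.request


def fetch_object(base_url, bucket, key):
    """Return the stored value as bytes, or None if the key does not exist."""
    url = "%s/buckets/%s/keys/%s" % (base_url.rstrip("/"), bucket, key)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def verify_restore(base_url, bucket, expected):
    """Compare known documents with the restored node; return the bad keys."""
    return [key for key, value in sorted(expected.items())
            if fetch_object(base_url, bucket, key) != value]
```

An empty list from verify_restore means every known document came back from the restored node exactly as it was written.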

I am by no means a Riak expert, so if I did something wrong feel free to comment.

Thanks for reading

Friday, 12 July 2013

Implementing data warehousing with Hadoop, HDFS and Hive (Part 1)

About one month ago a big part of my life changed: I moved to London to work with a US-based company that I have known for quite a while for its work and for the great IT personalities that work there, OmniTI Consulting. Until then I didn't know much about BigData; actually I had no idea what this BigData gig was and why everyone was talking about it. I tend to be traditional, and these distributed filesystems, NoSQL databases and MapReduce functions seemed a bit strange and really unfamiliar to me. I kept saying that at some point RDBMSs would incorporate this functionality and these things would slowly die, something like what happened with ODBMSs, something like JSON becoming a PostgreSQL datatype, etc.
Well, after reading a bit about Hadoop I changed my mind. THIS THING IS AWESOME. Those who share my ideas about RDBMSs, bear with me; I will explain why in a second and, as I always do, with a real scenario that I had a long time ago when I was working at a telco.

Let me tell you some things about telcos: they have a HIGH need for data warehousing and realtime reporting, and they have A LOT of data! Each call makes at least one record, called a CDR (call detail record), saying basically who called whom, when, how long they talked, etc. The result is a lot of database traffic. Back in my day we had something like 1-2m rows daily, which means about 500m rows per year. Now it doesn't sound like much, but it was 2002-2003 and the hard disks that I was buying for the servers were 18GB back then.

Now imagine that you are the marketing guy and you want to make tariffs (call cost per destination). You would ask questions like: how many calls went from this prefix to that prefix? At which times? How long did they talk on average? How is this distributed over a week? And so, so, so many more questions... My approach back then was to deploy a statistical server. Sounds awesome, right? IT'S NOT. That server took part of the data and answered specific questions, populating tables and of course reports, graphs, pies, charts and all the goods that marketing guys want to watch on a daily basis. So, what happened when they wanted a report for a timeframe that wasn't there? That's easy: I fed the reporting server with data from the requested timeframe and the sql/bash scripts did their thing. What happened if they wanted a new report? I CRIED. But men don't cry.
DBAs cry!
I evaluated products like IQServer and Business Objects: good products but limited, fast but a pain in the ass to maintain, and also expensive.. VERY expensive. So I decided to stay with my PostgreSQL, buy 3-4 more servers and distribute the job. Basically, if I had 1 month's rows (CDRs) to process, I'd give 10 days of CDRs to each server and do the processing in 1/3 of the time. It was a process that could go south so easily: a slight mistake meant that everything had to run again, and making a mistake wasn't hard at all.

And that's the end of the story of what I did back in 2002, when I was young and tried to be awesome. Now people have 1000 times the data that I had back then, and they are answering questions just as complex as the ones I had to answer in much, much less time. How?
The answer is MapReduce.

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It can be written in Python, Java, Ruby etc. and basically consists of 2 functions, the map and the reduce.
Map prepares the data (filters and sorts them) and reduce sums them up. It's that simple! In order to do that in a clustered environment you need nodes that play along on a common filesystem; in the Hadoop world that's HDFS, the Hadoop Distributed File System. So basically you can take raw data, db records or PostgreSQL logs and parse them through MapReduce functions to get answers to questions. Here's an example :

a map function :

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

a reduce function:
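#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from stdin and must already be sorted by word
for line in sys.stdin:
    line = line.strip()

    # parse the "word<TAB>count" pair emitted by the mapper
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently skip this line
        continue

    if current_word == word:
        current_count += count
    else:
        # the word changed: emit the total for the previous one
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to emit the last word
if current_word and current_word == word:
    print('%s\t%s' % (current_word, current_count))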

how it works:

echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | \
sort -k1,1 | /home/hduser/reducer.py
bar     1
foo     3
labs    1
quux    2 

Now this code can run with no change in Hadoop, with many servers doing this in parallel on many, MANY data... I can only imagine what I could have done if I had had this thing back in the day... To be continued with a (hopefully) useful example and a benchmark on one and two node Hadoop clusters. The lab is ready; I just need to make up an example and code the MapReduce.
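To connect this with the telco questions from earlier, here is a rough single-process Python sketch of the same map, sort, reduce flow applied to CDRs. The record layout ("caller,callee,seconds"), the 3-digit prefix length and the function names are assumptions made up for the illustration; real CDRs look different, and on Hadoop the sort and grouping step is done by the framework.

```python
from itertools import groupby
from operator import itemgetter


def map_cdr(line, prefix_len=3):
    """Map step: turn one CDR line into ((from_prefix, to_prefix), seconds)."""
    caller, callee, seconds = line.strip().split(",")
    return (caller[:prefix_len], callee[:prefix_len]), int(seconds)


def reduce_calls(pairs):
    """Reduce step: per prefix pair, count the calls and average the duration."""
    results = {}
    ordered = sorted(pairs, key=itemgetter(0))  # stands in for Hadoop's sort
    for key, group in groupby(ordered, key=itemgetter(0)):
        durations = [seconds for _, seconds in group]
        results[key] = (len(durations), sum(durations) / float(len(durations)))
    return results
```

Feeding it reduce_calls(map(map_cdr, lines)) answers "how many calls from this prefix to that prefix, and how long on average" in one pass.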

Thanks for reading
- Vasilis


Setting shared_buffers effectively.

One of the main performance parameters in PostgreSQL is shared_buffers, probably the most important one. There are guidelines and rules of thumb that say just set it to 20-30% of your machine's total memory.
Don't get me wrong, these rules are generally good and they are a perfect starting point, but there are reasons why you should tune your shared buffers in more detail:

a. 30% might not be enough, and you will never know unless you know exactly how to set this parameter.
b. 30% might be too much, and you spend resources in vain.
c. You want to be awesome and tune every bit of your DB the optimal way.

What is shared buffers though ?
shared_buffers defines the block of shared memory that PostgreSQL uses to cache data pages (8kB blocks by default) read from and written to disk.
So basically PostgreSQL temporarily puts data blocks in this memory in order to process them. EVERYTHING goes through the shared buffers.

Why not set shared_buffers to 80% of RAM on a dedicated DB server?
The OS also has a cache, and if you set shared_buffers too high you will most likely get an overlap called double buffering: the same data blocks sitting in both caches.

So, in order to set your shared_buffers you need to know what's happening inside shared memory. PostgreSQL uses an algorithm called clock-sweep: every time you use a data block, a usage counter makes that block harder to get rid of. The block gets a popularity number from 0 to 5, with 5 meaning heavily used and most likely to stay in shared memory.
In theory you want the most popular data blocks inside the shared buffers and the least popular ones out of them. To do that you have to be able to see what is inside the shared buffers, and that's exactly what the pg_buffercache module does. You will find pg_buffercache in contrib.
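To make the usage counter idea more concrete, here is a toy Python model of a clock-sweep cache. It is only a sketch of the general technique, not PostgreSQL's actual buffer manager: there is no pinning, no dirty pages and no locking, and the class and method names are invented for the example. The only detail taken from PostgreSQL is that the counter is capped at 5.

```python
class ClockSweepCache:
    MAX_USAGE = 5  # same cap as the usagecount shown by pg_buffercache

    def __init__(self, size):
        self.pages = [None] * size   # which page sits in each buffer slot
        self.usage = [0] * size      # popularity counter per slot
        self.hand = 0                # position of the clock hand

    def access(self, page):
        """Touch a page; return True on a cache hit, False on a miss."""
        if page in self.pages:
            slot = self.pages.index(page)
            self.usage[slot] = min(self.usage[slot] + 1, self.MAX_USAGE)
            return True
        slot = self._find_victim()
        self.pages[slot] = page
        self.usage[slot] = 1
        return False

    def _find_victim(self):
        # Sweep around the buffers, decrementing counters, until a slot
        # with usage 0 (or an empty slot) is found.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[slot] is None or self.usage[slot] == 0:
                return slot
            self.usage[slot] -= 1
```

With a 2-slot cache, a page that has been touched five times survives the arrival of new pages, while a page touched once is the first to be evicted.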

Let's create 2 tables and full join them, update them and generally run operations on these 2 tables while monitoring the buffer cache.
I will give dramatic examples by setting shared_buffers too low, default and too high, just to demonstrate what the pg_buffercache views will show, and then I will find a good value for this specific workload. I will run the same statements while I analyze what is happening inside shared buffers.
The two most useful SQL statements for the pg_buffercache view are the following :

-- buffers per relation and size
SELECT
c.relname,
pg_size_pretty(count(*) * 8192) as buffered,
round(100.0 * count(*) /
(SELECT setting FROM pg_settings
WHERE name='shared_buffers')::integer,1)
AS buffers_percent,
round(100.0 * count(*) * 8192 /
pg_relation_size(c.oid),1)
AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b
ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.oid,c.relname;

-- buffers per usage count
SELECT
c.relname, count(*) AS buffers,usagecount
FROM pg_class c
INNER JOIN pg_buffercache b
ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname,usagecount
ORDER BY usagecount,c.relname;

Wednesday, 27 March 2013

PostgreSQL backup and recovery

One of the main tasks of any kind of administrator is to make sure that the data he's responsible for will be available if anything bad happens (asteroids, flood, locusts, hail). In order to do that you need a high availability solution for continuity and a good backup plan to support your system in case of human error (-Hello Mr. admin, I just dropped the customer payment table). Trust me, it happens... a lot...
I briefly covered high availability with the implementation of a hot standby in an earlier post, and now it's time to cover the backup options available in PostgreSQL.

There are 2 kinds of backups in PostgreSQL, physical and logical.

The good thing with logical backups is that they are simple to implement and maintain, and selective backup and restore (even in later PG versions) is easy. Usually the backup output consists of one file, which of course can be compressed. The major con of this method is that it lacks Point In Time Recovery. (A database with the PITR feature can be restored or recovered to the state that it had at any time since PITR logging was started for that database.) It also lacks incrementality (each backup is a new WHOLE backup of your database). This makes these kinds of backups not really usable in very large production installations. Before you decide to say "Not good", let me list some cases where these kinds of backups would be better than incremental backups:
Test servers without much data. Small installations where the output will be small and the backups can be taken on a daily (or even less frequent) basis. Large installations that don't have many data changes over the day. Data warehouses and reporting servers.

Once again PGlab1 will be my lab rat. Don't worry, No Animals or Data Were Harmed.


export backupfile="backup_$(date +%d%m%Y).tgz"
pg_ctl stop && tar cvfz ~/"$backupfile" "$PGDATA" && pg_ctl start

Basically I just stopped the database, took a tgz of my $PGDATA and started the DB again.
Simple and effective. The restore can be done to a different path than $PGDATA; just make sure you pass -D to pg_ctl or set PGDATA to the correct path before you start.

pg_dump and pg_dumpall

pg_dump and pg_dumpall export one or all databases to a (by default) human-readable SQL format.
The output can be compressed, and they support a lot of options like data only, schema only etc.

I have a database for this post's sake called testbackup.

If I run pg_dump testbackup I get on stdout, one by one, the SQL commands that I would need to remake the database from scratch, so just by redirecting it to a file you have a simple backup. I won't get into details about the format of this file; I will just say that at first you will see connection details like the encoding and the extensions that exist, then the table creation script, then the data (using a Postgres command called COPY) and then the indexes and constraints.
NOTE that taking a pg_dump won't back up the users, because users in Postgres are global to the whole cluster rather than belonging to one database. To back up users you can use pg_dumpall -g (-g means globals).

Here's a script that i am using in order to take this kind of backup :

export PGDATA=/opt/db/data
export PGPASSFILE=/opt/db/data/pgpass.conf

# The definitions below were not part of the original listing; the values
# are placeholders (assumptions), adjust them to your environment.
backupdir=/opt/db/backups
retention=7
psqluser=postgres
psql=psql
pgdump=pg_dump
pgdumpall=pg_dumpall

usersfilename="$(date +%d%m%Y).bck.users"
$pgdumpall -g -U $psqluser > "$backupdir/$usersfilename" && find "$backupdir"/*bck.users* -ctime +$retention -exec rm {} \;

for db in $(echo "select datname from pg_database
where datname not in ('template0','template1','postgres');
" | $psql -A -t postgres)
do
    backupname="${db}_$(date +%d%m%Y).bck"
    logfilename="${db}_$(date +%d%m%Y).bck.log"
    $pgdump -Fc -v -f "$backupdir/$backupname" -U $psqluser "$db" 2> "$backupdir/$logfilename" && find "$backupdir"/"$db"*bck* -ctime +$retention -exec rm {} \;
done

Notice that I use the -Fc switch in pg_dump. That means custom format, which can be used for selective restore with the pg_restore command. If I had one of these backups and I wanted to restore the table "customers" I would run :
pg_restore -Fc -t customers -U <username> -h <host name> -d <db name> <file name>
NOTE that there is a switch (-j) for parallelism.
More about pg_restore, pg_dump and pg_dumpall can be found in the PostgreSQL documentation.
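The retention part of the backup script relies on find -ctime +N, which is easy to get wrong by a day. As a hedged illustration of what that test selects, here is a small Python sketch; the function name and the optional now parameter are mine, and it only lists candidates instead of deleting anything.

```python
import glob
import os
import time


def expired_backups(backupdir, pattern, retention_days, now=None):
    """Return the files matching `pattern` that `find -ctime +N` would pick."""
    now = time.time() if now is None else now
    expired = []
    for path in sorted(glob.glob(os.path.join(backupdir, pattern))):
        # find counts whole 24 hour periods and ignores the fraction,
        # so "+7" really means "at least 8 full days old"
        age_days = int((now - os.stat(path).st_ctime) // 86400)
        if age_days > retention_days:
            expired.append(path)
    return expired
```

Running it before enabling the rm in the shell script is a cheap way to see which dumps would be removed.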

Now that we are done with database dump backup basics, let's move on to live (or online) backup, PITR and timelines.
In order to get a backup that is incremental you need a base backup plus all the changes that transactions make to the database, so you need the transaction logs or, as we call them in Postgres, WAL segments. I won't say much about how the transaction mechanism works in Postgres; this is a backup and restore post, so I will leave the WAL mechanism for another post.

Standalone hot physical database backup

I will use the following directories and variables for example's sake :

export BACKUPNAME="backup_$(date +%d%m%Y).tgz"

postgres@pglab1:/opt$ ls -l
total 8
drwxr-xr-x  2 postgres postgres 4096 Mar 27 12:32 BACKUP
drwx------ 15 postgres postgres 4096 Mar 27 11:40 PGDATA

mkdir /opt/BACKUP/archive /opt/BACKUP/archives

Set an archive_command in postgresql.conf and restart the server:
wal_level = archive 
archive_mode = on
archive_command = 'test -f /opt/BACKUP/archiving/archiving_active && cp %p /opt/BACKUP/archive/%f'

mkdir /opt/BACKUP/archiving/
touch /opt/BACKUP/archiving/archiving_active

now run :
psql -c "select pg_start_backup('BACKUP')"
tar -cvz --exclude=$PGDATA/pg_xlog -f ../BACKUP/$BACKUPNAME $PGDATA
psql -c "select pg_stop_backup(), current_timestamp"

now, lets crash and restore

rm -rf /opt/PGDATA/*  (yoohoo !!!)
Untar the backup (.tgz) into $PGDATA. The pg_xlog dir will be missing, so create it as the postgres user.
Then create a file called recovery.conf in $PGDATA and add :
restore_command = 'cp /opt/BACKUP/archive/%f %p'

start the database and watch the logfile, it should show something like :

2013-03-27 13:22:58 EET::@:[3047]: LOG:  archive recovery complete
2013-03-27 13:22:58 EET::@:[3045]: LOG:  database system is ready to accept connections
2013-03-27 13:22:58 EET::@:[3061]: LOG:  autovacuum launcher started

the recovery.conf will also be automatically renamed to recovery.done.

Hot physical backup & Continuous Archiving

Now this is what you would want for a mission-critical production installation with a lot of GBs or TBs of data and a lot of concurrent users hitting the DB 24/7.
For example's sake I will delete my whole cluster and go through the steps one at a time. The backup will be taken locally, which of course is not recommended, and at the end I will perform a PITR and also talk about timelines.

 edit postgresql.conf and enable archiving :

wal_level = archive
archive_mode = on 
archive_command = 'cp %p /opt/BACKUP/archives/%f' 
(NOTE that archive_command can be scp, a more advanced external script or anything that would transfer the archived WALs to the desired location)
restart the server
psql -c "select pg_start_backup('my backup')"

You can now tar, rsync or whatever you want to another node. Something like
"rsync -cva --inplace --exclude=*pg_xlog* ${PGDATA} $OTHERNODE:$BACKUPNAME/$PGDATA"
would work.

for my example, i will just use tar like the previous example:
tar -cvz --exclude=/opt/PGDATA/pg_xlog/ -f /opt/BACKUP/backup.tgz $PGDATA

psql -c "select pg_stop_backup(), current_timestamp"

At this moment I have a base backup and the mechanism that archives all WAL segments. Let's add some data and force some checkpoints.

Notice that the archives directory now has WALs :
postgres@pglab1:/opt/PGDATA/pg_xlog$ ls -l /opt/BACKUP/archives/
total 49156
-rw------- 1 postgres postgres 16777216 Mar 27 13:57 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Mar 27 14:02 000000010000000000000002
-rw------- 1 postgres postgres      293 Mar 27 14:02 000000010000000000000002.00000020.backup
-rw------- 1 postgres postgres 16777216 Mar 27 14:04 000000010000000000000003

A WAL segment is archived when it fills up (16MB by default) or, if archive_timeout is set, when that timeout expires. Checkpoints, with the default postgresql.conf values, happen every 3 WAL segments (checkpoint_segments) or every 5 minutes (checkpoint_timeout), whichever happens first. These parameters can, and should, be changed for performance's sake depending on your workload, so monitor checkpoint frequency.
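The archive_command used here is a plain cp. As a sketch of what a "more advanced external script" might do, the Python function below refuses to overwrite a segment that is already in the archive and copies through a temporary file, so a half-written segment never appears under its final name. The function name and the .tmp suffix are my own choices; a production script would also want an fsync and logging.

```python
import os
import shutil


def archive_wal(wal_path, archive_dir):
    """Copy one WAL segment; return 0 on success, non-zero on failure.

    PostgreSQL treats a non-zero exit status from archive_command as
    "try this segment again later".
    """
    target = os.path.join(archive_dir, os.path.basename(wal_path))
    if os.path.exists(target):
        return 1  # never overwrite an already archived segment
    tmp = target + ".tmp"
    shutil.copyfile(wal_path, tmp)
    os.rename(tmp, target)  # atomic within the same filesystem
    return 0
```

Wrapped in a small script that passes the return value to sys.exit, it could be wired in with something like archive_command = 'python /path/to/archive_wal.py %p', where the script path is hypothetical.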

Now let's say that something really bad happened, like a mistaken but committed update on the table backup. To make it easier for me, I created that table with a datetime column with default value now().
So we have :

          datetime          | count
----------------------------+-------
 2013-03-27 14:05:05.999257 |  1000
 2013-03-27 14:05:14.911462 |  1000
 2013-03-27 14:05:19.419173 |  1000
 2013-03-27 14:05:25.631254 |  1000
 2013-03-27 14:06:39.97177  |  1000
 2013-03-27 14:09:53.571976 |  1000

Let's also assume that we know that the update was recorded at 2013-03-27 14:05:25.631254 and we want the database back at a point in time before that.

edit a recovery.conf as we did before :
restore_command = 'cp /opt/BACKUP/archives/%f %p'
recovery_target_time = '2013-03-27 14:04:00'

and restart the db, check the logfile , you'll see something like :
LOG:  starting point-in-time recovery to 2013-03-27 14:04:00+02

Now let's recover from a crash.
Once again, rm -rf /opt/PGDATA/*

Untar the base backup, place recovery.conf with or without the recovery_target_time
and start the database.
In my example I also did a PITR to 2013-03-27 14:01:00
and the table now has :
 1 | 2013-03-27 13:56:49.163269


The PostgreSQL documentation describes timelines much better than I could, so here it is, right from the documentation :
The ability to restore the database to a previous point in time creates some complexities that are akin to science-fiction stories about time travel and parallel universes. For example, in the original history of the database, suppose you dropped a critical table at 5:15PM on Tuesday evening, but didn't realize your mistake until Wednesday noon. Unfazed, you get out your backup, restore to the point-in-time 5:14PM Tuesday evening, and are up and running. In this history of the database universe, you never dropped the table. But suppose you later realize this wasn't such a great idea, and would like to return to sometime Wednesday morning in the original history. You won't be able to if, while your database was up-and-running, it overwrote some of the WAL segment files that led up to the time you now wish you could get back to. Thus, to avoid this, you need to distinguish the series of WAL records generated after you've done a point-in-time recovery from those that were generated in the original database history.
To deal with this problem, PostgreSQL has a notion of timelines. Whenever an archive recovery completes, a new timeline is created to identify the series of WAL records generated after that recovery. The timeline ID number is part of WAL segment file names so a new timeline does not overwrite the WAL data generated by previous timelines. It is in fact possible to archive many different timelines. While that might seem like a useless feature, it's often a lifesaver. Consider the situation where you aren't quite sure what point-in-time to recover to, and so have to do several point-in-time recoveries by trial and error until you find the best place to branch off from the old history. Without timelines this process would soon generate an unmanageable mess. With timelines, you can recover to any prior state, including states in timeline branches that you abandoned earlier.
Every time a new timeline is created, PostgreSQL creates a "timeline history" file that shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. Therefore, they are archived into the WAL archive area just like WAL segment files. The history files are just small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment files which are large). You can, if you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation.
The default behavior of recovery is to recover along the same timeline that was current when the base backup was taken. If you wish to recover into some child timeline (that is, you want to return to some state that was itself generated after a recovery attempt), you need to specify the target timeline ID in recovery.conf. You cannot recover into timelines that branched off earlier than the base backup.
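Since the timeline ID is part of every WAL segment file name, you can read it straight off the archive listing shown earlier. A small Python sketch (the function name is mine) that splits a 24 character segment name into its three 8 digit hexadecimal parts:

```python
def parse_wal_name(name):
    """Split a WAL segment file name into (timeline, log, segment) integers."""
    if len(name) != 24:
        # .backup and .history files have longer or shorter names
        raise ValueError("not a plain WAL segment name: %r" % name)
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment
```

For the files in the listing above, 000000010000000000000003 comes out as timeline 1, log 0, segment 3; after a completed archive recovery the new segments would start with 00000002.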

I know that this probably needs a review, and the plan is to do it at some point.
Thanks for reading

Tuesday, 26 February 2013

Migrating from oracle to postgres (part 2)

As I said some days ago, I have an Oracle dump from a database that I want to migrate to Postgres for a project. I set up a 100GB VM, installed Oracle 11gR2 and compiled Postgres 9.2.3 on the same VM. I restored the dump to Oracle and then I had to make a migration strategy.
I had some ideas, like getting the schema from the dump with impdp (sqlfile=) and manually translating it to Postgres, but I remembered a project that I saw some time ago called ora2pg. This cool thing is basically a Perl program that needs, apart from Perl (duh), DBI and the DBD drivers for Oracle and Postgres. I set it up and started experimenting, first on the DDLs. The output was actually very good! The only things that I had to change were some function-based indexes, and then Postgres could parse it without problems. Next was the data. No problems there: it created a 20GB output file, and all I had to do was throw it at Postgres. Just because I wasn't very proactive with my VM's disk space I faced some problems... 20GB from the export plus the Oracle data (36664764 /opt/oradata/) + Oracle software + other things?
That was about 65% of my disk space, and I also had to maintain a second live copy of the data plus indexes on Postgres.. So I gzipped the whole output file and used "zcat | psql". Worked like a charm; actually it's still running as I write this because ...
And here are some things to take into consideration before starting!
The dump file is 20GB; what happens if something goes wrong?
You drop the database, fix the DDL script and reapply the data.
Do it one object at a time, not like I did, or at least separate the small tables from the big ones into groups and do the big tables one at a time, or something like :
ERROR: invalid input syntax for integer: "26.19"
CONTEXT: COPY twitter_tmp, line 1, column network_score: "26.19"

might end up fucking you up !
ora2pg translated a number column as integer, something that came back to bite me in the ass 2 hours into the import run...
I changed the DDL script and reran it; if it fails again I have to start over.. SO, don't be like me, act smart, do it one part at a time :)
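One cheap way to catch this kind of type mismatch before a 2 hour import is to scan the data for values that will not fit the target column. The Python sketch below assumes the data rows are in COPY's default text format (tab-separated, \N for NULL); the function name and the idea of passing a column position are mine, not something ora2pg provides.

```python
def find_non_integers(rows, column, limit=5):
    """Return up to `limit` (line number, value) pairs that are not integers."""
    bad = []
    for lineno, row in enumerate(rows, 1):
        fields = row.rstrip("\n").split("\t")
        value = fields[column]
        if value == r"\N":          # COPY's default NULL marker
            continue
        try:
            int(value)
        except ValueError:
            bad.append((lineno, value))
            if len(bad) >= limit:
                break
    return bad
```

Run against the first few thousand data rows of each big table, it would have flagged "26.19" in network_score long before psql got there.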
In all fairness, I still have no idea what tables are in the database, and ora2pg put my job on easy mode. So I can't blame it; I can only blame myself for not MAKING A MIGRATION STRATEGY after all.
I will also show you the parameters that I changed to make the import faster; the difference is actually HUGE.

shared_buffers = 2GB
synchronous_commit = off
wal_buffers = 16MB
wal_writer_delay = 1000ms
checkpoint_segments = 64  

Anyway, so far I think I'm good with the table definitions (including constraints) and the data (hopefully), but there is a lot more to be done: views, code etc...

thanks for reading
to be continued...

Thursday, 21 February 2013

PostgreSQL Partitioning

Partitions are very useful when it comes to big tables. As a rule of thumb, the documentation suggests considering partitioning once a table grows larger than the physical memory of the database server. In Postgres there are 2 kinds of partitioning:
  • Range
  • List
The implementation needs just the following steps:
  1. Enable constraint exclusion in the config file
  2. Create a master table
  3. Create child tables WITHOUT overlapping table constraints
  4. Create indexes and PKs
  5. Create a function and a trigger to insert data into the child tables
The first thing to notice here is that partitioning is basically built on table inheritance. But enough with how things work in theory, let's create one.

First, check postgresql.conf for this parameter:

constraint_exclusion = partition 

Now, let's create the master and child tables:

CREATE TABLE sales (
    sales_id serial NOT NULL,
    sales_date date NOT NULL DEFAULT CURRENT_DATE,
    description text );

CREATE TABLE sales_2013_p1 (
CHECK ( sales_date >= DATE '2013-01-01' AND sales_date < DATE '2013-05-01' )
 ) INHERITS (sales);

 CREATE TABLE sales_2013_p2 (
CHECK ( sales_date >= DATE '2013-05-01' AND sales_date < DATE '2013-09-01' )
 ) INHERITS (sales);

 CREATE TABLE sales_2013_p3 (
CHECK ( sales_date >= DATE '2013-09-01' AND sales_date < DATE '2014-01-01' )
 ) INHERITS (sales);

Notice the keyword INHERITS here :)
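Step 3 above insists on child tables WITHOUT overlapping constraints, and that is easy to get wrong when you type the dates by hand. A minimal Python sketch of a sanity check, assuming the ranges are kept in a plain dict (the dict and the name find_overlaps are my own illustration, not part of Postgres):

```python
from datetime import date
from itertools import combinations

# Half-open ranges [start, end), mirroring the CHECK constraints above.
PARTITIONS = {
    "sales_2013_p1": (date(2013, 1, 1), date(2013, 5, 1)),
    "sales_2013_p2": (date(2013, 5, 1), date(2013, 9, 1)),
    "sales_2013_p3": (date(2013, 9, 1), date(2014, 1, 1)),
}

def find_overlaps(partitions):
    """Return the pairs of partition names whose date ranges overlap."""
    overlaps = []
    pairs = combinations(sorted(partitions.items()), 2)
    for (name_a, (start_a, end_a)), (name_b, (start_b, end_b)) in pairs:
        # Two half-open ranges overlap when each starts before the other ends.
        if start_a < end_b and start_b < end_a:
            overlaps.append((name_a, name_b))
    return overlaps

print(find_overlaps(PARTITIONS))  # prints []
```

Using >= on the lower bound and < on the upper bound is what lets 2013-05-01 belong to exactly one partition.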
Next, PKs and indexes on the child tables:

ALTER TABLE sales_2013_p1 ADD CONSTRAINT sales_2013_p1_pkey PRIMARY KEY (sales_id, sales_date);
ALTER TABLE sales_2013_p2 ADD CONSTRAINT sales_2013_p2_pkey PRIMARY KEY (sales_id, sales_date);
ALTER TABLE sales_2013_p3 ADD CONSTRAINT sales_2013_p3_pkey PRIMARY KEY (sales_id, sales_date);

CREATE INDEX idx_2013_p1 ON sales_2013_p1 (sales_date);
CREATE INDEX idx_2013_p2 ON sales_2013_p2 (sales_date);
CREATE INDEX idx_2013_p3 ON sales_2013_p3 (sales_date);

And finally, a function that returns a trigger, and the on-insert trigger itself.

CREATE OR REPLACE FUNCTION sales_trig_func()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.sales_date >= DATE '2013-01-01' AND NEW.sales_date < DATE '2013-05-01' ) THEN
        INSERT INTO sales_2013_p1 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-05-01' AND NEW.sales_date < DATE '2013-09-01' ) THEN
        INSERT INTO sales_2013_p2 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-09-01' AND NEW.sales_date < DATE '2014-01-01' ) THEN
        INSERT INTO sales_2013_p3 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range.!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER insert_on_sales
    BEFORE INSERT ON sales
    FOR EACH ROW EXECUTE PROCEDURE sales_trig_func();
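Writing the IF/ELSIF chain by hand gets tedious as partitions pile up. Here is a rough Python sketch that renders the function text from a list of ranges. render_trigger_function and its arguments are hypothetical helpers of mine, and the wrapper lines (CREATE OR REPLACE FUNCTION, the ELSE branch, RETURN NULL) are just the usual plpgsql shape for this kind of routing trigger.

```python
def render_trigger_function(partitions, func_name="sales_trig_func", column="sales_date"):
    """Render the routing trigger function as SQL text.

    `partitions` is a list of (child_table, start, end) tuples with ISO date
    strings; each range is half-open: start <= value < end.
    Names are interpolated as-is, so only feed it identifiers you trust.
    """
    if not partitions:
        raise ValueError("at least one partition is required")
    lines = [f"CREATE OR REPLACE FUNCTION {func_name}()", "RETURNS TRIGGER AS $$", "BEGIN"]
    for position, (table, start, end) in enumerate(partitions):
        keyword = "IF" if position == 0 else "ELSIF"
        lines.append(
            f"    {keyword} ( NEW.{column} >= DATE '{start}' AND NEW.{column} < DATE '{end}' ) THEN"
        )
        lines.append(f"        INSERT INTO {table} VALUES (NEW.*);")
    lines += [
        "    ELSE",
        "        RAISE EXCEPTION 'Date out of range.!';",
        "    END IF;",
        "    RETURN NULL;",  # the row already went to a child, so skip the master
        "END;",
        "$$ LANGUAGE plpgsql;",
    ]
    return "\n".join(lines)

print(render_trigger_function([
    ("sales_2013_p1", "2013-01-01", "2013-05-01"),
    ("sales_2013_p2", "2013-05-01", "2013-09-01"),
    ("sales_2013_p3", "2013-09-01", "2014-01-01"),
]))
```

The output can be pasted into psql; adding a partition then means adding one tuple to the list and rendering again.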

Now that we have a table with a basic partitioning scheme, let's assume that we want to add more partitions for 2014. We need to create a new child table; for example's sake, I will just create a single partition for the whole of 2014.

CREATE TABLE sales_2014 (
CHECK ( sales_date >= DATE '2014-01-01' AND sales_date < DATE '2015-01-01' )
 ) INHERITS (sales);

ALTER TABLE sales_2014 ADD CONSTRAINT sales_2014_pkey PRIMARY KEY (sales_id, sales_date);

CREATE INDEX idx_2014 ON sales_2014 (sales_date);

Then replace the trigger function so it also routes the 2014 rows:

CREATE OR REPLACE FUNCTION sales_trig_func()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.sales_date >= DATE '2013-01-01' AND NEW.sales_date < DATE '2013-05-01' ) THEN
        INSERT INTO sales_2013_p1 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-05-01' AND NEW.sales_date < DATE '2013-09-01' ) THEN
        INSERT INTO sales_2013_p2 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-09-01' AND NEW.sales_date < DATE '2014-01-01' ) THEN
        INSERT INTO sales_2013_p3 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2014-01-01' AND NEW.sales_date < DATE '2015-01-01' ) THEN
        INSERT INTO sales_2014 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range.!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

and we are done!
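Since every new partition needs the same three statements (child table, PK, index), they are easy to template. A small Python sketch under my own assumptions: render_child_ddl is a hypothetical helper, and the idx_<child> index name is a convention I picked for the example rather than the idx_2014 name used above.

```python
from datetime import date

def render_child_ddl(parent, child, start, end,
                     column="sales_date", pk_columns=("sales_id", "sales_date")):
    """Return the DDL statements for one child partition covering [start, end)."""
    if start >= end:
        raise ValueError("start must be before end")
    check = (f"{column} >= DATE '{start.isoformat()}' "
             f"AND {column} < DATE '{end.isoformat()}'")
    pk_list = ", ".join(pk_columns)
    return [
        f"CREATE TABLE {child} (\nCHECK ( {check} )\n) INHERITS ({parent});",
        f"ALTER TABLE {child} ADD CONSTRAINT {child}_pkey PRIMARY KEY ({pk_list});",
        f"CREATE INDEX idx_{child} ON {child} ({column});",
    ]

for statement in render_child_ddl("sales", "sales_2014", date(2014, 1, 1), date(2015, 1, 1)):
    print(statement)
```

Identifiers are pasted into the SQL without quoting, so this is only meant for names you control yourself.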

Now let's say that the 2013_p1 data is obsolete and we want to move it to a historical database: drop the table, correct and replace the function, and you are done.
This is how the master table would look after these operations:

partition=# \d+ sales
                                                   Table "public.sales"
   Column    |  Type   |                        Modifiers                         | Storage  | Stats target | Description
-------------+---------+----------------------------------------------------------+----------+--------------+-------------
 sales_id    | integer | not null default nextval('sales_sales_id_seq'::regclass) | plain    |              |
 sales_date  | date    | not null default ('now'::text)::date                     | plain    |              |
 description | text    |                                                          | extended |              |
Triggers:
    insert_on_sales BEFORE INSERT ON sales FOR EACH ROW EXECUTE PROCEDURE sales_trig_func()
Child tables: sales_2013_p2,
Has OIDs: no

The good thing about this approach is that partitions are generally easy to maintain and administer. Child tables can have different indexes from each other, and you can of course delete large portions of data that are no longer needed just by dropping a partition. The performance is VERY good on insert and select, and maintenance work like reindex is faster. Reindex in particular won't lock the whole master table for writes.
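The select performance comes from constraint exclusion: with constraint_exclusion = partition, the planner can skip any child whose CHECK constraint cannot match the WHERE clause. The following Python snippet is only a simplified model of that decision for a date range query, not how the planner is actually implemented; CHILDREN and children_to_scan are names I made up for the illustration.

```python
from datetime import date

# Child tables left after dropping sales_2013_p1, with their CHECK ranges [start, end).
CHILDREN = {
    "sales_2013_p2": (date(2013, 5, 1), date(2013, 9, 1)),
    "sales_2013_p3": (date(2013, 9, 1), date(2014, 1, 1)),
    "sales_2014": (date(2014, 1, 1), date(2015, 1, 1)),
}

def children_to_scan(children, query_start, query_end):
    """Children that may hold rows with query_start <= sales_date < query_end."""
    return [
        name
        for name, (start, end) in children.items()
        if start < query_end and query_start < end
    ]

# A query for June 2013 only needs one child table.
print(children_to_scan(CHILDREN, date(2013, 6, 1), date(2013, 7, 1)))  # prints ['sales_2013_p2']
```

To see what the real planner does, run EXPLAIN on a query with a sales_date filter and check which child tables show up in the plan.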

Thanks for reading

Tuesday, 19 February 2013

oracle 11gr2 to postgres 9.2.3 (part 1)

A friend of mine gave me an export (expdp) of an Oracle 10g database. It's almost 4GB compressed, so I reckon it's about 12-14GB. The job is to migrate it to Postgres... The plan is to create a VM running 11gR2 on Oracle Linux, compile Postgres on the same VM and start migrating. Currently I have a couple of ideas on how to do it, but the whole thing depends on what code, functions and stored procedures the export contains; from the export log I saw quite a complicated schema...

to be continued...

proper installation

Recently at work I got assigned to upgrade some (about 10) Postgres 8.1.x servers to 9. I have always liked compiling, basically because I like the flexibility it offers, and that's what I proposed to the guys in charge of the project. They gave me a test system (VM) to play with; in all fairness, they were a bit skeptical about the idea of compiling the RDBMS, mostly because of reliability concerns (don't ask me why). I explained that upgrading from source would be much easier later on, and that over the last year the PostgreSQL developers have done so much work that it cannot be ignored (pg_basebackup, for example). Since the latest CentOS package was 9.1.x, they agreed and I started working.
PostgreSQL is so easy to compile: no strange dependencies, not many libraries needed. And about those reliability concerns? Because the developers are so neat, and I quote from

"While upgrades always have some risk, PostgreSQL minor releases fix only frequently-encountered security and data corruption bugs to reduce the risk of upgrading. The community considers not upgrading to be riskier than upgrading."

You should never find an unexpected change that breaks an application in a minor PostgreSQL upgrade. Bug, security, and corruption fixes are always done in a way that minimizes the odds of introducing an externally visible behavior change, and if that's not possible, the reason why and the suggested workarounds will be detailed in the release notes. What you will find is that some subtle problems, resulting from resolved bugs, can clear up even after a minor version update. It's not uncommon to discover a report of a problem to one of the PostgreSQL mailing lists is resolved in the latest minor version update compatible with that installation, and upgrading to that version is all that's needed to make the issue go away.

So don't be afraid to upgrade, DO IT

If you want to compile Postgres you will need (apart from the basic development tools) the following 2 libraries:

zlib1g-dev (compression library needed if you want to compress directly from postgres, pg_dump -Z for example)

libreadline6-dev (used by the client to support history on psql)

You could call both of these libraries optional, and you actually can compile Postgres without them, but meh, don't do that...

Other than that it's pretty much straightforward:
./configure
make
make install (check the documentation for extra flags, change of prefix etc.)

Postgres installs under /usr/local/pgsql by default (binaries in /usr/local/pgsql/bin).
Packages, on the other hand, install into /usr/bin (MEGA SUPER LAME)

SO! Let's look at the current situation. I inherited a PostgreSQL server with a package installation, with the binaries located in /usr/bin and the postgres home in /var/lib.
If it had been compiled, I could just go to /usr/local/pgsql, rename it to /usr/local/pgsql8, remove it from the path and let it be. But yeah, I couldn't do that, so I fixed the path instead: I put /usr/local/pgsql/bin before /usr/bin, so the first Postgres binaries in path order were 9's, and I was OK. But still! Packages tend to integrate deep into the OS, making any other option (like running 2 Postgres servers at the same time) more difficult than it should be.
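The path trick works because the shell takes the first matching binary in PATH order. A tiny Python demonstration of that rule, using two throwaway directories as stand-ins for /usr/local/pgsql/bin and /usr/bin (the helper names are mine, and the sketch assumes a POSIX system):

```python
import os
import shutil
import tempfile

def make_fake_binary(directory, name):
    """Create a tiny executable script so the PATH lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, 0o755)
    return path

def first_in_path(command, directories):
    """Return the first executable called `command`, searching directories in order."""
    return shutil.which(command, path=os.pathsep.join(directories))

with tempfile.TemporaryDirectory() as compiled_bin, tempfile.TemporaryDirectory() as package_bin:
    make_fake_binary(compiled_bin, "postgres")
    make_fake_binary(package_bin, "postgres")
    # Compiled directory listed first: its binary wins.
    winner = first_in_path("postgres", [compiled_bin, package_bin])
    compiled_wins = os.path.dirname(winner) == compiled_bin
    # Flip the order and the package binary wins instead.
    flipped = first_in_path("postgres", [package_bin, compiled_bin])
    package_wins_when_flipped = os.path.dirname(flipped) == package_bin

print(compiled_wins, package_wins_when_flipped)
```

On the real server, "which postgres" and "postgres --version" tell you which installation the path currently resolves to.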

I like adding some of the contrib modules to Postgres for logging, monitoring and tuning; pg_buffercache, pgbench, pg_stat_statements and pg_freespacemap are some examples.
In a compiled environment nothing is easier than that: nothing extra to download, not much to do, just compile the contrib module and add it to the database with CREATE EXTENSION.
Now let's say that after a while a new release of Postgres comes out and I want to upgrade, just because I want to stay current, or because a new wow feature was added or a bug was fixed. All I have to do is compile the new version (assuming it's a minor release), replace the binaries and restart the server. A package-installed Postgres server would have to wait until the distribution released a package of the new server, usually a few versions behind current, which means what? A month after the release date? Maybe more!?
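The "assuming it's a minor release" part matters: in the 8.x/9.x numbering used here, the first two numbers name the major series (9.2) and the third is the minor release (9.2.3), and only minor releases can be swapped in by replacing binaries. A short Python sketch of that check; is_minor_upgrade is a hypothetical helper of mine and only understands three-part version strings.

```python
def parse_version(version):
    """Split an 'X.Y.Z' version string into a tuple of three integers."""
    parts = tuple(int(piece) for piece in version.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    return parts

def is_minor_upgrade(current, target):
    """True when target is a newer minor release of the same major series."""
    cur = parse_version(current)
    new = parse_version(target)
    return cur[:2] == new[:2] and new[2] > cur[2]

print(is_minor_upgrade("9.2.3", "9.2.4"))  # prints True
print(is_minor_upgrade("9.1.8", "9.2.3"))  # prints False
```

Anything that returns False here is a major upgrade and needs a dump and restore or pg_upgrade rather than a binary swap.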

Basically, the only thing I can imagine being easier with a package installation is the init scripts, and yeah, OK, I don't think that is a huge pro compared to a compiled version.

SO my suggestion is: compile the damn thing. COMPILE!

Thanks for reading

Monday, 18 February 2013

binary replication in practice

A couple of days ago I started writing a short howto about streaming replication in PostgreSQL 9.2. Most of these things are well documented, but in this howto I will also experiment with switchovers and switchbacks. It aims to show how easy it is to set up right out of the box.

Streaming replication PostgreSQL 9.2

For my example I will use 2 Debian VMs, (pglab1) and (pglab2).

- not mandatory -

Exchange ssh keys for passwordless ssh; they might come in handy if we need to scp scripts, rsync or whatever.

if you don't know how to do it, follow these steps :

After you create a Postgres cluster using initdb,
edit the master's postgresql.conf and change the following:
listen_addresses = '*'
wal_level = hot_standby #(could be archive too)
max_wal_senders = 5
hot_standby = on

create a replication user :
create user repuser replication password 'passwd';

edit pg_hba.conf and add :

host all all trust
host replication repuser md5


Now ssh to the slave and from $PGDATA run:

pg_basebackup -D /opt/data/ -v -h -U repuser
Enter the password; this will transfer a full copy of your cluster from your master. Check the pg_basebackup documentation for compression and the other available options.

In $PGDATA create a file called recovery.conf containing:
standby_mode = on
 primary_conninfo = 'host= port=5432 user=repuser password=passwd'
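If you set up more than one standby, or keep flipping roles as in the switchback further down, it can help to generate this file instead of typing it. A minimal Python sketch: render_recovery_conf is a hypothetical helper, and the host name pglab1 assumes the VM name resolves (use the master's IP otherwise).

```python
def render_recovery_conf(host, port=5432, user="repuser", password="passwd"):
    """Return the text of a minimal recovery.conf for a streaming standby.

    Values are pasted as-is, so this only works for values without spaces
    or quotes; anything fancier needs proper conninfo quoting.
    """
    conninfo = f"host={host} port={port} user={user} password={password}"
    return f"standby_mode = on\nprimary_conninfo = '{conninfo}'\n"

print(render_recovery_conf("pglab1"), end="")
```

Writing the result to $PGDATA/recovery.conf on the slave gives the same two settings as the hand-written file.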

With the master up, start the slave; it should say:
LOG: database system is ready to accept read only connections

At this point you will have 2 nodes running, with your master accepting read/write operations and your slave accepting read-only operations (reporting goes here, maybe?). Now, let's say the master crashes and you need to fail over and promote the slave as the new master.


Shut down the master.
On the slave, execute:
pg_ctl promote

That's it, your slave (pglab2) is now the master, accepting all kinds of connections.

Now let's say that the ex-master (pglab1) is fixed and ready to come up again.

- Switch back to original Master -

on the current master (pglab2) :

echo "select pg_start_backup('clone',true);" |psql pgbench
rsync -av --exclude postgresql.pid /opt/data/*
echo "select pg_stop_backup();"|psql pgbench

This will sync all data from my current master (pglab2) to my soon-to-be slave (pglab1), which should currently be down.

Edit recovery.done and fix the IP of the current master.
Rename recovery.done to recovery.conf.
Start the ex-master (pglab1), now as a slave; then, with pglab2 shut down, promote pglab1.
Recreate the slave (pglab2) with rsync, edit its recovery.conf and start it again.
The servers now have their original roles.

Note that you can end up with 2 master databases, and this might (and probably will) create a mess, so be sure that you bring down the correct server before promoting.

If you want to check whether a server is acting as master or slave, run:

select pg_is_in_recovery()
If it's true, you're on a slave.
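The same check can guard against the two-masters mess mentioned above. The Python sketch below assumes you have already collected the pg_is_in_recovery() result from each node by whatever means you like; check_topology and the dict layout are my own illustration.

```python
def find_masters(recovery_flags):
    """Nodes whose pg_is_in_recovery() returned false, i.e. nodes accepting writes."""
    return sorted(node for node, in_recovery in recovery_flags.items() if not in_recovery)

def check_topology(recovery_flags):
    """Summarise a {node: pg_is_in_recovery() result} mapping."""
    masters = find_masters(recovery_flags)
    if len(masters) == 1:
        return f"ok: {masters[0]} is the only master"
    if not masters:
        return "no master: every node is in recovery"
    return "split brain: " + ", ".join(masters) + " all accept writes"

print(check_topology({"pglab1": True, "pglab2": False}))  # prints ok: pglab2 is the only master
```

Running something like this before and after a promote is a cheap way to confirm that exactly one node is taking writes.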

Thanks for reading.

Here we are

Hello, my name is Vasilis. I have been a database admin for 10 years now, and in this blog I will try to share my database "adventures", post howtos and review database products, news, releases etc.
This blog will be mainly about PostgreSQL, my (by far) favorite RDBMS,
but I might as well write about Oracle and MS SQL Server.