Friday 19 July 2013

Riak, testing backups or duplicating a cluster

This week I was asked to test a Riak backup for a client. Riak, in terms of administration, is really easy; a backup means making a tarball of the following (a small example follows the list):
  • Bitcask data: /var/lib/riak/bitcask
  • LevelDB data: /var/lib/riak/leveldb
  • Ring data: /var/lib/riak/ring
  • Configuration: /etc/riak
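Something along these lines should do it. This is just a rough sketch, assuming the default package-install paths listed above and a node that is either stopped or can tolerate a hot copy:

# example only: archive one node's data and config directories
tar czf riak-backup-$(hostname)-$(date +%Y%m%d).tar.gz \
    /var/lib/riak/bitcask \
    /var/lib/riak/leveldb \
    /var/lib/riak/ring \
    /etc/riak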

That sounds simple enough and, to be honest, it is. But all backups have to be tested, and that can be anything from a simple "tar tvfz" to bringing up a second cluster with the backup data. Some logical questions come to mind here; I will point out a few, but there are probably more.
  1. Does a 2 node cluster mean that you need 2 nodes to restore?
  2. If I bring the cluster up, what guarantees that it won't join my primary cluster?
  3. How can I test that the backup worked?
Answers: 1. Yes and no. 2. I do, and I will show you how. 3. The same way you test your Riak installation: by running queries.

Now, I'm not an expert in Riak, I only touched it for the first time 3 days ago, but for this specific scenario I am pretty confident.

Let's say you have a 2 node cluster, which means 2 machines; the database lives in 2 different places, right? Right, but you can bring 2 nodes up on a single server. To do that you have to compile Riak from source, otherwise you will have one global configuration file and you won't be able to run 2 nodes on the same host.

To compile Riak you will need the following packages along with their dependencies:
g++, make, gcc, erlang, erlang-dev, git

To compile, untar the downloaded source and type "make rel". This will give you a directory called rel, and inside it a directory called riak. You may copy that to riak1, riak2, riak-test, riak-dev etc.; each copy has its own executables, config files and bitcask directories, which means you can bring up as many Riak nodes as your hardware allows. A rough outline of these steps is shown below.
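Roughly, the build-and-copy steps look like this. Treat it as a sketch: the package names assume a Debian-style box and the tarball name assumes Riak 1.4.0.

# assumed prerequisites (Debian/Ubuntu package names)
apt-get install g++ make gcc erlang erlang-dev git

# build a release from source (version and paths are just examples)
tar xzf riak-1.4.0.tar.gz
cd riak-1.4.0
make rel

# clone the generated release into two self-contained nodes
cp -a rel/riak rel/riak1
cp -a rel/riak rel/riak2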
Which brings us to the next question. Let's assume you have a 2 node single host cluster ready for startup. Edit the 2 config files of each node (app.config, vm.args) and fix the IPs and ports, BUT pay attention in vm.args to set the -setcookie parameter right; by right I mean different from production and common between the 2 test nodes. The setcookie value determines whether a node can connect to another cluster, so pay attention there.
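For example, the relevant vm.args lines for the two test nodes could look like this (the node names, IP and cookie value are placeholders, not taken from any real setup):

## rel/riak1/etc/vm.args
-name riak1@127.0.0.1
-setcookie riak_test_cookie

## rel/riak2/etc/vm.args
-name riak2@127.0.0.1
-setcookie riak_test_cookie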
Now untar the backups into the new nodes: /root/riak-1.4.0/rel/riak1/data/bitcask for node1 and /root/riak-1.4.0/rel/riak2/data/bitcask for node2, start Riak, and then stop it again: we have to reip the nodes. I'm not sure about this, but I think there are references in the data to the original node name.
To change that you must run: riak-admin reip <old_nodename> <new_nodename>. You started Riak first so that it would create some files called ring files; reip will change those files along with everything else needed. Now you may start your nodes, join the 2 of them and play around (if you don't know how, refer to the Riak site, their documentation is very good). Put together, it looks roughly like the following.
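This is only a sketch for one of the test nodes; the node names and the backup filename are made-up placeholders:

# untar the bitcask backup into the test node's data directory (example paths)
cd /root/riak-1.4.0/rel/riak1
tar xzf /backups/node1-backup.tar.gz
cp -a var/lib/riak/bitcask/* data/bitcask/

# start once so the ring files get created, then stop and rename the node
bin/riak start
bin/riak stop
bin/riak-admin reip riak@old-prod-node1 riak1@127.0.0.1

# bring the node back up; repeat the same for riak2 and then join them
bin/riak start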

To test Riak, use curl. In my tests I inserted data on both nodes of my wannabe production cluster and, after restoring, I queried for known documents. The Riak documentation has quite a few examples; if I put them here they would just be copy-pasted from there, but the general shape is shown below.
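Just to give the idea, storing and fetching a known document over the HTTP interface looks something like this; the bucket, key and port are made-up values:

# put a known document on the "production" cluster before taking the backup
curl -X PUT -H "Content-Type: text/plain" \
     -d "hello" http://127.0.0.1:8098/buckets/test/keys/doc1

# after restoring, ask the test cluster for the same key
curl http://127.0.0.1:8098/buckets/test/keys/doc1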


I am by no means a Riak expert, so if I did something wrong feel free to comment.

Thanks for reading
-Vasilis


Friday 12 July 2013

Implementing data warehousing with Hadoop, HDFS and Hive (Part 1)

About one month ago, a big part of my life changed: I moved to London in order to work with a US based company that I have known for quite a while for its work and for the great IT personalities who work there, OmniTI Consulting. Until then I didn't know much about BigData; actually I had no idea what this BigData gig was or why everyone was talking about it. I tend to be traditional, and these distributed filesystems, noSQL databases and MapReduce functions seemed a bit strange and really unfamiliar to me. I kept saying that at some point RDBMSs would incorporate this functionality and these new technologies would slowly die out, something like ODBMSs, something like json ending up as a PostgreSQL datatype, etc.
Well, after reading a bit about Hadoop I changed my mind: THIS THING IS AWESOME. Those who share the same ideas with me about RDBMSs, bear with me, I will explain in a second why, and, as I always do, with a real scenario from a long time ago when I was working at a telco.

Let me tell you some things about telcos: they have a HIGH need for data warehousing and realtime reporting, and they have A LOT of data! Each call generates at least one record, called a CDR (call detail record), which basically says who called whom, when, how long they talked, etc. The result is a lot of database traffic; back in my day we had something like 1-2m rows daily, which means about 500m rows per year. That doesn't sound like much now, but it was 2002-2003 and the hard disks I was buying for the servers were 18Gb back then.

Now imagine that you are the marketing guy and you want to build tariffs (call cost per destination). You would ask questions like: how many calls from this prefix went to that prefix? At which times? How long did they talk on average? How is this distributed over a week? And so, so, so many more questions... My approach back then was to deploy a statistical server. Sounds awesome, right? IT'S NOT. That server took part of the data and answered specific questions, which populated tables and of course reports, graphs, pies, charts and all the goodies that marketing guys want to look at on a daily basis. So, what happened when they wanted a report for a timeframe that wasn't loaded? That's easy: I fed the reporting server with data from the requested timeframe and the sql/bash scripts did their thing. What happened when they wanted a new report? I CRIED. But men don't cry.
DBAs cry!
I had evaluated products like IQ Server and BusinessObjects: good but limited, fast, but maintenance was a pain in the ass, and also expensive.. VERY expensive. So I decided to stay with my PostgreSQL, buy 3-4 more servers and distribute the job: basically, if I had 1 month's worth of rows (CDRs) to process, I'd give 10 days of CDRs to each server and get the processing done in 1/3 of the time. It was a process that could very easily go south; a slight mistake meant that everything had to run again, and making a mistake wasn't hard at all.

And that's the end of the story of what I did back in 2002, when I was young and trying to be awesome. Now people have 1000 times the data I had back then, and they are answering questions just as complex as the ones I had to answer in much, much less time. How?
The answer is MapReduce.

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It can be written in Python, Java, Ruby etc., and it basically consists of 2 functions: map and reduce.
Map prepares the data (filters and sorts it) and reduce sums it up. It's that simple! In order to do that in a clustered environment you need the nodes to cooperate on a common filesystem; in the Hadoop world that's HDFS, the Hadoop Distributed File System. So basically you can take raw data, db records or postgresql logs and run them through MapReduce functions to get answers to your questions. Here's an example:

A map function:

#!/usr/bin/env python
# mapper: read lines from stdin and emit a (word, 1) pair for every word
import sys

for line in sys.stdin:
    # strip surrounding whitespace and split the line into words
    line = line.strip()
    words = line.split()
    for word in words:
        # emit tab-separated key/value pairs, the format Hadoop streaming expects
        print '%s\t%s' % (word, 1)

A reduce function:
#!/usr/bin/env python
# reducer: sum up the counts per word
# (Hadoop streaming sorts the mapper output by key before it reaches the reducer)

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()

    # parse the tab-separated (word, count) pair produced by the mapper
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # skip lines where the count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # input is sorted by word, so we are done with the previous word
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# don't forget to emit the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

How it works:

echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | \
sort -k1,1 | /home/hduser/reducer.py
bar     1
foo     3
labs    1
quux    2 

Now this code can run unchanged on Hadoop, with many servers doing the work in parallel over much, MUCH more data... I can only imagine what I could have done if I had this thing back in the day... To be continued with a (hopefully) useful example and a benchmark of a one node versus a two node Hadoop cluster; the lab is ready, I just need to come up with an example and code a MapReduce. A sketch of how such a job is launched follows.
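For reference, the same mapper.py / reducer.py pair is run on a cluster through the Hadoop streaming jar, roughly like this; the jar path and the HDFS directories are assumptions for a Hadoop 1.x install, not something from my lab:

# copy the input data into HDFS (paths are examples)
hadoop fs -mkdir /user/hduser/wordcount-in
hadoop fs -put input.txt /user/hduser/wordcount-in

# run the streaming job with the same mapper and reducer scripts
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -file /home/hduser/mapper.py  -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/wordcount-in \
    -output /user/hduser/wordcount-out

# look at the result
hadoop fs -cat /user/hduser/wordcount-out/part-00000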

Thanks for reading
- Vasilis


Setting shared_buffers effectively.

One of the main performance parameters in PostgreSQL is shared_buffers, probably the most important one. There are guidelines and rules of thumb that say to just set it to 20-30% of your machine's total memory.
Don't get me wrong, these rules are generally good and a perfect starting point, but there are reasons why you should tune your shared_buffers in more detail:

a. 30% might not be enough, and you will never know unless you know exactly how to tune this parameter.
b. 30% might be too much, and you are wasting resources in vain.
c. You want to be awesome and tune every bit of your DB the optimal way.

What is shared_buffers though?
shared_buffers defines a block of memory that PostgreSQL will use to hold requests that are awaiting attention from the kernel buffer and CPU.
So basically PostgreSQL temporarily puts data blocks in this memory in order to process them; EVERYTHING goes through the shared buffers.

Why not set shared_buffers to 80% of RAM on a dedicated DB server?
The OS also has a cache, and if you set shared_buffers too high you will most likely end up with an overlap, which is called double buffering: the same data blocks held in both caches.

So, in order to set your shared_buffers you need to know what's happening inside shared memory. PostgreSQL uses an implementation of the clock-sweep algorithm: every time you use a data block its usage counter goes up, making that block harder to get rid of. The block gets a popularity number from 1-5, with 5 meaning heavily used and most likely to stay in shared memory.
In theory you want the most popular data blocks inside the shared buffers and the least popular ones out of them. To do that you have to be able to see what is inside the shared buffers, and that's exactly what the pg_buffercache package does. You will find pg_buffercache in contrib.
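Enabling it is a single statement per database (assuming the contrib packages are installed and you are on 9.1 or newer, where CREATE EXTENSION is available):

-- load the pg_buffercache extension in the database you want to inspect
CREATE EXTENSION pg_buffercache;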

Let's create 2 tables, full join them, update them and generally run operations on these 2 tables while monitoring the buffer cache.
I will give dramatic examples by setting shared_buffers too low, to the default, and too high, just to demonstrate what the pg_buffercache views will show, and then I will find a good value for this specific workflow. I will run the same statements while I analyze what is happening inside the shared buffers. A sketch of the kind of workload I mean follows.
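The table names and sizes here are made up for illustration, not the ones from my actual tests:

-- two throwaway tables filled from generate_series(), then joined and updated
CREATE TABLE t1 AS
    SELECT i AS id, md5(i::text) AS payload
    FROM generate_series(1, 1000000) i;
CREATE TABLE t2 AS
    SELECT i AS id, md5((i * 2)::text) AS payload
    FROM generate_series(1, 1000000) i;

SELECT count(*) FROM t1 FULL JOIN t2 USING (id);

UPDATE t1 SET payload = upper(payload) WHERE id % 10 = 0;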
The two most useful SQL statements against the pg_buffercache view are the following:

-- buffers per relation and size
SELECT
    c.relname,
    pg_size_pretty(count(*) * 8192) AS buffered,
    round(100.0 * count(*) /
          (SELECT setting FROM pg_settings
           WHERE name = 'shared_buffers')::integer, 1)
        AS buffers_percent,
    round(100.0 * count(*) * 8192 /
          pg_relation_size(c.oid), 1)
        AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b
    ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
    ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;

-- buffers per usage count
SELECT
    c.relname, count(*) AS buffers, usagecount
FROM pg_class c
INNER JOIN pg_buffercache b
    ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
    ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname, usagecount
ORDER BY usagecount, c.relname;