Παρασκευή, 12 Ιουλίου 2013

Implementing wata warehousing with Hadoop , HDFS and Hive (Part 1)

About one month ago, a big part of my life changed, i moved to London in order to work with a US based company that i know for quite a while for its work and for the great IT personalities that work there, OmniTI Consulting. Until then i didnt know much about BigData, actually i had no idea what this BigData gig was and why everyone was  talking about it. I tend to be traditional and these distributed filesystems , noSQL databases MapReduce functions seemed a bit strange and really unfamiliar to me. I kept saying that at some point RDBMS's will incorporate this functionality and they will slowly die ,something like ODBMS's , something like json as postgresql datatype etc..
Well, after reading a bit about hadoop i changed my mind -THIS THING IS AWESOME, those who share the same ideas with me about RDBM's , bare with me i will explain in a second why, and as i always do, with a real scenario that i had long time ago when i was working in a telco.

Let me tell you somethings about telcos, they have a HIGH need for data warehousing , realtime reporting and they have A LOT of data ! each call makes at least one record , called CDR (call detail record) saying basically who called whom, when, how much time did they talk etc. the result is a lot of database traffic, back in my day we had something like 1-2m rows daily, which means about 500m rows per year now it doesnt sound like much but it was 2002-2003 and the hard disks that i was buying for the servers were 18Gb back then.

Now imagine that you are the marketing guy and you wanna make tariffs, (call cost per destination). You would ask questions like , how many from that prefix called to that prefix ? which times, how long did they talk in average,how does this distributes in a week ? and so , so , so many more questions... My approach back then was to deploy a statistical server, sounds awesome right ? - ITS NOT, that server took part of the data and answered specific questions that were populating tables and of course reports, graphs , pies , charts and all the goods that marketing guys want to watch in daily basis. So, what happened when they wanted a report but for a non existing  timeframe? Thats easy, i fed the reporting server with data from the requested timeframe and the sql/bash scripts were doing their things. What happened if they wanted a new report ? I CRIED -but men dont cry.
DBA's cry !
I have evaluated products like IQServer and business objects, a good product but limited, fast , but maintenance was a pain in the ass, also expensive.. VERY expensive, so i decided to stay with my PostgreSQL , buy 3-4 more servers and distribute the job, basicaly if i had 1 months rows (CDR's) to process i'd give  10 days cdrs to each server and do the process in 1/3 of the time. a process that was so easy to go south, a slight mistake ment that everything had to run again , and making a mistake wasn't hard at all.

and that's the end of the story of what i did back in 2002 when i was young and tried to be awesome, now people have 1000 times the data that i had back then and they answering questions equally complex with the ones i had to answer in much , much less time. - How ?
the answer is MapReduce.

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. can be written in python , java , ruby etc, it basically consists in 2 functions the map and the reduce
map is preparing the data filter, sort them and reduce is summing them up. It's that simple ! in order to do that in a clustered environment you need nodes to play along in a common filesystem, in Hadoop world thats HDFS, Hadoop Distributed File System. So basically you can get raw data, db records or postgresql logs and parse them through mapreduce functions to get answers to questions. here's an example :

a map function :

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)

a reduce function:
#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()

    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count   
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

how it works:

echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | \
sort -k1,1 | /home/hduser/reducer.py
bar     1
foo     3
labs    1
quux    2 

now this code can run with no change in hadoop having many servers doing this in paralel with many MANY data... I can only imagine what i could do if i had this thing back in the days... To be continued with a useful (hopefully) example and a benchmark with one and 2 nodes hadoop cluster. the lab is ready i just need to make up an example and code a mapReduce


  
Thanks for reading
- Vasilis

 





Δεν υπάρχουν σχόλια:

Δημοσίευση σχολίου