● Data analysis is becoming more and more important
● Increasing complexity of analysis
● Meanwhile, the data we analyze is growing big, fast!
s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
Hadoop: Intro
Hadoop is an open-source Java framework aimed at data-intensive distributed applications.
It enables applications to work with thousands of nodes and petabytes of data.
Hadoop: Intro
Hadoop was inspired by Google's MapReduce and the Google File System.
http://labs.google.com/papers/mapreduce.html
Hadoop: HDFS
HDFS is a distributed, scalable filesystem designed to store large files.
In combination with the Hadoop JobTracker it provides data locality.
By default it automatically replicates every block to 3 DataNodes, preferably with 2 copies on two different DataNodes within the same rack and 1 copy in another rack.
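The placement rule above can be sketched in a few lines of Python. This is only an illustration of the "two copies in one rack, one copy in another" idea, not HDFS's actual placement code; the function name `place_replicas` and the rack/node names are made up for the example.

```python
import random


def place_replicas(nodes_by_rack, rng=None):
    """Pick 3 nodes for one block: 2 in the same rack, 1 in another rack.

    nodes_by_rack is a hypothetical mapping of rack name -> list of node names.
    """
    rng = rng or random.Random()

    # Only racks with at least two nodes can hold the two same-rack copies.
    local_candidates = [r for r, nodes in nodes_by_rack.items() if len(nodes) >= 2]
    if not local_candidates:
        raise ValueError("no rack has two or more nodes")
    local_rack = rng.choice(local_candidates)

    # The third copy goes to any non-empty rack other than the chosen one.
    remote_racks = [r for r, nodes in nodes_by_rack.items() if r != local_rack and nodes]
    if not remote_racks:
        raise ValueError("no second rack available for the off-rack copy")
    remote_rack = rng.choice(remote_racks)

    same_rack_nodes = rng.sample(nodes_by_rack[local_rack], 2)
    remote_node = rng.choice(nodes_by_rack[remote_rack])
    return same_rack_nodes + [remote_node]


if __name__ == "__main__":
    cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
    print(place_replicas(cluster))
```

Losing a single node or a whole rack still leaves at least one copy of the block readable, which is the point of spreading the copies this way.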
Hadoop: HDFS
● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single point of failure
● DataNodes
Hadoop: HDFS
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
MapReduce
MapReduce works by breaking processing into two phases: a map phase and a reduce phase, each defined by a function.
MapReduce
● Input
● Map
● Shuffle
● Reduce
● Output
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
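The five steps above can be mimicked in plain Python with a word count. This is a single-process sketch of the idea only; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["big data", "Big clusters"]))
```

On a real cluster the map calls run in parallel next to the data, and the shuffle moves the pairs over the network so that each reducer sees all values for its keys.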
Use Cases: Who & how it's used
MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7-node cluster.
Use Cases: Who & how it's used
Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36,000 nodes!
s: http://wiki.apache.org/hadoop/PoweredBy
Use Cases: When not to use
SETI@home
● Highly CPU-oriented
● Data locality is unimportant!
Hadoop Pig: Intro
Pig is a high-level data flow language.
Hadoop Pig: 3 components
● Pig Latin (the language)
● Grunt (the interactive shell)
● PigServer (the Java class for running Pig from Java code)
Hadoop Pig
-- PigStorage() splits fields on tabs by default; pass a delimiter,
-- e.g. PigStorage(','), for comma-separated files.
data = LOAD 'employee.csv' USING PigStorage() AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);
grouped_by_department = GROUP data BY department;
-- Inside a group, fields of the bag are addressed as data.<field>.
total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
DUMP total_limited;
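For readers new to Pig, here is roughly what the script above computes, written as plain Python over in-memory rows. It is a sketch for comparison only: the function name `total_wage_by_department` and the dict-per-row layout are assumptions of the example, and nothing here runs on Hadoop.

```python
from collections import defaultdict


def total_wage_by_department(rows, limit=10):
    """Group rows by department, count and sum wages, order by total, keep `limit`."""
    # GROUP data BY department
    groups = defaultdict(list)
    for row in rows:
        groups[row["department"]].append(row)

    # FOREACH ... GENERATE group, COUNT(data), SUM(data.wage)
    totals = [
        (department, len(members), sum(member["wage"] for member in members))
        for department, members in groups.items()
    ]

    # ORDER ... BY total_wage (ascending, like the script), then LIMIT
    totals.sort(key=lambda item: item[2])
    return totals[:limit]


if __name__ == "__main__":
    # Made-up sample rows, only the two fields the query needs.
    sample = [
        {"department": "sales", "wage": 2000.0},
        {"department": "sales", "wage": 1500.0},
        {"department": "it", "wage": 3000.0},
    ]
    for line in total_wage_by_department(sample):
        print(line)
```

The Pig version stays about as short, but Pig turns each step into MapReduce jobs, so it keeps working when the input no longer fits on one machine.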
books = LOAD 'books.csv.bz2' USING PigStorage() AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage() AS (
    book_id:int,
    price:float,
    country:chararray
);
-- Pig filters strings with MATCHES and a regular expression:
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.price) AS total_sales;
STORE total_sales_by_book INTO 'book_sale_results';
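The join-then-aggregate script above can be pictured as a small hash join in Python. This is a sketch under a few assumptions: rows are plain dicts, `book_id` is unique within the books input, and the function name `sales_by_book` is made up. Pig itself distributes the join over the cluster (12 reducers in the script).

```python
from collections import defaultdict


def sales_by_book(books, book_sales):
    """Inner-join sales to books on book_id, then count and sum per book name."""
    # Build side of the join: look up the book name by id.
    name_by_id = {book["book_id"]: book["book_name"] for book in books}

    # Probe side: walk the sales and aggregate per book name.
    totals = defaultdict(lambda: [0, 0.0])
    for sale in book_sales:
        name = name_by_id.get(sale["book_id"])
        if name is None:
            continue  # inner join: a sale without a matching book is dropped
        totals[name][0] += 1
        totals[name][1] += sale["price"]

    return {name: (count, total) for name, (count, total) in totals.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration.
    books = [{"book_id": 1, "book_name": "Snow", "author_name": "Orhan Pamuk"}]
    sales = [
        {"book_id": 1, "price": 10.0, "country": "BE"},
        {"book_id": 1, "price": 12.5, "country": "TR"},
        {"book_id": 3, "price": 5.0, "country": "BE"},
    ]
    print(sales_by_book(books, sales))
```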
UDF
● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...
Take a look at the PiggyBank.
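To give a feel for the "custom extraction" case, here is the kind of logic such a function would wrap: pulling the date out of a combined-log timestamp. It is shown as a plain Python function for brevity; wiring it into Pig as a UDF is not shown, and the function name `extract_date` and the timestamp layout are assumptions of the example.

```python
from datetime import datetime


def extract_date(timestamp):
    """Return the date (YYYY-MM-DD) of a timestamp like '10/Oct/2000:13:55:36 -0700'.

    Returns None for missing or malformed input instead of failing,
    so one bad record does not stop a whole job.
    """
    if not timestamp:
        return None
    try:
        parsed = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return parsed.date().isoformat()


if __name__ == "__main__":
    print(extract_date("10/Oct/2000:13:55:36 -0700"))
    print(extract_date("not a timestamp"))
```

Returning None on bad input is a design choice for the sketch: on large, messy logs it is usually preferable to filter out the unparsable rows afterwards than to abort.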
Some alternatives
● Hive
● Streaming
● Native Java MapReduce