Hadoop Pig: MapReduce the easy way!


My presentation about Hadoop and Pig at the FOSDEM Data Dev Room 2011.

  1. Hadoop Pig: MapReduce the easy way.
     Nathan Bijnens, http://nathan.gs, @nathan_gs
  2. We live in a world of data.
  3. ● Data analysis becomes more and more important
     ● Increasing complexity of analysis
     ● Meanwhile the data we analyze grows big, fast!
     Source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
  4. Hadoop: Intro
     Hadoop is an open source Java framework aimed at data-intensive distributed applications.
     It enables applications to work with thousands of nodes and petabytes of data.
  5. Hadoop: Intro
     Hadoop was inspired by Google's MapReduce and the Google File System.
     http://labs.google.com/papers/mapreduce.html
  6. Hadoop: HDFS
     HDFS is a distributed, scalable filesystem designed to store large files.
     In combination with the Hadoop JobTracker it provides data locality.
     It automatically replicates every block to 3 data nodes, preferably with 2 copies on two data nodes within the same rack and one copy in another rack.
  7. Hadoop: HDFS
     ● NameNode
       ● Keeps track of what is stored where
       ● In memory
       ● Single Point of Failure
     ● DataNodes
  8. Hadoop: HDFS
     Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
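The replica placement described on the HDFS slides (two copies in one rack, one copy in another) can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not the actual HDFS placement code; the function name `place_replicas` and the rack and node names are made up for the example.

```python
import random


def place_replicas(racks, rng=None):
    """Pick three data nodes: two in one rack, one in a different rack.

    `racks` maps a rack name to the list of data nodes in that rack.
    The layout and names are hypothetical; real HDFS also considers
    where the writer runs and how loaded each node is.
    """
    rng = rng or random.Random()
    # The rack that holds two copies needs at least two nodes.
    pair_racks = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if not pair_racks:
        raise ValueError("no rack has two data nodes")
    pair_rack = rng.choice(pair_racks)
    # The third copy goes to any non-empty rack other than the first one.
    other_racks = [rack for rack, nodes in racks.items()
                   if rack != pair_rack and nodes]
    if not other_racks:
        raise ValueError("need a second rack for the third replica")
    other_rack = rng.choice(other_racks)
    same_rack_nodes = rng.sample(racks[pair_rack], 2)
    remote_node = rng.choice(racks[other_rack])
    return [(pair_rack, node) for node in same_rack_nodes] + [(other_rack, remote_node)]


cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
placement = place_replicas(cluster, random.Random(0))
```

Losing a single node or even a whole rack still leaves at least one copy of the block, which is the point of spreading the replicas this way.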
  9. MapReduce
     MapReduce works by breaking processing into two phases: a map phase and a reduce phase.
  10. MapReduce
     ● Input
     ● Map
     ● Shuffle
     ● Reduce
     ● Output
     Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
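To make the Input, Map, Shuffle, Reduce, Output flow concrete, here is a minimal single-process sketch of a word count in Python. It only mimics the data flow; it does not use Hadoop, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are invented for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the phases in order on an in-memory list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["we live in a world of data", "data grows big fast"])
print(counts["data"])
```

On a real cluster the map calls run in parallel close to the data, and the shuffle moves each key's values across the network to the node that reduces them.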
  11. Use Cases: Who & how it's used
     MassiveMedia / Netlog
     ● Cases
       ● Traffic analysis
       ● User actions
       ● ...
     ● On a 7-node cluster.
  12. Use Cases: Who & how it's used
     Yahoo!
     ● Cases
       ● Ad Systems
       ● Web Search
       ● ...
     ● More than 36000 nodes!
     Source: http://wiki.apache.org/hadoop/PoweredBy
  13. Use Cases: When not to use
     SETI@home
     ● Highly CPU oriented
     ● Data locality is unimportant!
  14. Hadoop Pig: Intro
     Pig is a high level data flow language.
  15. Hadoop Pig: 3 components
     ● Pig Latin
     ● Grunt
     ● PigServer
  16. Hadoop Pig
      -- PigStorage() splits fields on tabs by default; use PigStorage(',') for comma-separated input.
      data = LOAD 'employee.csv' USING PigStorage() AS (
          first_name:chararray,
          last_name:chararray,
          age:int,
          wage:float,
          department:chararray
      );
      -- One group (a bag of employee rows) per department.
      grouped_by_department = GROUP data BY department;
      total_wage_by_department = FOREACH grouped_by_department GENERATE
          group AS department,
          COUNT(data) AS employee_count,
          SUM(data.wage) AS total_wage;
      total_ordered = ORDER total_wage_by_department BY total_wage;
      total_limited = LIMIT total_ordered 10;
      DUMP total_limited;
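For readers new to Pig, it may help to see what the employee script computes, written as ordinary Python over an in-memory list. This is a rough equivalent for illustration only: the sample rows and the function name `total_wage_by_department` are made up, and Pig would run the same steps as MapReduce jobs across the cluster.

```python
from collections import defaultdict

# Made-up sample rows: (first_name, last_name, age, wage, department).
employees = [
    ("Ann", "Lee", 34, 3000.0, "engineering"),
    ("Bob", "Ray", 45, 2500.0, "sales"),
    ("Cid", "Fox", 29, 3500.0, "engineering"),
]


def total_wage_by_department(rows, limit=10):
    # GROUP data BY department
    grouped = defaultdict(list)
    for _first_name, _last_name, _age, wage, department in rows:
        grouped[department].append(wage)
    # FOREACH ... GENERATE group, COUNT(data), SUM(data.wage)
    totals = [(dept, len(wages), sum(wages)) for dept, wages in grouped.items()]
    # ORDER ... BY total_wage (ascending, which is also Pig's default)
    totals.sort(key=lambda row: row[2])
    # LIMIT ... 10
    return totals[:limit]


result = total_wage_by_department(employees)
print(result)
```

Each Python step lines up with one Pig statement, which is why Pig is described as a data flow language: the script names intermediate relations rather than spelling out map and reduce functions.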
  17. books = LOAD 'books.csv.bz2' USING PigStorage() AS (
          book_id:int,
          book_name:chararray,
          author_name:chararray
      );
      book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage() AS (
          book_id:int,
          price:float,
          country:chararray
      );
      -- Optional: keep only books by Pamuk (MATCHES takes a regular expression).
      -- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
      data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
      grouped_by_book = GROUP data BY books::book_name;
      total_sales_by_book = FOREACH grouped_by_book GENERATE
          group AS book,
          COUNT(data) AS sales_volume,
          SUM(data.book_sales::price) AS total_sales;
      STORE total_sales_by_book INTO 'book_sale_results';
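One common way to express a join like the one in the book sales script as map and reduce steps is a reduce-side join: the map step tags every row with the relation it came from, the shuffle brings rows with the same `book_id` together, and the reduce step pairs them up. The Python below is a single-process sketch of that idea under those assumptions; the sample data and function names are invented, and it is not a description of how Pig plans this particular script.

```python
from collections import defaultdict

# Made-up sample data matching the schemas on the slide.
books = [(1, "Snow", "Orhan Pamuk"), (2, "Dune", "Frank Herbert")]
book_sales = [(1, 10.0, "BE"), (1, 12.0, "NL"), (2, 8.0, "BE")]


def map_tagged(rows, tag):
    """Map: key every row by book_id and remember which relation it came from."""
    for row in rows:
        yield row[0], (tag, row)


def reduce_join(tagged_rows):
    """Reduce: pair every book row with every sale row that shares the key."""
    left = [row for tag, row in tagged_rows if tag == "books"]
    right = [row for tag, row in tagged_rows if tag == "book_sales"]
    return [l + r for l in left for r in right]


def join_and_total(book_rows, sale_rows):
    # Shuffle: bring all rows with the same book_id together.
    shuffled = defaultdict(list)
    tagged = list(map_tagged(book_rows, "books")) + list(map_tagged(sale_rows, "book_sales"))
    for key, value in tagged:
        shuffled[key].append(value)
    joined = [row for values in shuffled.values() for row in reduce_join(values)]
    # Second grouping step: sales volume and total sales per book name.
    totals = defaultdict(lambda: [0, 0.0])
    for _book_id, book_name, _author, _sale_book_id, price, _country in joined:
        totals[book_name][0] += 1
        totals[book_name][1] += price
    return {name: (count, total) for name, (count, total) in totals.items()}


sales = join_and_total(books, book_sales)
print(sales)
```

The few lines of Pig Latin on the slide replace all of this bookkeeping, and `PARALLEL 12` simply asks for twelve reducers to share the work.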
  18. UDF
     ● Custom Load and Store classes.
       ● HBase
       ● ProtocolBuffers
       ● CombinedLog
     ● Custom extraction, e.g. date, ...
     Take a look at the PiggyBank.
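As an idea of the kind of logic a custom extraction function might wrap, here is a small Python sketch that pulls the date out of a log line in the Apache combined log format. It is plain Python rather than a registered Pig UDF, the function name `extract_date` is hypothetical, and it assumes the timestamp sits in square brackets in the usual `10/Oct/2000:13:55:36 -0700` layout.

```python
import re
from datetime import datetime

# The timestamp is the first bracketed field of a combined log line.
_TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def extract_date(log_line):
    """Return the request date as YYYY-MM-DD, or None if it cannot be parsed."""
    match = _TIMESTAMP.search(log_line)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return parsed.date().isoformat()


line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08"')
print(extract_date(line))
```

Returning `None` for unparseable lines instead of raising keeps one bad record from failing a whole job, which is usually what you want when scanning large log files.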
  19. Some alternatives
     ● Hive
     ● Streaming
     ● Native Java MapReduce
  20. Questions?
  21. Thank you for listening!