
Hadoop Pig: MapReduce the easy way!



My presentation about Hadoop and Pig during the Fosdem Datadevroom 2011.

Published in: Technology


  1. Hadoop Pig: MapReduce the easy way. Nathan Bijnens
  2. We live in a world of data.
  3. ● Data analysis becomes more and more important
     ● Increasing complexity of analysis
     ● Meanwhile, the data we analyze grows big, fast!
     Source: pallotron
  4. Hadoop: Intro. Hadoop is an open source Java framework aimed at data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data.
  5. Hadoop: Intro. Hadoop was inspired by Google's MapReduce and the Google File System.
  6. Hadoop: HDFS. HDFS is a distributed, scalable filesystem designed to store large files. In combination with the Hadoop JobTracker it provides data locality. By default it automatically replicates every block to 3 data nodes, preferably with two copies on two data nodes within the same rack and one copy in another rack.
  7. Hadoop: HDFS
     ● NameNode
       ● Keeps track of what is stored where
       ● In memory
       ● Single Point of Failure
     ● DataNodes
  8. Hadoop: HDFS. Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
  9. MapReduce. MapReduce works by breaking processing into two phases: a map phase and a reduce phase.
  10. MapReduce
      ● Input
      ● Map
      ● Shuffle
      ● Reduce
      ● Output
      Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
  11. Use Cases: Who & how it's used. MassiveMedia / Netlog
      ● Cases
        ● Traffic analysis
        ● User actions
        ● ...
      ● On a 7-node cluster.
  12. Use Cases: Who & how it's used. Yahoo!
      ● Cases
        ● Ad Systems
        ● Web Search
        ● ...
      ● More than 36000 nodes!
  13. Use Cases: When not to use. SETI@home
      ● Highly CPU oriented
      ● Data locality is unimportant!
  14. Hadoop Pig: Intro. Pig is a high-level data flow language.
  15. Hadoop Pig: 3 components: Pig Latin, Grunt, PigServer
  16. Hadoop Pig
      -- PigStorage() splits fields on tabs by default; use PigStorage(',') for comma-separated input.
      data = LOAD 'employee.csv' USING PigStorage() AS (
          first_name:chararray,
          last_name:chararray,
          age:int,
          wage:float,
          department:chararray
      );
      -- One group (a bag named "data") per department.
      grouped_by_department = GROUP data BY department;
      total_wage_by_department = FOREACH grouped_by_department GENERATE
          group AS department,
          COUNT(data) AS employee_count,
          SUM(data.wage) AS total_wage;
      total_ordered = ORDER total_wage_by_department BY total_wage;
      total_limited = LIMIT total_ordered 10;
      DUMP total_limited;
  17. books = LOAD 'books.csv.bz2' USING PigStorage() AS (
          book_id:int,
          book_name:chararray,
          author_name:chararray
      );
      book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage() AS (
          book_id:int,
          price:float,
          country:chararray
      );
      -- Optional filter, left disabled (Pig matches patterns with MATCHES and a regular expression):
      -- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
      data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
      grouped_by_book = GROUP data BY books::book_name;
      total_sales_by_book = FOREACH grouped_by_book GENERATE
          group AS book,
          COUNT(data) AS sales_volume,
          SUM(data.book_sales::price) AS total_sales;
      STORE total_sales_by_book INTO 'book_sale_results';
  18. UDF
      ● Custom Load and Store classes
        ● HBase
        ● ProtocolBuffers
        ● CombinedLog
      ● Custom extraction, e.g. date, ...
      Take a look at the PiggyBank.
  19. Some alternatives
      ● Hive
      ● Streaming
      ● Native Java MapReduce
  20. Questions?
  21. Thank you for listening!
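
Slides 9 and 10 describe MapReduce as a pipeline of input, map, shuffle, reduce and output. As a rough illustration only, the following single-process Python sketch walks through those phases for the per-department wage total that slide 16 expresses in Pig Latin. It is not how Hadoop or Pig is implemented: the sample rows and the function names (map_phase, shuffle_phase, reduce_phase, run_job) are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows: (first_name, last_name, age, wage, department).
EMPLOYEES = [
    ("Ann", "Peeters", 34, 3000.0, "engineering"),
    ("Bob", "Janssens", 45, 2500.0, "sales"),
    ("Cid", "Maes", 29, 3500.0, "engineering"),
]


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    _first, _last, _age, wage, department = record
    yield department, wage


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result row."""
    return key, len(values), sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence, in a single process."""
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle_phase(pairs)
    return sorted(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    for department, employee_count, total_wage in run_job(EMPLOYEES):
        print(department, employee_count, total_wage)
```

On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows how the data changes shape between phases.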
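
Slide 6 says that HDFS keeps 3 copies of each block, preferably two on data nodes in the same rack and one in another rack. The Python sketch below mimics only that stated preference, assuming a made-up cluster layout. CLUSTER and place_replicas are hypothetical names for this example, and the real HDFS placement policy takes more factors into account than this.

```python
import random

# Hypothetical cluster layout: rack name -> data nodes in that rack.
CLUSTER = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
}


def place_replicas(cluster, rng=random):
    """Pick 3 (rack, node) pairs: two nodes in one rack, one in another rack."""
    # Racks able to hold the two same-rack copies on distinct nodes.
    candidates = [rack for rack, nodes in cluster.items() if len(nodes) >= 2]
    if not candidates:
        raise ValueError("no rack has two or more data nodes")
    shared_rack = rng.choice(candidates)

    # Any other non-empty rack can hold the remaining copy.
    others = [rack for rack, nodes in cluster.items()
              if rack != shared_rack and nodes]
    if not others:
        raise ValueError("no second rack available for the remote copy")
    other_rack = rng.choice(others)

    same_rack_nodes = rng.sample(cluster[shared_rack], 2)
    remote_node = rng.choice(cluster[other_rack])
    return ([(shared_rack, node) for node in same_rack_nodes]
            + [(other_rack, remote_node)])


if __name__ == "__main__":
    print(place_replicas(CLUSTER))
```

Spreading copies over two racks means the block survives the loss of a whole rack, while keeping two copies in one rack limits the cross-rack traffic needed to write it.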