Hadoop Pig: MapReduce the easy way!

My presentation about Hadoop and Pig during the Fosdem Datadevroom 2011.

Published in: Technology

  1. Hadoop Pig: MapReduce the easy way.
     Nathan Bijnens, http://nathan.gs, @nathan_gs
  2. We live in a world of data.
  3. ● Data analysis becomes more and more important
     ● Increasing complexity of analysis
     ● Meanwhile the data we analyze grows big, fast!
     Source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
  4. Hadoop: Intro
     Hadoop is an open source Java framework aimed at data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data.
  5. Hadoop: Intro
     Hadoop was inspired by Google's MapReduce and Google File System.
     http://labs.google.com/papers/mapreduce.html
  6. Hadoop: HDFS
     HDFS is a distributed, scalable filesystem designed to store large files. In combination with the Hadoop JobTracker it provides data locality. It automatically replicates every block to 3 data nodes; preferably, 2 copies are stored on two data nodes within the same rack and one in another rack.
  7. Hadoop: HDFS
     ● NameNode
       ● Keeps track of what is stored where
       ● In memory
       ● Single Point of Failure
     ● DataNodes
  8. Hadoop: HDFS
     Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
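The replica placement rule stated on the HDFS slides (3 copies, 2 on different data nodes in one rack, 1 in another rack) can be illustrated with a toy model. This is a sketch, not HDFS code: `place_replicas`, the rack names, and the data node names are all hypothetical, and the deterministic "first rack that fits" choice is an assumption made only to keep the example simple.

```python
def place_replicas(racks):
    """Choose 3 (rack, data node) pairs for one block.

    `racks` maps a rack name to its list of data node names (hypothetical layout).
    Two copies go to two different data nodes in one rack, the third to another rack.
    """
    # Racks able to hold two copies on two distinct data nodes.
    two_copy_racks = sorted(name for name, nodes in racks.items() if len(nodes) >= 2)
    for primary in two_copy_racks:
        # Any other rack with at least one data node can take the third copy.
        others = sorted(
            name for name, nodes in racks.items() if name != primary and nodes
        )
        if others:
            secondary = others[0]
            return [
                (primary, racks[primary][0]),
                (primary, racks[primary][1]),
                (secondary, racks[secondary][0]),
            ]
    raise ValueError("need one rack with two data nodes and a second non-empty rack")
```

With `{"rack-a": ["dn1", "dn2", "dn3"], "rack-b": ["dn4"]}` the block lands on `dn1` and `dn2` in `rack-a` and on `dn4` in `rack-b`, so losing either rack still leaves at least one copy.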
  9. MapReduce
     MapReduce works by breaking processing into two phases, map and reduce, each specified as a function.
  10. MapReduce
      ● Input
      ● Map
      ● Shuffle
      ● Reduce
      ● Output
      Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
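The Input, Map, Shuffle, Reduce, Output flow can be sketched as a single-process word count in Python. This is only an illustration of the data flow, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are made up for this example, and a real cluster runs each phase in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: bring all values that share a key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `run_job(["pig latin", "pig"])` counts `pig` twice and `latin` once.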
  11. Use Cases: Who & how it's used
      MassiveMedia / Netlog
      ● Cases
        ● Traffic analysis
        ● User actions
        ● ...
      ● On a 7 node cluster.
  12. Use Cases: Who & how it's used
      Yahoo!
      ● Cases
        ● Ad Systems
        ● Web Search
        ● ...
      ● More than 36000 nodes!
      Source: http://wiki.apache.org/hadoop/PoweredBy
  13. Use Cases: When not to use
      SETI@home
      ● Highly CPU oriented
      ● Data locality is unimportant!
  14. Hadoop Pig: Intro
      Pig is a high-level data flow language.
  15. Hadoop Pig: 3 components
      Pig Latin, Grunt, PigServer
  16. Hadoop Pig
      -- Assumes comma-separated input; PigStorage() alone splits on tabs.
      data = LOAD 'employee.csv' USING PigStorage(',') AS (
          first_name:chararray,
          last_name:chararray,
          age:int,
          wage:float,
          department:chararray
      );
      grouped_by_department = GROUP data BY department;
      -- Inside a group, fields of the bag are addressed as data.<field>.
      total_wage_by_department = FOREACH grouped_by_department GENERATE
          group AS department,
          COUNT(data) AS employee_count,
          SUM(data.wage) AS total_wage;
      total_ordered = ORDER total_wage_by_department BY total_wage;
      total_limited = LIMIT total_ordered 10;
      DUMP total_limited;
  17. -- Assumes comma-separated input; PigStorage() alone splits on tabs.
      books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
          book_id:int,
          book_name:chararray,
          author_name:chararray
      );
      book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
          book_id:int,
          price:float,
          country:chararray
      );
      -- Optional filter, left disabled; Pig matches strings with MATCHES and a regex.
      -- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
      data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
      grouped_by_book = GROUP data BY books::book_name;
      total_sales_by_book = FOREACH grouped_by_book GENERATE
          group AS book,
          COUNT(data) AS sales_volume,
          SUM(data.book_sales::price) AS total_sales;
      STORE total_sales_by_book INTO 'book_sale_results';
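To make the data flow of the book sales script concrete, here is a rough in-memory equivalent in Python. It is a sketch under assumptions: `total_sales_by_book` is a hypothetical helper, rows are plain tuples in the column order of the two `LOAD` statements, and `book_id` is assumed to be unique within `books`. Pig would run the same join and grouping as MapReduce jobs on the cluster.

```python
from collections import defaultdict


def total_sales_by_book(books, book_sales):
    """Join books with book_sales on book_id, then count and sum per book name.

    books:      iterable of (book_id, book_name, author_name)
    book_sales: iterable of (book_id, price, country)
    Returns {book_name: (sales_volume, total_sales)}.
    """
    # Lookup table standing in for the books side of the JOIN
    # (assumes each book_id appears once in books).
    names = {book_id: book_name for book_id, book_name, _author in books}

    totals = defaultdict(lambda: [0, 0.0])
    for book_id, price, _country in book_sales:
        if book_id in names:  # inner join: sales of unknown books are dropped
            entry = totals[names[book_id]]  # GROUP ... BY book_name
            entry[0] += 1       # COUNT
            entry[1] += price   # SUM of price
    return {name: (count, total) for name, (count, total) in totals.items()}
```

The Pig version expresses the same steps declaratively and leaves parallelism (`PARALLEL 12`) to the framework, which is the point of the "easy way" in the title.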
  18. UDF
      ● Custom Load and Store classes.
        ● HBase
        ● ProtocolBuffers
        ● CombinedLog
      ● Custom extraction, e.g. date, ...
      Take a look at the PiggyBank.
  19. Some alternatives
      ● Hive
      ● Streaming
      ● Native Java MapReduce
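For comparison with the Streaming alternative: Hadoop Streaming runs any executable as mapper or reducer, feeding records on standard input and reading tab-separated key/value lines from standard output, with the reducer input sorted by key. The sketch below recomputes the total wage per department from the employee script that way. The function names and the comma-separated input layout are assumptions for this example, and the wiring to `sys.stdin` and the job submission command are left out.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'department<TAB>wage' for each comma-separated employee record."""
    for line in lines:
        # Column order follows the employee schema in the Pig script (assumed CSV).
        _first, _last, _age, wage, department = line.rstrip("\n").split(",")
        yield f"{department}\t{wage}"


def reducer(lines):
    """Sum wages per department; lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for department, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(wage) for _department, wage in group)
        yield f"{department}\t{total}"


# In a real Streaming script each phase would loop over sys.stdin and print
# its results, e.g.:  for out in mapper(sys.stdin): print(out)
```

Even this small aggregation needs two hand-written functions plus job setup, whereas the Pig script covers grouping, ordering, and limiting in a handful of statements.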
  20. Questions?
  21. Thank you for listening!