Hadoop Pig: MapReduce the easy way!

My presentation about Hadoop and Pig, given in the FOSDEM 2011 Data Devroom.



Transcript

  • 1. Hadoop Pig: MapReduce the easy way. Nathan Bijnens, http://nathan.gs, @nathan_gs
  • 2. We live in a world of data.
  • 3.
    ● Data analysis is becoming more and more important
    ● The analysis itself keeps getting more complex
    ● Meanwhile, the data we analyze is growing big, fast!
    Source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
  • 4. Hadoop: Intro
    Hadoop is an open source Java framework aimed at data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data.
  • 5. Hadoop: Intro
    Hadoop was inspired by Google's MapReduce and the Google File System.
    http://labs.google.com/papers/mapreduce.html
  • 6. Hadoop: HDFS
    HDFS is a distributed, scalable filesystem designed to store large files. In combination with the Hadoop JobTracker it provides data locality. By default it replicates every block to 3 data nodes, preferably with two copies on different data nodes in the same rack and one copy in another rack.
  • 7. Hadoop: HDFS
    ● NameNode
      ● Keeps track of what is stored where
      ● In memory
      ● Single point of failure
    ● DataNodes
  • 8. Hadoop: HDFS
    Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
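The replication rule on slide 6 and the NameNode bookkeeping on slide 7 can be illustrated with a small in-memory sketch. This is not HDFS code: `place_replicas`, `NameNodeSketch`, the rack layout, and the node names are all hypothetical, and the real placement policy takes more into account (for example where the writer runs and how full each node is).

```python
import random


def place_replicas(racks, rng):
    """Pick 3 data nodes for one block: two in one rack, one in another.

    `racks` maps a rack name to the data node names in that rack
    (a made-up layout for illustration).
    """
    # Only racks with at least two nodes can hold the same-rack pair.
    pair_racks = [r for r, nodes in racks.items() if len(nodes) >= 2]
    if not pair_racks:
        raise ValueError("need a rack with at least two data nodes")
    pair_rack = rng.choice(pair_racks)
    other_racks = [r for r, nodes in racks.items() if r != pair_rack and nodes]
    if not other_racks:
        raise ValueError("need a second, non-empty rack")
    other_rack = rng.choice(other_racks)
    return rng.sample(racks[pair_rack], 2) + [rng.choice(racks[other_rack])]


class NameNodeSketch:
    """Keeps the block-to-data-node map in memory, as slide 7 describes."""

    def __init__(self, racks, seed=0):
        self.racks = racks
        self.rng = random.Random(seed)
        self.block_locations = {}  # (filename, block index) -> data nodes

    def add_file(self, filename, num_blocks):
        for index in range(num_blocks):
            key = (filename, index)
            self.block_locations[key] = place_replicas(self.racks, self.rng)

    def locate(self, filename, index):
        return self.block_locations[(filename, index)]
```

Because the whole map lives in one process, losing that process loses the knowledge of where every block is stored, which is why the slide calls the NameNode a single point of failure.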
  • 9. MapReduce
    MapReduce works by breaking the processing into two phases, map and reduce, each driven by a user-supplied function.
  • 10. MapReduce
    ● Input
    ● Map
    ● Shuffle
    ● Reduce
    ● Output
    Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
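The input, map, shuffle, reduce, output flow from slides 9 and 10 can be sketched on a single machine with a word count. This is a minimal illustration in plain Python, not the Hadoop API; the function names are chosen for this example, and a real cluster runs the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all values by key, sorted by key, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values for that key."""
    return [reducer(key, values) for key, values in groups]


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return word, sum(counts)


def run_job(lines):
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle_phase(pairs), count_reducer)
```

Calling `run_job(["big data", "Big pig"])` returns one `(word, count)` tuple per distinct word. Only `word_mapper` and `count_reducer` are specific to the problem; the three phase functions stay the same for any job.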
  • 11. Use Cases: Who uses it & how
    MassiveMedia / Netlog
    ● Cases
      ● Traffic analysis
      ● User actions
      ● ...
    ● On a 7-node cluster.
  • 12. Use Cases: Who uses it & how
    Yahoo!
    ● Cases
      ● Ad systems
      ● Web search
      ● ...
    ● More than 36000 nodes!
    Source: http://wiki.apache.org/hadoop/PoweredBy
  • 13. Use Cases: When not to use
    SETI@home
    ● Highly CPU oriented
    ● Data locality is unimportant!
  • 14. Hadoop Pig: Intro
    Pig is a high-level data flow language.
  • 15. Hadoop Pig: 3 components
    ● Pig Latin
    ● Grunt
    ● PigServer
  • 16. Hadoop Pig
    -- PigStorage() splits fields on tabs by default; pass a delimiter such as ',' for comma-separated files.
    data = LOAD 'employee.csv' USING PigStorage() AS (
        first_name:chararray,
        last_name:chararray,
        age:int,
        wage:float,
        department:chararray
    );
    grouped_by_department = GROUP data BY department;
    total_wage_by_department = FOREACH grouped_by_department GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;
    total_ordered = ORDER total_wage_by_department BY total_wage;
    total_limited = LIMIT total_ordered 10;
    DUMP total_limited;
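To make the data flow of the slide 16 script explicit, here is a rough single-machine equivalent in plain Python. It is only a sketch of what the script computes: the function name and the tuple layout are assumptions for this example, and it ignores null handling and the distributed execution that Pig provides.

```python
from collections import defaultdict


def total_wage_by_department(rows, limit=10):
    """rows: iterable of (first_name, last_name, age, wage, department)."""
    # GROUP data BY department
    wages_by_department = defaultdict(list)
    for _first_name, _last_name, _age, wage, department in rows:
        wages_by_department[department].append(wage)

    # FOREACH ... GENERATE group, COUNT(data), SUM(wage)
    totals = [
        (department, len(wages), sum(wages))
        for department, wages in wages_by_department.items()
    ]

    # ORDER ... BY total_wage (ascending, as in the script)
    totals.sort(key=lambda row: row[2])

    # LIMIT 10
    return totals[:limit]
```

Each Pig statement maps onto one step here, which is the point of a data flow language: the script names the intermediate relations, and Pig turns the steps into MapReduce jobs.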
  • 17.
    books = LOAD 'books.csv.bz2' USING PigStorage() AS (
        book_id:int,
        book_name:chararray,
        author_name:chararray
    );
    book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage() AS (
        book_id:int,
        price:float,
        country:chararray
    );
    -- books = FILTER books BY author_name MATCHES '.*Pamuk.*';
    data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
    grouped_by_book = GROUP data BY books::book_name;
    total_sales_by_book = FOREACH grouped_by_book GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.book_sales::price) AS total_sales;
    STORE total_sales_by_book INTO 'book_sale_results';
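The join on slide 17 followed by the grouping can also be sketched in plain Python. This is an assumption-laden illustration: it uses an in-memory hash join, while Pig chooses its own join strategy and spreads the work over reducers (the `PARALLEL 12` hint). The function name and tuple layouts are made up for this example.

```python
from collections import defaultdict


def total_sales_by_book(books, book_sales):
    """books: (book_id, book_name, author_name) tuples.
    book_sales: (book_id, price, country) tuples.
    Returns {book_name: (sales_volume, total_sales)}.
    """
    # Build side of the join: index book names by book_id.
    names_by_id = defaultdict(list)
    for book_id, book_name, _author_name in books:
        names_by_id[book_id].append(book_name)

    # Probe side: match every sale to its book and aggregate per book name.
    totals = defaultdict(lambda: [0, 0.0])
    for book_id, price, _country in book_sales:
        # Inner join: sales without a matching book are dropped.
        for book_name in names_by_id.get(book_id, []):
            totals[book_name][0] += 1
            totals[book_name][1] += price

    return {name: (volume, total) for name, (volume, total) in totals.items()}
```

As with the Pig script, books that have no sales do not appear in the result, because an inner join only keeps rows present on both sides.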
  • 18. UDF
    ● Custom Load and Store classes
      ● HBase
      ● ProtocolBuffers
      ● CombinedLog
    ● Custom extraction, e.g. date, ...
    Take a look at the PiggyBank.
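As an illustration of the kind of "custom extraction" logic a UDF might wrap, the sketch below pulls the date out of a line in Apache combined log format. It is a stand-alone Python function, not a PiggyBank class, and it is not wired into Pig; the function name and the sample log line are assumptions for this example.

```python
import re
from datetime import datetime

# The combined log format puts the timestamp in square brackets,
# e.g. [10/Oct/2000:13:55:36 -0700].
_TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def extract_date(log_line):
    """Return the request date as 'YYYY-MM-DD', or None if it cannot be parsed."""
    match = _TIMESTAMP.search(log_line)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        # Malformed timestamps yield None instead of failing the whole job.
        return None
    return parsed.date().isoformat()
```

Returning `None` for unparseable input mirrors a common choice in data pipelines: one bad record should not stop the processing of millions of good ones.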
  • 19. Some alternatives
    ● Hive
    ● Streaming
    ● Native Java MapReduce
  • 20. Questions?
  • 21. Thank you for listening!