Hadoop Pig:
     MapReduce the easy way.


Nathan Bijnens
http://nathan.gs
@nathan_gs
We live in a world of data.
●   Data analysis becomes
    more and more
    important
●   Increasing complexity
    of analysis
●   Meanwhile the data we
    analyze grows big, fast!


                               s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
Hadoop: Intro


Hadoop is an open-source Java framework aimed
   at data-intensive distributed applications.

It enables applications to work with thousands of
          nodes and petabytes of data.
Hadoop: Intro


Hadoop was inspired by Google's MapReduce
      and the Google File System.

http://labs.google.com/papers/mapreduce.html
Hadoop: HDFS

    HDFS is a distributed, scalable filesystem
         designed to store large files.

 Together with the Hadoop JobTracker, it provides
 data locality: tasks are scheduled on the nodes
 that already hold their data.

 By default it replicates every block to 3 DataNodes,
 preferably with two copies on two DataNodes within
 the same rack and one copy in another rack.
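
The placement rule above (two copies in one rack, one in another) can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not HDFS's actual placement code; the `topology` dictionary, the node names and the `place_replicas` function are all hypothetical.

```python
import random


def place_replicas(topology, rng):
    """Pick 3 DataNodes: two in one rack, one in a different rack.

    topology maps a rack name to the list of DataNode names in that rack
    (a made-up structure for this sketch).
    """
    # Racks that can hold two replicas on two distinct nodes.
    big_racks = sorted(r for r, nodes in topology.items() if len(nodes) >= 2)
    if not big_racks:
        raise ValueError("no rack has at least two DataNodes")
    primary = rng.choice(big_racks)

    # Any other rack with at least one node can take the third replica.
    other_racks = sorted(r for r, nodes in topology.items()
                         if r != primary and nodes)
    if not other_racks:
        raise ValueError("a second, non-empty rack is required")
    remote = rng.choice(other_racks)

    same_rack_nodes = rng.sample(sorted(topology[primary]), 2)
    remote_node = rng.choice(sorted(topology[remote]))
    return [(primary, n) for n in same_rack_nodes] + [(remote, remote_node)]


# Hypothetical two-rack cluster.
topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
replicas = place_replicas(topology, random.Random(0))
```

Losing a single node or a whole rack still leaves at least one copy of the block, which is the point of spreading the replicas this way.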
Hadoop: HDFS


●   NameNode
    ● Keeps track of what is stored where
      ● In memory
    ● Single Point of Failure

●   DataNodes
Hadoop: HDFS




         s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
         http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
MapReduce



MapReduce works by breaking
processing into two phases: a map phase and
a reduce phase, each defined by a function you supply.
MapReduce


●   Input
●   Map
●   Shuffle
●   Reduce
●   Output


              s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
              http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
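
The five steps listed above can be imitated on a single machine. The sketch below is plain Python, not Hadoop code: it counts words, with `map_phase`, `shuffle`, `reduce_phase` and `run_job` being names invented for this example. On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values)
                for key, values in sorted(grouped.items()))


counts = run_job(["pig eats data", "hadoop stores data"])
```

Here `counts` maps each word to the number of times it occurs across both lines.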
Use Cases: Who & how it's used

MassiveMedia / Netlog
 ● Cases
  ● Traffic analysis

  ● User actions

  ● ...

 ● On a 7 node cluster.
Use Cases: Who & how it's used

            Yahoo!
                   ● Cases
                    ● Ad Systems

                    ● Web Search

                    ● ...

                   ● More than 36000 nodes!




s: http://wiki.apache.org/hadoop/PoweredBy
Use Cases: When not to use



SETI@home
 ●   Highly CPU-oriented
 ●   Data locality is unimportant!
Hadoop Pig: Intro



Pig is a high-level data flow language.
Hadoop Pig: 3 components


            Pig Latin (the language)

             Grunt (the interactive shell)

           PigServer (the Java API for embedding Pig)
Hadoop Pig
-- Load the employee records; PigStorage(',') splits each line on commas
-- (the default delimiter is a tab).
data = LOAD 'employee.csv' USING PigStorage(',') AS (
       first_name:chararray,
       last_name:chararray,
       age:int,
       wage:float,
       department:chararray
    );

-- One group (a bag of employees) per department.
grouped_by_department = GROUP data BY department;

total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

-- Ascending order: the 10 departments with the lowest total wage.
total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

DUMP total_limited;
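
For readers new to Pig, it may help to see what the script above computes when written as ordinary single-machine code. The following Python sketch mirrors the GROUP, FOREACH, ORDER and LIMIT steps; the function name and the sample rows are made up for illustration and are not part of the original deck.

```python
from collections import defaultdict


def total_wage_by_department(rows, limit=10):
    """rows are (first_name, last_name, age, wage, department) tuples."""
    # GROUP data BY department
    wages = defaultdict(list)
    for _first, _last, _age, wage, department in rows:
        wages[department].append(wage)

    # FOREACH ... GENERATE group, COUNT(data), SUM(data.wage)
    totals = [(dept, len(ws), sum(ws)) for dept, ws in wages.items()]

    # ORDER ... BY total_wage (ascending), then LIMIT
    totals.sort(key=lambda row: row[2])
    return totals[:limit]


# Hypothetical sample data.
rows = [
    ("A", "One", 30, 5000.0, "eng"),
    ("B", "Two", 41, 6000.0, "eng"),
    ("C", "Three", 52, 7000.0, "ops"),
]
result = total_wage_by_department(rows)
```

The Pig version expresses the same data flow, but Pig compiles it into MapReduce jobs that run across the cluster.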
-- bzip2-compressed input is read directly; fields are comma-separated.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
       book_id:int,
       book_name:chararray,
       author_name:chararray
    );

book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
       book_id:int,
       price:float,
       country:chararray
    );

-- Optional: keep a single author (MATCHES takes a regular expression).
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');

-- Inner join on book_id, using 12 reducers.
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;

grouped_by_book = GROUP data BY books::book_name;

total_sales_by_book =
    FOREACH grouped_by_book
    GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.book_sales::price) AS total_sales;

STORE total_sales_by_book INTO 'book_sale_results';
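
The join-then-group flow above can also be written as a small in-memory Python sketch. It assumes each `book_id` appears only once in `books` and uses invented sample rows; `sales_by_book` is a name made up for this example.

```python
from collections import defaultdict


def sales_by_book(books, book_sales):
    """books: (book_id, book_name, author_name) tuples;
    book_sales: (book_id, price, country) tuples."""
    # Build a lookup table for the join key (assumes unique book_id).
    name_by_id = {book_id: name for book_id, name, _author in books}

    totals = defaultdict(lambda: [0, 0.0])
    for book_id, price, _country in book_sales:
        name = name_by_id.get(book_id)
        if name is None:
            continue  # inner join: skip sales without a matching book
        totals[name][0] += 1      # COUNT(data)
        totals[name][1] += price  # SUM of the sale prices
    return {name: (count, total) for name, (count, total) in totals.items()}


# Hypothetical sample data.
books = [(1, "Book A", "Author X"), (2, "Book B", "Author Y")]
book_sales = [(1, 10.0, "BE"), (1, 12.5, "NL"), (3, 5.0, "BE")]
sales = sales_by_book(books, book_sales)
```

"Book B" has no sales and sale 3 has no matching book, so neither shows up in the result, just as with Pig's default inner JOIN.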
UDF (User Defined Functions)

● Custom Load and Store classes.
 ● HBase

 ● ProtocolBuffers

 ● CombinedLog

● Custom extraction

    e.g. date, ...

    Take a look at the PiggyBank.
Some alternatives


●   Hive
●   Streaming
●   Native Java MapReduce
Questions?
Thank you for listening!
