Hadoop Pig: MapReduce the easy way!

Nathan Bijnens
http://nathan.gs
@nathan_gs

● Data analysis is becoming more and more important
● Increasing complexity of analysis
● Meanwhile, the data we analyze grows big, fast!

Source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron

Hadoop: Intro

Hadoop is an open-source Java framework aimed
at data-intensive distributed applications.

It enables applications to work with thousands of
nodes and petabytes of data.

Hadoop: Intro

Hadoop was inspired by Google's MapReduce
and Google File System papers.

http://labs.google.com/papers/mapreduce.html

Hadoop: HDFS

HDFS is a distributed, scalable filesystem
designed to store large files.

In combination with the Hadoop JobTracker, it
provides data locality: tasks are scheduled on or
near the nodes that already hold their data.

By default, it automatically replicates each block to
3 DataNodes, preferably with 2 copies on DataNodes
within the same rack and one in another rack.
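The placement rule described above can be sketched in a few lines of Python. This is only an illustration of "two copies in one rack, one in another", not HDFS's actual placement code; the function `place_replicas` and the `nodes_by_rack` mapping are hypothetical names made up for this example.

```python
import random


def place_replicas(nodes_by_rack, seed=None):
    """Pick 3 DataNodes for one block: two on one rack, one on another.

    nodes_by_rack is a hypothetical mapping of rack name -> list of node names.
    Returns a list of (rack, node) pairs.
    """
    rng = random.Random(seed)
    # Racks that can hold two distinct copies of the block.
    two_copy_racks = sorted(r for r, n in nodes_by_rack.items() if len(n) >= 2)
    if not two_copy_racks:
        raise ValueError("need at least one rack with two nodes")
    shared_rack = rng.choice(two_copy_racks)
    # Any other non-empty rack can hold the third copy.
    other_racks = sorted(
        r for r, n in nodes_by_rack.items() if r != shared_rack and n
    )
    if not other_racks:
        raise ValueError("need a second rack for the third copy")
    other_rack = rng.choice(other_racks)

    replicas = [(shared_rack, n) for n in rng.sample(nodes_by_rack[shared_rack], 2)]
    replicas.append((other_rack, rng.choice(nodes_by_rack[other_rack])))
    return replicas
```

Losing a single node, or even a whole rack, still leaves at least one copy of the block readable, which is the point of spreading the copies this way.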

Hadoop: HDFS

● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes

Hadoop: HDFS

Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

MapReduce

MapReduce works by breaking
processing into two phases, map and
reduce, each defined by a function.

MapReduce

● Input
● Map
● Shuffle
● Reduce
● Output
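The five steps above can be sketched as a single-process word count in Python. This is a toy model of the data flow only, not how Hadoop runs jobs; the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are invented for this example.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))
```

On a real cluster the map and reduce calls run in parallel on many nodes, and the shuffle moves data between them over the network.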

Source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Use Cases: Who & how it's used

MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7-node cluster.

Use Cases: Who & how it's used

Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36,000 nodes!

Source: http://wiki.apache.org/hadoop/PoweredBy

Use Cases: When not to use

SETI@home
● Highly CPU-oriented
● Data locality is unimportant!

Hadoop Pig: Intro

Pig is a high-level data flow language.

Hadoop Pig: 3 components

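Pig Latin: the data flow language itself

Grunt: the interactive shell

PigServer: the Java class for running Pig from your own code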

Hadoop Pig
-- Load the CSV; PigStorage needs an explicit ',' (its default delimiter is tab).
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

grouped_by_department = GROUP data BY department;

-- Inside the FOREACH, 'data' is the bag of rows for one department.
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

-- Ascending order; append DESC to list the highest totals first.
total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

DUMP total_limited;
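To make clear what the script above computes, here is a rough single-machine equivalent in Python. It is only a sketch: it assumes a comma-separated file with the same five columns, reads it from a string rather than from HDFS, and the function name `wage_by_department` is made up for this example.

```python
import csv
import io
from collections import defaultdict


def wage_by_department(csv_text, limit=10):
    """Group rows by department, then count employees and sum wages."""
    totals = defaultdict(lambda: [0, 0.0])  # department -> [count, wage sum]
    for _first, _last, _age, wage, department in csv.reader(io.StringIO(csv_text)):
        totals[department][0] += 1
        totals[department][1] += float(wage)

    rows = [(dept, count, total) for dept, (count, total) in totals.items()]
    rows.sort(key=lambda row: row[2])  # ORDER ... BY total_wage (ascending)
    return rows[:limit]                # LIMIT
```

The Pig version expresses the same steps, but Pig compiles them into MapReduce jobs, so the same script keeps working when the input no longer fits on one machine.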

-- Pig reads .bz2 input directly; ',' is set explicitly because the default is tab.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);

book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    price:float,
    country:chararray
);

-- Optional filter (Pig has no LIKE; MATCHES takes a Java regular expression):
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');

-- JOIN uses BY, not ON; PARALLEL sets the number of reducers.
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;

grouped_by_book = GROUP data BY books::book_name;

total_sales_by_book =
    FOREACH grouped_by_book
    GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.price) AS total_sales;

STORE total_sales_by_book INTO 'book_sale_results';
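The JOIN followed by GROUP in this script can be pictured as a hash join plus an aggregation. The Python below is a small in-memory sketch of that idea, not what Pig generates; `sales_by_book` and the tuple layouts are assumptions chosen to mirror the two schemas.

```python
from collections import defaultdict


def sales_by_book(books, book_sales):
    """Inner-join sales to books on book_id, then aggregate per book name.

    books:      iterable of (book_id, book_name, author_name)
    book_sales: iterable of (book_id, price, country)
    Returns {book_name: (sales_volume, total_sales)}.
    """
    # Build side of the hash join: book_id -> book_name.
    name_by_id = {book_id: name for book_id, name, _author in books}

    totals = defaultdict(lambda: [0, 0.0])
    for book_id, price, _country in book_sales:
        if book_id in name_by_id:  # inner join drops sales with no matching book
            entry = totals[name_by_id[book_id]]
            entry[0] += 1
            entry[1] += price
    return {name: (count, total) for name, (count, total) in totals.items()}
```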

UDF

● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction
  e.g. date, ...

Take a look at the PiggyBank.
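As an illustration of the "custom extraction" case, the Python function below pulls the date out of a CombinedLog-style timestamp. It shows only the extraction logic: wiring it into Pig as a UDF (for example as a Java class extending `EvalFunc`) is not shown, and the function name, the `MONTHS` table and the sample timestamp format are assumptions for this sketch.

```python
# Month abbreviations as they appear in CombinedLog timestamps.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def extract_date(timestamp):
    """Turn '10/Oct/2000:13:55:36 -0700' into '2000-10-10'.

    Returns None for missing or malformed input, so one bad log line
    does not abort a whole job.
    """
    if timestamp is None:
        return None
    try:
        # Keep only the 'dd/Mon/yyyy' part before the first colon.
        day, month, year = timestamp.split(":", 1)[0].split("/")
        return "%04d-%02d-%02d" % (int(year), MONTHS[month], int(day))
    except (ValueError, KeyError):
        return None
```

Returning a null value instead of raising is a common choice for extraction functions, because log data almost always contains a few malformed records.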

Some alternatives

● Hive
● Streaming
● Native Java MapReduce
