Hadoop pig

The analytics stack
Hadoop & Pig

Outline of the presentation

 Hadoop
 Motivations. What is it? And high-level concepts
 The Ecosystem. The MapReduce model & framework
and HDFS
 Programming with Hadoop
 Pig
 What is it? Motivations
 Model & components
 Integration with Cassandra
2

Please interrupt and ask questions!

3

Traditional HPC systems

 CPU-intensive computations
 Relatively small amount of data
 Tightly-coupled applications
 Highly concurrent I/O requirements
 Complex message passing paradigms such as MPI,
PVM…
 Developers might need to spend some time
designing for failure
4

Challenges

 Data and storage
 Locality, computation close to the data

 In large-scale systems, nodes fail
 Mean time between failures: 1 node / 3 years, 1000 nodes / 1 day
 Built-in fault-tolerance

 Distributed programming is complex
 Need a simple data-parallel programming model. Users would
structure the application in high-level functions, the system
distributes the data & jobs and handles communications and faults
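
The failure figures quoted above follow from simple arithmetic. The sketch below assumes nodes fail independently and at a constant rate, which is an idealisation added here for illustration and not a claim from the slides.

```python
# Back-of-the-envelope cluster MTBF, assuming independent node failures
# at a constant rate (a simplifying assumption).
DAYS_PER_YEAR = 365

def cluster_mtbf_days(node_mtbf_years, nodes):
    """Expected time between failures anywhere in the cluster, in days."""
    return node_mtbf_years * DAYS_PER_YEAR / nodes

single = cluster_mtbf_days(3, 1)       # one node: 1095 days
cluster = cluster_mtbf_days(3, 1000)   # 1000 nodes: roughly one failure a day
print("1 node: %.0f days, 1000 nodes: %.2f days" % (single, cluster))
```

With these assumptions a 1000-node cluster sees a failure about once a day, which is why fault tolerance has to be built in.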

5

What requirements

 A simple data-parallel programming model, designed for
high scalability and resiliency
 Scalability to large-scale data volumes
 Automated fault-tolerance at application level rather
than relying on high-availability hardware
 Simplified I/O and tasks monitoring
 All based on cost-efficient commodity machines (cheap,
but unreliable), and commodity network

6

Hadoop’s core concepts

 Data spread in advance, persistent (in terms of
locality), and replicated
 No inter-dependencies / shared nothing architecture
 Applications written in two pieces of code
 And developers do not have to worry about the
underlying issues in networking, jobs interdependencies,
scheduling, etc…

7

Where does it come from?

 Hadoop originated from Apache Nutch, an open source
web search engine
 After the publications of the GFS and MapReduce papers,
in 2003 & 2004, the Nutch developers decided to
implement open source versions
 In February 2006, it became Hadoop, with a dedicated
team at Yahoo!
 September 2007 - release 0.14.1
 Last release 1.0.3 out last week
 Used by a large number of companies including Facebook,
LinkedIn, Twitter, Hulu, among many others
8

The model

 A map function processes a key/value pair to generate a set of
intermediate key/value pairs
 Divides the problem into smaller ‘intermediate key/value’ pairs
 The reduce function merges all intermediate values associated with
the same intermediate key
 Run-time system takes care of:
 Partitioning the input data across nodes (blocks/chunks typically of
64 MB to 128 MB)
 Scheduling the data and execution. Maps operate on a single block
 Managing node failures, replication, re-submissions

9

Simple Word Count
# key: offset, value: line
def mapper(key, value):
    for word in value.split():
        output(word, 1)

# key: a word, values: iterator over counts
def reducer(key, values):
    output(key, sum(values))
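
To make the pseudocode concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases. The `run_job` helper is a hypothetical stand-in for the framework, and the line index is used in place of a byte offset; none of this is Hadoop's API.

```python
from collections import defaultdict

def mapper(offset, line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Sum all counts collected for one word."""
    yield word, sum(counts)

def run_job(lines):
    """Toy stand-in for the framework: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):      # line index stands in for the offset
        for key, value in mapper(offset, line):
            groups[key].append(value)          # shuffle: gather values per key
    result = {}
    for key in sorted(groups):                 # keys are visited in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result

word_counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(word_counts)
```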

10

The Combiner

 A combiner is a local aggregation function for repeated keys
produced by the map
 Works for associative and commutative functions like sum, count, max

 Decreases the size of intermediate data / communications

 map-side aggregation for word count:
def combiner(key, values):
    output(key, sum(values))
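
The effect of a combiner can be seen on a single map task. This is a small sketch with made-up input; `map_words` and `combine` are illustrative names, not Hadoop functions.

```python
from collections import Counter

def map_words(line):
    """Plain map output: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation on the map side: sum the counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

raw = map_words("to be or not to be")
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

Fewer pairs leave the map task, so less data crosses the network to the reducers.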

11

Some other basic examples…
 Distributed Grep:
 Map function emits a line if it matches a supplied pattern
 Reduce function is an identity function that copies the supplied intermediate
data to the output
 Count of URL accesses:
 Map function processes logs of web page requests and outputs <URL, 1>
 Reduce function adds together all values for the same URL, emitting <URL, total
count> pairs
 Reverse Web-Link graph:
 Map function outputs <tgt, src> for each link to a tgt in a page named src
 Reduce concatenates the list of all src URLS associated with a given tgt URL and
emits the pair: <tgt, list(src)>
 Inverted Index:
 Map function parses each document, emitting a sequence of <word, doc_ID>
 Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
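
As an illustration of the last example, the inverted index can be sketched in a few lines. The documents are hypothetical, and this version emits each word once per document, which is a simplification of the sequence described above.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    for word in sorted(set(text.lower().split())):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Collect the sorted list of documents that contain the word."""
    return word, sorted(doc_ids)

def build_index(documents):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_document(doc_id, text):
            postings[word].append(emitted_id)
    return dict(reduce_postings(w, ids) for w, ids in postings.items())

docs = {"d1": "hadoop stores data", "d2": "pig runs on hadoop"}  # hypothetical
index = build_index(docs)
print(index["hadoop"])
```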

12

Hadoop Ecosystem

[Figure: Hadoop ecosystem map, core components. From http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/]
13

Hadoop components

 Hadoop consists of two core components
 The MapReduce framework, and
 The Hadoop Distributed File System

 MapReduce layer
 JobTracker
 TaskTrackers
 HDFS layer
 Namenode
 Secondary namenode
 Datanode
Example of a typical physical distribution within a Hadoop cluster
14

HDFS

 Scalable and fault-tolerant. Based on Google’s GFS
 Single namenode stores metadata (file names, block locations, etc.)
 Files split into chunks, replicated across several datanodes
(typically 3+). It is rack-aware
 Optimised for large files, sequential streaming reads, rather than
random access
 Files written once, no append
[Figure: the namenode records that File1 consists of blocks 1 to 4; each block is stored on three of the datanodes]
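
The chunking and replication described on this slide can be sketched as follows. The round-robin placement, the node names and the 64 MB block size are assumptions for illustration; real HDFS placement is rack-aware and is decided by the namenode.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, assumed block size
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes."""
    return math.ceil(file_size / block_size)

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct
    nodes (requires at least `replication` datanodes). Racks are ignored."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)                  # a 200 MB file
layout = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])  # hypothetical nodes
print(blocks, layout)
```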
15

HDFS

 HDFS API / HDFS FS Shell for command line*
> hadoop fs -copyFromLocal local_dir hdfs_dir
> hadoop fs -copyToLocal hdfs_dir local_dir

 Tools
 Flume: collects, aggregates and moves log data from application
servers to HDFS
 Sqoop: HDFS import and export to SQL

*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
16

MapReduce execution

 In Hadoop, a Job (full program) is a set of tasks
 Each task (mapper or reducer) is attempted at least once, or
multiple times if it crashes. Multiple attempts may also occur
in parallel
 Each task runs inside a separate JVM on the tasktracker

 All the class files are assembled into a jar file, which will be
uploaded into HDFS, before notifying the tasktracker

17

MapReduce execution

[Figure: a MapReduce job. The master assigns work to the workers. Map workers read the input files split by split (Split 0 to Split 4) and write intermediate files locally; reduce workers read those files remotely and write the output files]

18

Getting Started…

 Multiple choices - Vanilla Apache version, or one of the
numerous existing distros
 hadoop.apache.org
 www.cloudera.com [A set of VMs is also provided]
 http://www.karmasphere.com/
 …

 Three ways to write jobs in Hadoop:
 Java API
 Hadoop Streaming (for Python, Perl, etc.)
 Pipes API (C++)

19

Word Count in Java
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // The reducer doubles as the combiner because summing is associative
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);

    // args[0]: input path, args[1]: output path
    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
}
20

Word Count in Java – mapper

public class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    // key: offset of the line in the file, value: the line itself
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.collect(new Text(itr.nextToken()), ONE);
        }
    }
}
21

Word Count in Java – reducer

public class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}
22

Getting keys and values

[Figure: getting keys and values. The Input Format divides the input file into input splits; a RecordReader feeds each split to a Mapper as key/value pairs. On the output side, each Reducer writes its output file through a RecordWriter provided by the Output Format]

23

Hadoop Streaming
Mapper.py:
#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word read from standard input
for line in sys.stdin:
    for word in line.split():
        print("%s\t%s" % (word, 1))

Reducer.py:
#!/usr/bin/env python
import sys

# Sum the counts per lower-cased word, then print the totals
counts = {}
for line in sys.stdin:
    word, count = line.split("\t", 1)
    word = word.lower()
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print("%s\t%s" % (word, count))

You can locally test your code on the command line:
$> cat data | python Mapper.py | sort | python Reducer.py
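
The same pipe can be imitated in one process, which is handy for unit tests. This sketch swaps the dictionary-based reducer for `itertools.groupby`, a different technique that relies on the sorted input and only keeps one word's counts in memory at a time.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for each word, as the streaming mapper does."""
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)

def reducer(sorted_lines):
    """Sum counts per word; relies on the input being sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%s" % (word, sum(int(count) for _, count in group))

data = ["b a", "a c a"]                      # stands in for `cat data`
streamed = list(reducer(sorted(mapper(data))))
print(streamed)
```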
24

High-level tools

 MapReduce is fairly low-level: must think about
keys, values, partitioning, etc.
 How to express parallel algorithms by a series of
MapReduce jobs
 Can be hard to capture common job building blocks

 Different use cases require different tools!

25

Pig

 Apache Pig is a platform that raises the level of abstraction for
processing large datasets. Its language, Pig Latin, is a simple
query algebra for expressing data transformations and applying
functions to records

[Figure: Pig turns scripts into MapReduce jobs and submits them to Hadoop / HDFS]

 Started at Yahoo! Research, >60% of Hadoop jobs within
Yahoo! are Pig jobs
26

Motivations

 MapReduce requires a Java programmer
 Solution was to abstract it and create a system for users who are
familiar with scripting languages

 Other than very trivial applications, MapReduce requires
multiple stages, leading to long development cycles
 Rapid prototyping. Increased productivity

 In MapReduce users have to reinvent common functionality
(join, filter, etc.)
 Pig provides them
27

Used for

 Rapid prototyping of algorithms for processing large datasets
 Log analysis
 Ad hoc queries across various large datasets
 Analytics (including through sampling)

 Pig Mix provides a set of performance and scalability
benchmarks. Pig currently takes about 1.1 times as long as the
equivalent hand-written MapReduce code.

28

Using Pig

 Grunt, the Pig shell

 Executing scripts directly

 Embedding Pig in Java (using PigServer, similar to SQL
using JDBC), or Python

 A range of tools including Eclipse plug-ins
 PigPen, Pig Editor…

29

Execution modes

 Pig has two execution types or modes: local mode and
Hadoop mode

 Local
 Pig runs in a single JVM and accesses the local filesystem.
Starting from v0.7 it uses the Hadoop job runner.
 Hadoop mode
 Pig runs on a Hadoop cluster (you need to tell Pig about the
version and point it to your Namenode and Jobtracker)

30

Running Pig

 Pig resides on the user’s machine and can be independent
from the Hadoop cluster
 Pig is written in Java and is portable
 Compiles scripts into MapReduce jobs and submits them to the cluster
 No need to install anything extra on the cluster

Pig client
31

How does it work

 Pig defines a DAG. A step-by-step set of operations, each
performing a transformation
 Pig defines a logical plan for these transformations:

A = LOAD 'file' AS (line);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

• Parses, checks, & optimises
• Plans the execution
• Maps & Reduces
• Passes the jar to Hadoop
• Monitors the progress
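
For readers new to Pig Latin, the steps A to D can be read as ordinary list transformations. This is a rough in-memory analogy, not how Pig executes the plan, and `str.split` only approximates TOKENIZE.

```python
from collections import defaultdict

def load(lines):
    """Step A: one single-field tuple per input line."""
    return [(line,) for line in lines]

def tokenize_flatten(relation):
    """Step B: split each line into words and flatten to one tuple per word."""
    return [(word,) for (line,) in relation for word in line.split()]

def group_by_word(relation):
    """Step C: one (group, bag) pair per distinct word."""
    bags = defaultdict(list)
    for (word,) in relation:
        bags[word].append((word,))
    return sorted(bags.items())

def count(grouped):
    """Step D: the group key and the size of its bag."""
    return [(group, len(bag)) for group, bag in grouped]

A = load(["a rose is a rose"])
D = count(group_by_word(tokenize_flatten(A)))
print(D)
```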

32

Data types & expressions

 Scalar type:
 Int, Long, Float, Double, Chararray, Bytearray
 Complex type representing nested structures:
 Tuple: sequence of fields of any type
 Bag: an unordered collection of tuples
 Map: a set of key-value pairs. Keys must be atoms, values may
be any type
 Expressions:
 used in Pig as a part of a statement; field name, position ($),
arithmetic, conditional, comparison, Boolean, etc.
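
A loose Python analogy may help fix the three complex types. The values are invented, and the mapping to Python containers is only approximate (a Pig bag, for instance, gives no ordering guarantee).

```python
# Rough Python analogues of Pig's complex types (an analogy, not Pig's API).
record = (234, "2012-05-14")                   # tuple: ordered fields of any type
bag = [(234, "2012-05-14"),                    # bag: a collection of tuples,
       (7, "2012-05-15"),                      # duplicates allowed
       (234, "2012-05-14")]
properties = {"city": "Dublin", "visits": 3}   # map: atom keys, values of any type

first_field = record[0]                        # positional reference, like $0
nested = ("alice", bag, properties)            # complex types can nest
print(first_field, len(nested[1]), nested[2]["city"])
```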
33

Functions

 Load / Store
 Data loaders; PigStorage, BinStorage, BinaryStorage,
TextLoader, PigDump
 Evaluation
 Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…
 Filter
 A special type of eval function used by the FILTER operator.
IsEmpty is a built-in function
 Comparison
 Function used in ORDER statement; ASC | DESC

34

Schemas

 Schemas enable you to associate names and types with the
fields in the relation
 Schemas are optional but recommended whenever
possible; type declarations result in better parse-time error
checking and more efficient code execution

 They are defined using the AS keyword with operators
 Schema definition for simple data types:
> records = LOAD 'input/data' AS (id:int, date:chararray);

35

Statements and aliases

 Each statement, defining a data processing operator /
relation, produces a dataset with an alias

grunt> records = LOAD 'input/data' AS (id:int, date:chararray);

 LOAD returns a relation of tuples, whose fields can be referenced by
position or by name

 Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE

36

Filtering data

 Filter is used to work with tuples and rows of data
 Select data you want, or remove the data you are not
interested in

 Filtering early in the processing pipeline minimises the
amount of data flowing through the system, which can
improve efficiency

grunt> filtered_records = FILTER records BY id == 234;

37

Foreach .. Generate

 Foreach .. Generate acts on the columns of every row in a
relation
grunt> ids = FOREACH records GENERATE id;

 Positional reference. This statement has the same output
grunt> ids = FOREACH records GENERATE $0;

 The elements of ‘ids’ however are not named ‘id’ unless
you add ‘AS id’ at the end of your statement
grunt> ids = FOREACH records GENERATE $0 AS id;
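
In Python terms, FILTER keeps rows and FOREACH .. GENERATE projects columns. The sketch below uses invented rows for the `records` relation and a `namedtuple` to mimic the schema; it is an analogy rather than Pig behaviour in every detail.

```python
from collections import namedtuple

Record = namedtuple("Record", ["id", "date"])   # schema (id:int, date:chararray)

records = [Record(234, "2012-05-14"),           # hypothetical rows
           Record(7, "2012-05-15"),
           Record(234, "2012-05-16")]

# FILTER records BY id == 234: keep only the matching rows
filtered_records = [r for r in records if r.id == 234]

# FOREACH records GENERATE id: project a column by name
ids_by_name = [(r.id,) for r in records]

# FOREACH records GENERATE $0: project the same column by position
ids_by_position = [(r[0],) for r in records]

print(filtered_records, ids_by_name == ids_by_position)
```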
38

Grouping and joining

 Group .. by collects the tuples that share a grouping key: the
output has one tuple per key, holding the key (named group) and a
bag of the matching tuples

 Join performs an inner equijoin of two or more relations
based on common field values.

 You can also perform outer joins using keywords left,
right and full

 Cogroup is similar to Group, using multiple relations, and
creates a nested set of output tuples
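
The difference between the three operators is easiest to see on a tiny example. This is a simplified in-memory sketch with hypothetical rows; the helper names are illustrative and the output layout only approximates Pig's.

```python
from collections import defaultdict

users = [("amy", 22), ("bob", 31)]                              # (name, age)
pages = [("amy", "/home"), ("amy", "/faq"), ("cat", "/home")]   # (user, url)

def group(relation, key_index):
    """GROUP: one bag of rows per distinct key."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return dict(bags)

def inner_join(left, left_key, right, right_key):
    """JOIN: inner equijoin, concatenating each pair of matching rows."""
    right_bags = group(right, right_key)
    return [left_row + right_row
            for left_row in left
            for right_row in right_bags.get(left_row[left_key], [])]

def cogroup(left, left_key, right, right_key):
    """COGROUP: per key, one bag from each relation (possibly empty)."""
    lg, rg = group(left, left_key), group(right, right_key)
    return {k: (lg.get(k, []), rg.get(k, [])) for k in sorted(set(lg) | set(rg))}

joined = inner_join(users, 0, pages, 0)
cogrouped = cogroup(users, 0, pages, 0)
print(joined)
print(cogrouped["bob"])
```

Keys without a match on one side disappear from the inner join but survive in the cogroup with an empty bag.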
39

Ordering, combining, splitting…

 Order sorts a relation by one or more fields
 The Limit statement limits the number of results
 Split partitions a relation into two or more relations
 The Sample operator selects a random data sample with
the stated sample size
 The Union operator merges the contents of two or more
relations
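
These operators map onto familiar list operations. The sketch below uses made-up rows and thresholds, and treats the sample size as a per-row probability, which is an assumption about how sampling is applied.

```python
import random

clicks = [("/home", 12), ("/faq", 3), ("/about", 7), ("/blog", 9)]  # hypothetical

# Order by the click count, descending, then Limit to 2 rows
ordered = sorted(clicks, key=lambda row: row[1], reverse=True)
top2 = ordered[:2]

# Split into two relations on a condition (threshold chosen for illustration)
popular = [row for row in clicks if row[1] >= 8]
rare = [row for row in clicks if row[1] < 8]

# Union: merge the contents; duplicates are kept, order is not guaranteed
merged = popular + rare

# Sample: keep each row with probability 0.5
rng = random.Random(0)      # fixed seed only to make the sketch repeatable
sample = [row for row in clicks if rng.random() < 0.5]

print(top2, len(merged), len(sample))
```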

40

Stream

 The Stream operator lets you transform data in a
relation using an external program or script

grunt> C = STREAM A THROUGH `cut -f 2`;
 Extract the second field of A using cut

 The scripts are shipped to the cluster using
grunt> DEFINE script `script.py` SHIP ('script.py');
grunt> D = STREAM C THROUGH script AS (…);
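
A streaming script simply reads tab-delimited rows on standard input and writes rows to standard output. As an illustration, here is a hypothetical Python equivalent of `cut -f 2`, written against file-like objects so it can be tested without a cluster.

```python
import io

def second_field(stream_in, stream_out):
    """Mimic `cut -f 2`: write the second tab-delimited field of each line.
    Like cut without -s, a line with no tab is passed through unchanged."""
    for line in stream_in:
        fields = line.rstrip("\n").split("\t")
        stream_out.write((fields[1] if len(fields) > 1 else fields[0]) + "\n")

src = io.StringIO("a\tb\tc\nd\te\tf\n")   # two sample rows
out = io.StringIO()
second_field(src, out)
print(out.getvalue())
```

In a shipped script the call would be `second_field(sys.stdin, sys.stdout)`.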

41

User defined functions

 Support for user-defined functions (UDFs), and a community sharing them

 UDFs can encapsulate users’ processing logic in filtering,
comparison, evaluation, grouping, or storage
 Filter functions for instance are all subclasses of
FilterFunc, which itself is a subclass of EvalFunc

 PiggyBank: the Pig community sharing their UDFs
 DataFu: Linkedin's collection of Pig UDFs

42

A simple eval UDF example

package myudfs;

import …

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

43

An Example

Let’s find the top 5 most visited pages by users aged 18 to 25.
Input: user data file, and page view data file.

[Figure: data flow. Load Users, then Filter by age; Load Pages; Join on name; Group on url; Count clicks; Order by clicks; Take top 5]
44

A simple script

Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY
             age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group,
             COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;

STORE Top5 INTO 'top5sites';
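
To check what the script should produce, the same pipeline can be traced by hand on a few rows. The data below is invented, and the sketch assumes user names are unique so that a set can stand in for the filtered relation.

```python
from collections import Counter

users = [("amy", 22), ("bob", 31), ("cat", 19)]               # hypothetical (name, age)
pages = [("amy", "/home"), ("cat", "/home"), ("bob", "/home"),
         ("amy", "/faq"), ("cat", "/blog"), ("cat", "/faq")]  # hypothetical (user, url)

# Filtered: names of users aged 18 to 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined: page views whose user survived the filter
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped + Summed: clicks per url
summed = Counter(url for _, url in joined)

# Sorted + Top5: the five urls with the most clicks
top5 = summed.most_common(5)
print(top5)
```

Bob's view of /home is dropped by the age filter, so /home ends up with two clicks rather than three.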

45

In MapReduce!

[Code listing: the same query written directly against the Java MapReduce API. Class MRExample defines the mappers LoadPages, LoadAndFilterUsers, LoadJoined and LoadClicks and the reducers Join, ReduceUrls and LimitClicks. Its main method configures five jobs ("Load Pages", "Load and Filter Users", the join of users and pages, "Group URLs", and the top 100 sites), links them with addDependingJob, and runs them through a JobControl]
46

Ease of Translation

Load Users         Users = LOAD …
Filter by age      Filtered = FILTER …
Load Pages         Pages = LOAD …
Join on name       Joined = JOIN …
Group on url       Grouped = GROUP …
Count clicks       Summed = … COUNT()…
Order by clicks    Sorted = ORDER …
Take top 5         Top5 = LIMIT …

47

The Hadoop/Pig/Cassandra stack
 Cassandra has gained some significant integration points
with Hadoop and its analytics tools
 In order to achieve Hadoop’s data locality, Cassandra nodes
must be part of the Hadoop cluster by running a tasktracker
process. So the namenode and jobtracker can reside outside
of the Cassandra cluster

[Figure: a three-node Cassandra/Hadoop cluster with external namenode / jobtracker]

48

Hadoop jobs

 Cassandra has a Java source package for Hadoop integration
org.apache.cassandra.hadoop

 ColumnFamilyInputFormat extends InputFormat
 ColumnFamilyOutputFormat extends OutputFormat
 ConfigHelper, a helper class to configure Cassandra-specific
information

 Hadoop output streaming was introduced in Cassandra 0.7 but
removed in 0.8

49

Pig alongside Cassandra

 The Pig integration CassandraStorage() (a LoadFunc
implementation) allows Pig to Load/Store data from/to
Cassandra
grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();

 The pig_cassandra script, shipped with the Cassandra source,
performs the necessary initialisation (Pig environment
variables still need to be set)

 Pygmalion is a set of scripts and UDFs to facilitate the use of Pig
alongside Cassandra

50

Workflow

 A workflow system provides an infrastructure to set up &
manage a sequence of interdependent jobs / set of jobs

 The Hadoop ecosystem includes a set of workflow tools to
run applications over MapReduce processes or high-level
languages
 Cascading (http://www.cascading.org/). A Java library defining data
processing workflows and rendering them to MapReduce jobs
 Oozie (http://yahoo.github.com/oozie/)
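
At its core, a workflow system runs each job only after the jobs it depends on. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+); the job names are hypothetical and this is not how Cascading or Oozie are configured.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "load_pages": set(),
    "load_users": set(),
    "join": {"load_pages", "load_users"},
    "group": {"join"},
    "limit": {"group"},
}

def run_workflow(deps, run_job):
    """Run every job after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(deps).static_order())
    for job in order:
        run_job(job)
    return order

executed = []
order = run_workflow(dependencies, executed.append)
print(order)
```

A real workflow engine would also run independent jobs in parallel and handle retries, which this sketch leaves out.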

51

Some links

 http://hadoop.apache.org
 http://pig.apache.org/
 https://cwiki.apache.org/confluence/display/PIG/Index
 PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
 DataFu: https://github.com/linkedin/datafu
 Pygmalion: https://github.com/jeromatron/pygmalion
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
 Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
 Interesting papers:
 http://bit.ly/rskJho - Original MapReduce paper
 http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of
MapReduce: The Pig Experience’

52

A simple data flow

Top 50 users / locations
[same script, different group key]

[Figure: data flow. Load checkins data; Keep only the two ids; Group by user/loc id & Order; Limit to top 50]

53

Another data flow

All the checkins, over weeks

[Figure: data flow. Load checkins data; Split_date; Group by date; Count the tuples. Side note: Group by weeks using Stream]

54
