Introduction to Pig

Apache Pig – Introduction and Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala

© Hortonworks Inc. 2012

Big Data Platforms
[Figure: bubble chart of cost per TB against adoption; size of bubble = cost effectiveness of solution]

Topics
• What is Pig?
• Why Pig?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
• Q&A

What is Pig?
• System for processing large, unstructured data sets
• Uses HDFS and MapReduce
• Data flow language
• Scripts compile to a Directed Acyclic Graph (DAG) of operations
• Started at Yahoo! Research
• Joined Apache incubator in 2007
• Graduated to Subproject of Hadoop in 2008
• Top level project in Apache since 2010

Pig Philosophy 
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly

Components
• Pig Engine – Parser, Optimizer and distributed query execution
• Grunt – CLI shell
• Pig Latin – Procedural Language

Why Pig?
• High-level language that increases programmer productivity
• Designed for parallel data flow
• Reduces complexity by abstracting low-level Map and Reduce jobs and the chaining of MapReduce jobs
• Can be run from a client/gateway machine with no configuration on the cluster
• Multiple versions of Pig can co-exist as long as each is compatible with the cluster's Hadoop version

Running Pig
Pig Latin can be run in three modes:
• MapReduce: code executes as MapReduce jobs on a Hadoop cluster
$ pig myscript.pig
• Local: code executes locally in a single JVM using local data
$ pig -x local myscript.pig

• Interactive: running pig with no script starts the Grunt shell, where commands can be entered interactively

GRUNT shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile

Data Types
• Scalar Types
– int, long, float, double, chararray, bytearray, boolean, datetime
• Complex Types
– Map. Collection of key-value pairs
– [name#alan, age#30]
– Tuple. Ordered set of values
– (alan,40,engineering)
– Bag. Unordered collection of tuples
– {(alan,40,engineering),(bob,45,sales)}
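
As a rough mental model, the three complex types can be pictured with ordinary Python containers. This is only an analogy in plain Python, not Pig code: the variable names are made up for illustration, and a Python list keeps its order whereas a Pig bag does not guarantee one.

```python
# Map: keys are chararrays, values can be any Pig type.
person_map = {"name": "alan", "age": 30}            # [name#alan, age#30]

# Tuple: fixed, ordered fields addressed by position.
person_tuple = ("alan", 40, "engineering")          # (alan,40,engineering)

# Bag: a collection of tuples (modelled as a list; Pig bags are unordered).
people_bag = [("alan", 40, "engineering"), ("bob", 45, "sales")]

print(person_map["name"])                   # map lookup by key
print(person_tuple[1])                      # second field of the tuple
print(sorted(t[0] for t in people_bag))     # first field of every tuple in the bag
```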

• Pig Latin consists of relations and a set of operations that work on relations
• Schema for relations is optional
• $0 … $n can be used to refer to fields in a relation by position
• null means the data is undefined
• Any missing or invalid fields are loaded as null
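
The null-on-load behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not how Pig implements loading: the schema, the sample data and the `load` and `cast` helpers are all hypothetical, and Python's `None` stands in for Pig's null.

```python
import csv
import io

# Hypothetical schema, comparable to AS (name:chararray, age:int, dept:chararray).
SCHEMA = [("name", str), ("age", int), ("dept", str)]

def cast(value, to_type):
    """Convert value to to_type; return None (null) if it is missing or invalid."""
    if value is None or value == "":
        return None
    try:
        return to_type(value)
    except ValueError:
        return None

def load(text, schema, delimiter=","):
    """Parse delimited text into a list of tuples, one per input line."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Pad short rows so that missing trailing fields become None.
        padded = fields + [None] * (len(schema) - len(fields))
        rows.append(tuple(cast(v, t) for v, (_, t) in zip(padded, schema)))
    return rows

data = "alan,40,engineering\nbob,forty-five,sales\ncarol,38\n"
relation = load(data, SCHEMA)
for row in relation:
    print(row)
```

In this sketch the second row gets a null age because "forty-five" cannot be read as an int, and the third row gets a null dept because the field is missing. Fields are then addressed by position, so `row[1]` plays the role of `$1`.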

Input and Output
• A = LOAD 'file' USING PigStorage(',') AS (data1:datatype1, data2:datatype2, ...);

• STORE A INTO 'file2' USING PigStorage(',');

• DUMP A;

• DESCRIBE A;

Relational Operations
• GROUP A BY age;

• FOREACH A GENERATE $1 - $3;

• FILTER A BY $1 > 10;

• ORDER A BY $1 DESC, $2;

• JOIN A BY $1, B BY $5;
• JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
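
To make the semantics of these operators concrete, here is a small plain-Python sketch of what GROUP, FILTER, ORDER and JOIN compute on in-memory data. It is an analogy under stated assumptions, not Pig's implementation: the relations `A` and `B`, their contents, and the helpers `group_by` and `join` are invented for illustration, and Pig itself does not guarantee row order except after ORDER.

```python
from collections import defaultdict

# Hypothetical relations; each row is a tuple, so row[0] plays the role of $0.
A = [("alan", 40, "eng"), ("bob", 45, "sales"),
     ("carol", 40, "eng"), ("dave", 30, "ops")]
B = [("eng", "Building 1"), ("sales", "Building 2"), ("hr", "Building 3")]

def group_by(relation, key):
    """GROUP: one output row per key, holding the key and a bag of matching rows."""
    bags = defaultdict(list)
    for row in relation:
        bags[key(row)].append(row)
    return [(k, rows) for k, rows in bags.items()]

def join(left, lkey, right, rkey, left_outer=False):
    """JOIN: concatenate matching rows; LEFT OUTER pads unmatched left rows with nulls."""
    index = defaultdict(list)
    for r in right:
        index[rkey(r)].append(r)
    width = len(right[0]) if right else 0
    out = []
    for l in left:
        matches = index.get(lkey(l), [])
        if matches:
            out.extend(l + r for r in matches)
        elif left_outer:
            out.append(l + (None,) * width)
    return out

grouped = group_by(A, key=lambda r: r[1])             # GROUP A BY $1
filtered = [r for r in A if r[1] > 40]                # FILTER A BY $1 > 40
ordered = sorted(A, key=lambda r: (-r[1], r[0]))      # ORDER A BY $1 DESC, $0
inner = join(A, lambda r: r[2], B, lambda r: r[0])    # JOIN A BY $2, B BY $0
outer = join(A, lambda r: r[2], B, lambda r: r[0], left_outer=True)

for name, result in [("grouped", grouped), ("filtered", filtered),
                     ("ordered", ordered), ("inner", inner), ("outer", outer)]:
    print(name, result)
```

The grouped result is worth a look: each row is a (key, bag) pair, which is why a FOREACH over a grouped relation usually applies an aggregate or FLATTEN to the bag.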

• LIMIT A 10;

• SAMPLE A 0.1;

• GROUP A BY $1 PARALLEL 10;

• User Defined Functions (UDFs) and Piggybank
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends';
backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);

• Invoking static java methods

• FLATTEN

• TOKENIZE
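
TOKENIZE and FLATTEN are often used together, for example in a word count. The following plain-Python sketch shows the idea: TOKENIZE turns a line into a bag of one-field tuples, and FLATTEN un-nests that bag into one row per element. The helpers and sample lines are hypothetical, and this `tokenize` is a simplification that splits on whitespace only, whereas Pig's TOKENIZE also splits on some punctuation.

```python
from collections import Counter

def tokenize(line):
    """TOKENIZE analogue: split a string into a bag of single-field tuples."""
    return [(word,) for word in line.split()]

def flatten(rows):
    """FLATTEN analogue for a bag in the last field: one output row per bag element."""
    out = []
    for *prefix, bag in rows:
        for element in bag:
            out.append(tuple(prefix) + tuple(element))
    return out

lines = [("the quick brown fox",), ("the lazy dog",)]

# Comparable to: FOREACH lines GENERATE TOKENIZE($0)
bags = [(tokenize(line),) for (line,) in lines]

# Comparable to: FOREACH lines GENERATE FLATTEN(TOKENIZE($0))
words = flatten(bags)

# Grouping the words and counting each bag gives the word count.
counts = Counter(word for (word,) in words)

print(bags[1])
print(words)
print(counts["the"])
```

A row whose bag is empty produces no output rows in this sketch.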

0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
– Boolean type
– OTHERWISE branch for SPLIT
– Maps, Bags and Tuples can be generated without UDFs
– Register collection of jars
• Performance Improvements

Current work in progress
• DateTime datatype
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint

References
• Labs are from
– https://github.com/alanfgates/programmingpig
– https://github.com/michiard/CLOUDS-LAB

• 0.10.0 Features and current WIP
– http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan Gates

Thank You!
Questions & Answers
Ravi Mutyala
Systems Architect
Hortonworks
Twitter: @rmutyala
www.hortonworks.com
