Introduction to Apache Pig
2. Session Outline
What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion
Sunday, April 29, 2012 © Sabre Holdings, 2012 2
3. What is Pig?
Framework for Analyzing large Data Sets
Sits on top of Hadoop
5. Pig Food?
Pig has a great taste for structured and unstructured data.
CSVs, TSVs, Delimited Data
Any Kind of Logs
Unstructured Sentences.
Databases via JDBC Connections
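A minimal sketch of how delimited and unstructured files might be fed to Pig. The file names and field names are hypothetical; PigStorage and TextLoader are Pig built-ins.

```pig
-- Comma-delimited data: PigStorage splits each line on the given delimiter.
-- 'bookings.csv' and its fields are made up for illustration.
csv = LOAD 'bookings.csv' USING PigStorage(',')
      AS (id:int, city:chararray, fare:double);

-- Tab-delimited data: PigStorage is the default loader and defaults to tab.
tsv = LOAD 'bookings.tsv' AS (id:int, city:chararray, fare:double);

-- Unstructured text: TextLoader yields each line as a single chararray.
lines = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);

-- Print the schema Pig has attached to the relation.
DESCRIBE csv;
```

Note that PigStorage(',') splits on every comma, so quoted CSV fields containing commas need a different loader.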
6. Pig Language?
Pig Understands Pig-Latin (Simple Query Algebra)
- Data Flow Language
- Interdependent series of operations
- Supports ELT workflows very effectively
- Filtering/Aggregations/Applying Functions
7. Pig is not Racist!!
Pig Streaming
- STREAM lets Pig's food interact with alien (external) scripts/binaries
A = LOAD 'log.txt';
C = STREAM A THROUGH `extractor.pl`;
8. Pig vs Traditional Map-Reduce
(Challenges/Solutions)
•Problem:
Resources: Map-Reduce requires Java programmers
•Solution:
Users familiar with scripting languages like Python/Perl can easily code.
•Problem:
Time: Map-Reduce involves multiple stages to arrive at a solution
• Solution:
100 lines of Java ~ 10 lines of Pig
4 hours of Java Programming ~ 15 minutes of Pig Programming
•Problem:
In Map-Reduce, users have to reinvent common baked-in functionality like Join/Cross/Filter
•Solution:
Programmers can leverage built-in libraries and functions for Join/Regex Extraction etc.
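As a rough illustration of the built-in Join point, a join that would need a hand-written reducer in Java Map-Reduce might look like this in Pig Latin. The relations, files and fields are hypothetical.

```pig
-- Two hypothetical tab-delimited inputs sharing a user_id column.
users  = LOAD 'users.tsv'  AS (user_id:int, name:chararray);
visits = LOAD 'visits.tsv' AS (user_id:int, url:chararray);

-- Built-in inner join on the shared key.
joined = JOIN users BY user_id, visits BY user_id;

-- After a join, fields are disambiguated with the relation::field syntax.
named = FOREACH joined GENERATE users::name AS name, visits::url AS url;

DUMP named;
```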
9. Appetite!
Pigs can digest huge datasets
- Batch Log Processing
NOTE:
Do NOT feed small datasets to Pig. It gets angry.
10. Winner in Map-Reduce Race! (1.1x)
If Pig was first, who was second?
Any Guesses?
11. How to Access Pig?
Local Mode
MapReduce Mode
12. Let’s Ride a Pig
• LOAD
• GENERATE, FOREACH
• FILTERS
• DUMP
• STORE
• STREAM
• REGULAR EXPRESSION EXTRACTION
• Group, Count, Joins
• BAGS vs SETS?
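The operators listed above could be strung together roughly as follows. This is a sketch under assumptions: the input file, the 'ERROR' marker and the 'code=' pattern are invented for illustration; REGEX_EXTRACT and COUNT are Pig built-ins.

```pig
-- LOAD: read each log line as one chararray.
raw = LOAD 'log.txt' USING TextLoader() AS (line:chararray);

-- FILTER: MATCHES must match the whole string, hence the leading/trailing .*
errors = FILTER raw BY line MATCHES '.*ERROR.*';

-- FOREACH ... GENERATE with regular expression extraction:
-- pull capture group 1; the result is null when the pattern is absent.
codes = FOREACH errors GENERATE REGEX_EXTRACT(line, 'code=(\\d+)', 1) AS code;
found = FILTER codes BY code IS NOT NULL;

-- GROUP + COUNT: each group carries a bag of its matching tuples.
by_code = GROUP found BY code;
counts  = FOREACH by_code GENERATE group AS code, COUNT(found) AS n;

-- DUMP prints to the console; STORE writes to the file system.
DUMP counts;
STORE counts INTO 'error_counts' USING PigStorage('\t');
```

On bags vs sets: the bag that GROUP builds for each key may hold duplicate tuples, which is why COUNT reports occurrences rather than distinct values.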
13. How can you forget this one?
• Piggy Bank
– Pig library for already defined functions
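A minimal sketch of calling a Piggy Bank function. The jar location and input file are assumptions; the UPPER class ships in the Piggy Bank string package.

```pig
-- Make the Piggy Bank jar visible to the script (path is an assumption).
REGISTER piggybank.jar;

-- Give the fully qualified UDF class a short alias.
DEFINE ToUpper org.apache.pig.piggybank.evaluation.string.UPPER();

-- 'names.txt' is a hypothetical one-column input.
names = LOAD 'names.txt' AS (name:chararray);
upper = FOREACH names GENERATE ToUpper(name);

DUMP upper;
```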
15. CASE STUDY – LOG Analytics
• Apache Access Logs
Let’s work on it!
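The slides do not include the script used in the session, so the following is only one possible starting point: counting requests per client address with the Piggy Bank CommonLogLoader. The jar path and 'access.log' are assumptions, and the loader expects Common Log Format.

```pig
-- Path to the Piggy Bank jar is an assumption.
REGISTER piggybank.jar;
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- CommonLogLoader parses each access-log line into these nine fields.
logs = LOAD 'access.log' USING LogLoader
       AS (addr:chararray, logname:chararray, user:chararray, time:chararray,
           method:chararray, uri:chararray, proto:chararray,
           status:int, bytes:int);

-- Requests per client address.
by_addr = GROUP logs BY addr;
hits    = FOREACH by_addr GENERATE group AS addr, COUNT(logs) AS requests;

-- Ten busiest clients.
ranked = ORDER hits BY requests DESC;
top10  = LIMIT ranked 10;

DUMP top10;
```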
16. RESOURCES
• Documentation – Apache Wiki (not enough)
• Doubts –> Forums
– Stack Overflow is my favorite
• Overview
– Cloudera Video Training
• Best tutorial on the internet:
http://pig.apache.org/docs/r0.7.0/tutorial.html