Introduction to Apache Pig

Brief Motivation talk on Apache Pig

Published in: Technology, Business

  1. HADOOP SESSION-4: Introduction to Pig
  2. Session Outline
     - What is Pig?
     - Motivation
     - Background
     - Components & Architecture
     - Pig & Map-Reduce
     - Case Study – Log Analytics
     - Conclusion
     Sunday, April 29, 2012 © Sabre Holdings, 2012
  3. What is Pig?
     - Framework for analyzing large data sets
     - Sits on top of Hadoop
  4. Pig has map-reduce powers!
  5. Pig Food?
     Pig has a great taste for structured and unstructured data:
     - CSVs, TSVs, delimited data
     - Any kind of logs
     - Unstructured sentences
     - Databases via JDBC connections
  6. Pig Language?
     Pig understands Pig Latin (a simple query algebra):
     - Data flow language
     - Interdependent series of operations
     - Allows ELTs very effectively
     - Filtering / aggregations / applying functions
  7. Pig is not Racist!!
     Pig Streaming
     - Pig streaming allows Pig's food to interact with alien scripts/binaries
         A = LOAD 'log.txt';
         C = STREAM A THROUGH `extractor.pl`;
  8. Pig vs Traditional Map-Reduce (Challenges/Solutions)
     • Problem: Resources. Map-Reduce requires a Java programmer.
     • Solution: Users familiar with scripting languages like Python/Perl can easily code in Pig.
     • Problem: Time. Map-Reduce involves multiple stages to arrive at a solution.
     • Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming.
     • Problem: In Map-Reduce, users have to re-invent common functionality like Join/Cross/Filter.
     • Solution: Programmers can leverage built-in libraries and functions for Join, regex extraction, etc.
  9. Appetite!
     Pigs can digest huge datasets
     - Batch log processing
     NOTE: Do NOT FEED small datasets to Pig. It gets angry.
  10. Winner in Map-Reduce Race! (1.1x)
      If Pig was first, who was second? Any guesses?
  11. How to Access Pig?
      - Local Mode
      - MapReduce Mode
  12. Let's Ride a Pig
      • LOAD
      • GENERATE, FOREACH
      • FILTERS
      • DUMP
      • STORE
      • STREAM
      • REGULAR EXPRESSION EXTRACTION
      • Group, Count, Joins
      • BAGS vs SETS?
  13. How can you forget this one?
      • Piggy Bank – Pig library of already defined functions
  14. Theoretical Summarization
      • Let us not be afraid of swine flu. We can still be friends with pigs.
  15. CASE STUDY – LOG Analytics
      • Apache access logs
      Let's work on it!
  16. RESOURCES
      • Documentation – Apache Wiki (not enough)
      • Doubts -> Forums – Stack Overflow is my favorite
      • Overview – Cloudera Video Training
      • Best tutorial on the internet: http://pig.apache.org/docs/r0.7.0/tutorial.html
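
The case study on slide 15 names Apache access logs, but the transcript carries no script. What follows is a minimal sketch of what such a job might look like in Pig Latin, chaining several of the operations listed on slide 12 (LOAD, FOREACH ... GENERATE, regular expression extraction, FILTER, GROUP, COUNT, STORE). Everything specific here is an assumption made for illustration and is not taken from the talk: the input path `access_log`, the output path `not_found_by_ip`, the relation names, the choice to count 404 responses per client IP, and the regular expression, which assumes Apache Common Log Format. It also assumes a Pig release that ships `REGEX_EXTRACT_ALL` as a built-in.

```pig
-- Sketch only: paths, relation names, and the 404-per-IP question are hypothetical.

-- Read each log line as a single string.
raw = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- Extract the client IP and the HTTP status code.
-- Assumes Common Log Format: ip ident user [timestamp] "request" status bytes
parsed = FOREACH raw GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] "[^"]*" (\\d{3}) .*$'))
    AS (ip:chararray, status:chararray);

-- Keep only "not found" responses; lines that did not match the
-- pattern have a null status and are dropped here as well.
errors = FILTER parsed BY status == '404';

-- Count the 404 responses per client IP.
by_ip  = GROUP errors BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(errors) AS hits;

-- Write the result out; DUMP counts; would print it to the console instead.
STORE counts INTO 'not_found_by_ip';
```

Saved under a hypothetical name such as `log_404.pig`, the script could be tried on a small sample with `pig -x local log_404.pig` (the Local Mode from slide 11) before running it against the cluster in MapReduce mode with `pig log_404.pig`.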
