HADOOP SESSION-4



   Introduction to Pig
Session Outline

What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion

Sunday, April 29, 2012       © Sabre Holdings, 2012   2
What is Pig?

Framework for analyzing large data sets
Sits on top of Hadoop




Pig has map-reduce powers!




Pig Food?
       Pig has a great taste for structured and unstructured data.


            CSVs, TSVs, delimited data
            Any kind of logs
            Unstructured sentences
            Databases via JDBC connections




Pig Language?

      Pig understands Pig Latin (a simple query algebra)
      - Data flow language
             - An interdependent series of operations
      - Handles ELT workloads effectively
      - Filtering, aggregation, applying functions
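
To make the "data flow" idea concrete, here is a minimal sketch in Pig Latin. The file name `sales.csv` and its three columns are assumptions made up for illustration; they are not from the session. Each statement names an intermediate relation that the next statement consumes.

```pig
-- Hypothetical input: comma-delimited rows of region, product, amount.
sales = LOAD 'sales.csv' USING PigStorage(',')
        AS (region:chararray, product:chararray, amount:double);

-- Filtering: keep only the larger sales.
big = FILTER sales BY amount > 100.0;

-- Aggregation: group by region, then apply built-in functions per group.
by_region = GROUP big BY region;
totals = FOREACH by_region GENERATE
             group AS region,
             SUM(big.amount) AS total,
             COUNT(big) AS n;

-- Print the result to the console.
DUMP totals;
```

Nothing runs until `DUMP` (or `STORE`) is reached; Pig then plans the whole chain of steps together.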




Pig is not Racist!!

     Pig Streaming
     - The STREAM operator lets Pig's food (its data) pass through
     external scripts/binaries

A = LOAD 'log.txt';
-- STREAM takes the external command in backticks, not quotes
C = STREAM A THROUGH `extractor.pl`;



Pig vs Traditional Map-Reduce (Challenges/Solutions)

Resources
• Problem: Map-Reduce requires a Java programmer.
• Solution: Users familiar with scripting languages like Python/Perl can easily write Pig code.

Time
• Problem: Map-Reduce involves multiple stages to arrive at a solution.
• Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming.

Baked
• Problem: In Map-Reduce, users have to re-invent common functionality like Join/Cross/Filter.
• Solution: Programmers can leverage built-in libraries and functions for Join, regex extraction, etc.
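
As a rough illustration of the size difference, here is a word count sketched in Pig Latin. The input and output paths are hypothetical. The equivalent hand-written Map-Reduce job needs a mapper class, a reducer class, and a driver in Java.

```pig
-- Hypothetical input: a plain text file, one line per record.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words; FLATTEN turns the bag of tokens into rows.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Write the result to a (hypothetical) output directory.
STORE counts INTO 'wordcount_out';
```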



Appetite!

Pigs can digest huge datasets
  - Batch Log Processing



NOTE:
Do NOT feed small datasets to Pig. It gets angry: every script runs as Map-Reduce jobs, and their startup overhead dominates on small inputs.



Winner in Map-Reduce Race! (1.1x)
     If Pig was first, who was second?



Any Guesses?




How to Access Pig?




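• Local Mode (pig -x local): runs on a single machine against the local file system
• MapReduce Mode (pig -x mapreduce, the default): runs on a Hadoop cluster against HDFS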
Let’s Ride a Pig
•    LOAD
•    FOREACH ... GENERATE
•    FILTER
•    DUMP
•    STORE
•    STREAM
•    REGULAR EXPRESSION EXTRACTION
•    Group, Count, Joins
•    BAGS vs SETS?
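
The operators listed above can be strung together as in the sketch below. The two tab-delimited files (`users.tsv`, `visits.tsv`), their columns, and the output path are hypothetical, chosen only to give every operator something to do.

```pig
-- LOAD: the default loader (PigStorage) splits on tabs.
users  = LOAD 'users.tsv'  AS (id:int, name:chararray, email:chararray);
visits = LOAD 'visits.tsv' AS (user_id:int, url:chararray);

-- FOREACH ... GENERATE with regular expression extraction:
-- capture group 1 is everything after the '@' in the email address.
domains = FOREACH users GENERATE id, name,
              REGEX_EXTRACT(email, '@(.+)$', 1) AS domain;

-- FILTER: drop rows where the regex did not match.
known = FILTER domains BY domain IS NOT NULL;

-- JOIN: attach each visit to its user.
joined = JOIN known BY id, visits BY user_id;

-- GROUP + COUNT: visits per email domain.
by_domain = GROUP joined BY known::domain;
hits = FOREACH by_domain GENERATE group AS domain, COUNT(joined) AS n_visits;

-- Bags vs sets: a bag may hold duplicate tuples; DISTINCT removes them.
urls = FOREACH visits GENERATE url;
unique_urls = DISTINCT urls;

-- DUMP prints to the console; STORE writes to the file system.
DUMP hits;
STORE unique_urls INTO 'unique_urls_out';
```

`STREAM` is left out here; it takes an external command in backticks, for example STREAM visits THROUGH `some_script.pl`, where the script name is again hypothetical.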

How can you forget this one?
• Piggy Bank
       – A library of user-contributed, ready-made Pig functions (UDFs)
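
A minimal sketch of using a Piggy Bank function follows. The jar location and the input path are assumptions, and the exact class set depends on the Piggy Bank build you have. `CommonLogLoader` is a Piggy Bank loader for Apache Common Log Format files; the field names follow its documented output order.

```pig
-- Assumption: piggybank.jar has been built and lives at this path.
REGISTER /path/to/piggybank.jar;

-- Give the long class name a short alias.
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- Hypothetical input path; the loader emits nine fields per log line.
logs = LOAD 'access_log' USING LogLoader
       AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes);

-- Requests per client address.
by_host = GROUP logs BY remoteAddr;
hits = FOREACH by_host GENERATE group AS remoteAddr, COUNT(logs) AS n;
DUMP hits;
```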




Theoretical Summarization

• Let us not be afraid of swine flu; we can still
  be friends with pigs.




CASE STUDY – LOG Analytics

• Apache Access Logs



                         Let’s work on it!
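
One possible starting point for the exercise, using only built-in functions, is sketched below. It assumes the logs are in Common Log Format and live at a hypothetical path `access_log`; the regular expression and all relation names are illustrative, not the session's own solution.

```pig
-- Hypothetical input path; one log entry per line.
raw = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- Split each line into fields. REGEX_EXTRACT_ALL must match the whole
-- line, hence the trailing '.*'. Backslashes are doubled inside Pig
-- string literals.
parsed = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,
      '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\S+) (\\S+) [^"]*" (\\d{3}) (\\S+).*')
  ) AS (ip:chararray, ts:chararray, method:chararray, uri:chararray,
        status:chararray, bytes:chararray);

-- Lines that do not match are expected to come through as nulls; drop them.
valid = FILTER parsed BY ip IS NOT NULL;

-- Hits per HTTP status code.
by_status = GROUP valid BY status;
status_counts = FOREACH by_status GENERATE group AS status, COUNT(valid) AS hits;

-- Ten most requested URIs.
by_uri = GROUP valid BY uri;
uri_counts = FOREACH by_uri GENERATE group AS uri, COUNT(valid) AS hits;
ordered = ORDER uri_counts BY hits DESC;
top_uris = LIMIT ordered 10;

DUMP status_counts;
STORE top_uris INTO 'top_uris_out';
```

Running this in local mode on a small sample first is a cheap way to check the regular expression before pointing it at the full logs.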


RESOURCES

• Documentation – Apache Wiki (not enough)
• Questions – Forums
       – Stack Overflow is my favorite
• Overview
       – Cloudera Video Training
• Best tutorial on the internet:
  http://pig.apache.org/docs/r0.7.0/tutorial.html