HADOOP SESSION-4



   Introduction to Pig
Session Outline

What is Pig?
Motivation
Background
Components & Architecture
Pig & Map-Reduce
Case Study – Log Analytics
Conclusion

Sunday, April 29, 2012       © Sabre Holdings, 2012   2
What is Pig?

Framework for analyzing large data sets
Sits on top of Hadoop




Pig has map-reduce powers!




Pig Food?
       Pig has a great taste for structured and unstructured data.


            CSVs, TSVs, delimited data
            Any kind of logs
            Unstructured sentences
            Databases via JDBC connections




Pig Language?

      Pig understands Pig Latin (a simple query algebra)
      - Data flow language
             - Interdependent series of operations
      - Handles ELT very effectively
      - Filtering, aggregation, applying functions




Pig is not Racist!!

     Pig Streaming
     - Pig streaming lets Pig's data pass through
     external scripts/binaries

A = LOAD 'log.txt';
C = STREAM A THROUGH `extractor.pl`;
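
To make the streaming idea concrete, here is a minimal sketch of what an external script in the role of extractor.pl might do. It is written in Python rather than Perl, and everything in it is an assumption for illustration: the slides do not show the script, so the field positions, the sample log line, and the names `extract` and `run` are hypothetical. The sketch relies on the usual streaming contract, where records arrive as text lines and results go back out as tab-separated lines.

```python
import io

def extract(line):
    """Return (ip, path) from one whitespace-separated log line, or None if malformed."""
    parts = line.split()
    if len(parts) < 7:
        return None
    # Assumed layout: client IP is field 0, request path is field 6.
    return parts[0], parts[6]

def run(lines, out):
    """Read records line by line and write tab-separated results."""
    for line in lines:
        fields = extract(line.rstrip("\n"))
        if fields is not None:
            out.write("\t".join(fields) + "\n")

# Hypothetical input standing in for the records of relation A.
sample = io.StringIO(
    '10.0.0.1 - - [29/Apr/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512\n'
    "bad line\n"
)
result = io.StringIO()
run(sample, result)
print(result.getvalue(), end="")
```

In a real streaming script, `run` would be called with `sys.stdin` and `sys.stdout` instead of the in-memory buffers used here.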



Pig vs Traditional Map-Reduce (Challenges/Solutions)

Resources
• Problem: Map-Reduce requires a Java programmer.
• Solution: Users familiar with scripting languages like Python/Perl can easily code in Pig.

Time
• Problem: Map-Reduce involves multiple stages to arrive at a solution.
• Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming.

Baked
• Problem: In Map-Reduce, users have to re-invent common functionality like Join/Cross/Filter.
• Solution: Programmers can leverage built-in libraries and functions for joins, regex extraction, etc.



Appetite!

Pigs can digest huge datasets
  - Batch Log Processing



NOTE:
Do NOT feed small datasets to Pig. It gets angry: job startup overhead outweighs the actual work.



Winner in Map-Reduce Race! (1.1x)
     If Pig was first, who was second?



Any Guesses?




How to Access Pig?




• Local Mode
• MapReduce Mode
Let’s Ride a Pig
•    LOAD
•    FOREACH ... GENERATE
•    FILTER
•    DUMP
•    STORE
•    STREAM
•    Regular expression extraction
•    GROUP, COUNT, JOIN
•    Bags vs sets?
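
The last bullet is easiest to see with data. The sketch below is plain Python, not Pig Latin, and only mimics the semantics: the records and variable names are made up for illustration. It filters a few tuples, then shows that a bag keeps duplicate tuples while a set collapses them, and finishes with a group-and-count step.

```python
from collections import Counter

# Hypothetical (user, url) tuples, roughly what a LOAD step might produce.
records = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/home"),
    ("alice", "/cart"),
]

# FILTER: keep only visits to /home.
home_visits = [r for r in records if r[1] == "/home"]

# A bag keeps duplicate tuples; a set collapses them.
bag = Counter(home_visits)
as_set = set(home_visits)

# GROUP by user, then COUNT the tuples in each group.
visits_per_user = Counter(user for user, _ in home_visits)

print(sum(bag.values()), len(as_set))
print(dict(visits_per_user))
```

The bag holds three tuples and the set only two, which is why counting over a bag and counting over distinct values can give different answers.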

How can you forget this one?
• Piggy Bank
       – Pig's library of user-contributed, ready-made functions




Theoretical Summarization

• Let us not be afraid of swine flu; we can still
  be friends with pigs.




CASE STUDY – LOG Analytics

• Apache Access Logs



                         Let’s work on it!
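
As a rough idea of what the exercise involves, the sketch below parses a few access-log lines with a regular expression and counts hits per client IP and per status code. It is plain Python rather than Pig Latin, and it rests on assumptions: the sample lines are invented, the pattern assumes the Common Log Format, and the names `LOG_PATTERN` and `parse` are hypothetical. The steps correspond loosely to regex extraction, filtering out malformed lines, grouping, and counting.

```python
import re
from collections import Counter

# Assumed Common Log Format: ip ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Invented sample lines; a real run would read the access log instead.
SAMPLE_LINES = [
    '10.0.0.1 - - [29/Apr/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [29/Apr/2012:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.1 - - [29/Apr/2012:10:00:02 +0000] "POST /login HTTP/1.1" 200 64',
    "not a log line",
]

def parse(lines):
    """Yield one dict of named fields per line that matches the pattern."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()

entries = list(parse(SAMPLE_LINES))

# Group by a field and count the entries in each group.
hits_per_ip = Counter(entry["ip"] for entry in entries)
hits_per_status = Counter(entry["status"] for entry in entries)

print(hits_per_ip.most_common())
print(dict(hits_per_status))
```

Lines that do not match the pattern are skipped rather than raising an error, which is a common choice for batch log processing where a few malformed records are expected.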


RESOURCES

• Documentation – Apache Wiki (not enough)
• Doubts –> Forums
       – Stack Overflow is my favorite
• Overview
       – Cloudera Video Training
• Best tutorial on the internet:
  http://pig.apache.org/docs/r0.7.0/tutorial.html

 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

Introduction to Apache Pig

  • 1. HADOOP SESSION-4: Introduction to Pig
  • 2. Session Outline: What is Pig?; Motivation; Background; Components & Architecture; Pig & Map-Reduce; Case Study – Log Analytics; Conclusion. Sunday, April 29, 2012 © Sabre Holdings, 2012
  • 3. What is Pig? A framework for analyzing large data sets that sits on top of Hadoop.
  • 4. Pig has map-reduce powers!
  • 5. Pig Food? Pig has a great taste for structured and unstructured data: CSVs, TSVs, and other delimited data; any kind of logs; unstructured sentences; databases via JDBC connections.
  • 6. Pig Language? Pig understands Pig Latin (a simple query algebra). It is a data flow language, meaning an interdependent series of operations; it handles ELT very effectively; and it supports filtering, aggregations, and applying functions.
  • 7. Pig is not Racist!! Pig Streaming: Pig streaming lets Pig’s food (its data) interact with alien (external) scripts and binaries. A = LOAD 'log.txt'; C = STREAM A THROUGH `extractor.pl`;
  • 8. Pig vs Traditional Map-Reduce (Challenges/Solutions). Resources. Problem: Map-Reduce requires a Java programmer. Solution: Users familiar with scripting languages like Python/Perl can easily code. Time. Problem: Map-Reduce involves multiple stages to arrive at a solution. Solution: 100 lines of Java ~ 10 lines of Pig; 4 hours of Java programming ~ 15 minutes of Pig programming. Baked. Problem: In Map-Reduce, users have to re-invent common functionality like Join/Cross/Filter. Solution: Programmers can leverage built-in libraries and functions for Join, regex extraction, etc.
  • 9. Appetite! Pigs can digest huge datasets, for example batch log processing. NOTE: Do NOT feed small datasets to Pig. It gets angry.
  • 10. Winner in Map-Reduce Race! (1.1x) If Pig was first, who was second? Any guesses?
  • 11. How to Access Pig? Local Mode; MapReduce Mode.
  • 12. Let’s Ride a Pig: LOAD; GENERATE, FOREACH; FILTERS; DUMP; STORE; STREAM; REGULAR EXPRESSION EXTRACTION; Group, Count, Joins; BAGS vs SETS?
  • 13. How can you forget this one? Piggy Bank: a Pig library of already-defined functions.
  • 14. Theoretical Summarization: Let us not be afraid of swine flu; we can still be friends with pigs.
  • 15. CASE STUDY – LOG Analytics: Apache access logs. Let’s work on it!
  • 16. RESOURCES. Documentation: Apache Wiki (not enough). Doubts: forums (Stack Overflow is my favorite). Overview: Cloudera video training. Best tutorial on the internet: http://pig.apache.org/docs/r0.7.0/tutorial.html
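
The Apache access log case study and the operators listed on the “Let’s Ride a Pig” slide can be sketched in Pig Latin roughly as follows. This is a minimal sketch, not the presenter's actual script: the file names `access_log.txt` and `hits_by_path`, the relation and field names, and the regular expression (which assumes Common Log Format lines) are all assumptions made for illustration. It also assumes a Pig release that includes the built-in REGEX_EXTRACT_ALL function.

```pig
-- Sketch only: paths, relation names, and field names are hypothetical.

-- LOAD: read each raw log line as one string.
raw = LOAD 'access_log.txt' USING TextLoader() AS (line:chararray);

-- REGULAR EXPRESSION EXTRACTION with FOREACH ... GENERATE: split a
-- Common Log Format line into named fields.
parsed = FOREACH raw GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\S+) (\\S+) [^"]*" (\\d{3}) (\\S+).*$'))
    AS (ip:chararray, ts:chararray, method:chararray,
        path:chararray, status:chararray, bytes:chararray);

-- FILTER: drop lines the pattern did not match, then keep GET requests.
valid = FILTER parsed BY ip IS NOT NULL;
gets  = FILTER valid BY method == 'GET';

-- GROUP and COUNT: number of GET requests per path.
by_path = GROUP gets BY path;
hits    = FOREACH by_path GENERATE group AS path, COUNT(gets) AS requests;

-- Rank paths by request count and keep the ten busiest.
ranked = ORDER hits BY requests DESC;
top10  = LIMIT ranked 10;

-- DUMP prints to the console; STORE writes the full result set to storage.
DUMP top10;
STORE hits INTO 'hits_by_path' USING PigStorage('\t');
```

Saved under a hypothetical name such as `top_paths.pig`, a script like this could be tried on a sample file with `pig -x local top_paths.pig` and then run against the cluster with `pig -x mapreduce top_paths.pig`, which corresponds to the two access modes named on the slide.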