Apache Pig – Introduction and Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala




© Hortonworks Inc. 2012
Big Data Platforms
Cost per TB, Adoption
[Bubble chart: size of bubble = cost effectiveness of solution]
Topics
• What is Pig?
• Why Pig?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
• Q&A




What is Pig?
• System for processing large volumes of unstructured data
• Uses HDFS and MapReduce
• Data flow language
• Scripts are planned as a Directed Acyclic Graph (DAG) of operations
• Started at Yahoo! Research
• Entered the Apache Incubator in 2007
• Graduated to a subproject of Hadoop in 2008
• Top-level Apache project since 2010




Pig Philosophy 
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly




Components
• Pig Engine – Parser, Optimizer and distributed query
  execution
• Grunt – CLI shell
• Pig Latin – Procedural Language




Why Pig?
• High-level language that increases programmer
  productivity
• Designed for parallel data flow
• Reduces complexity by abstracting away low-level map and
  reduce functions and the chaining of MapReduce jobs
• Can be run from a client/gateway machine with no
  configuration on the cluster
• Multiple versions of Pig can co-exist as long as each
  is compatible with the cluster's Hadoop version




Running Pig
A Pig Latin script can be run in three ways
• MapReduce: code executes as MapReduce jobs on a
  Hadoop cluster
       $ pig myscript.pig
• Local: code executes locally in a single JVM using
  local data
       $ pig -x local myscript.pig
• Interactive: running pig with no script starts the Grunt shell,
  where commands can be entered interactively




GRUNT shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile




Data Types
• Scalar Types
  – int, long, float, double, chararray, bytearray, boolean, datetime
• Complex Types
  – Map. Collection of key-value pairs
       – [name#alan, age#30]
  – Tuple. Ordered set of values
       – (alan,40,engineering)
  – Bag. Unordered collection of tuples
       – {(alan,40,engineering),(bob,45,sales)}
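As a rough mental model of the complex types above, the sketch below uses plain Python containers. It is not Pig code, and the variable and function names are illustrative assumptions only.

```python
# Illustrative Python analogues of Pig's complex types (not Pig code).

# Tuple: an ordered sequence of fields, like (alan,40,engineering).
alan = ("alan", 40, "engineering")
bob = ("bob", 45, "sales")

# Bag: an unordered collection of tuples that may contain duplicates.
# A list is used only because it allows duplicates; its order carries no meaning.
employees = [alan, bob]

# Map: keys to values, written [name#alan, age#30] in Pig Latin.
person = {"name": "alan", "age": 30}


def same_bag(left, right):
    """Compare two bags, ignoring order but respecting duplicates."""
    return sorted(left) == sorted(right)
```

Because a bag is unordered, same_bag(employees, [bob, alan]) is true even though the two lists differ in order.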




• Relations and a set of operations that work on
  relations
• Schema for relations is optional
• Fields in a relation can be referenced by position: $0 … $n
• null means the data is undefined
• Any missing or invalid fields are loaded as null
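The null rule above can be pictured with a small Python sketch. This is only a model of the behaviour, not how Pig's loaders are implemented; the schema and the helper names cast_field and load_row are hypothetical.

```python
# Sketch: applying a schema while loading text. Fields that are missing or
# cannot be cast to the declared type become None (standing in for Pig's null).

def cast_field(raw, caster):
    """Return the cast value, or None when the raw text is absent or invalid."""
    if raw is None or raw == "":
        return None
    try:
        return caster(raw)
    except ValueError:
        return None


def load_row(line, casters, delimiter=","):
    """Split one text line and cast each position according to `casters`."""
    parts = line.split(delimiter)
    # Pad short rows so that missing trailing fields load as None.
    parts += [None] * (len(casters) - len(parts))
    return tuple(cast_field(raw, caster) for raw, caster in zip(parts, casters))


# Hypothetical schema, roughly (name:chararray, age:int, gpa:double).
schema = (str, int, float)
row = load_row("alan,forty", schema)  # invalid age, missing gpa
```

Here row[0] plays the role of $0, and the two unusable fields come back as None rather than failing the whole load.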




Input and Output
• A = LOAD 'file' USING PigStorage(',') AS
  (data1:datatype1, data2:datatype2, ...);

• STORE A INTO 'file2' USING PigStorage(',');

• DUMP A;

• DESCRIBE A;




Relational Operations
• grouped = GROUP A BY age;

• diffs = FOREACH A GENERATE $1 - $3;

• filtered = FILTER A BY $1 > 10;

• sorted = ORDER A BY $1 DESC, $2;

• joined = JOIN A BY $1, B BY $5;
• outer_joined = JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
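To make the results of these operators concrete, here is a minimal plain-Python model of GROUP, FILTER, ORDER and an inner JOIN over small in-memory relations. It only mirrors the semantics and says nothing about how Pig executes them on a cluster; the sample relations and function names are assumptions for illustration.

```python
# Plain-Python model of a few relational operators (not Pig code).
from collections import defaultdict


def group_by(relation, key_index):
    """GROUP: map each key to the bag of original tuples that share it."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)


def filter_by(relation, predicate):
    """FILTER: keep only the tuples for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def order_by(relation, key_index, descending=False):
    """ORDER: sort tuples on one positional field."""
    return sorted(relation, key=lambda row: row[key_index], reverse=descending)


def inner_join(left, left_index, right, right_index):
    """JOIN: concatenate every pair of tuples whose keys match."""
    right_by_key = group_by(right, right_index)
    return [l + r for l in left for r in right_by_key.get(l[left_index], [])]


# Hypothetical sample relations.
people = [("alan", 40, "engineering"), ("bob", 45, "sales"), ("carol", 40, "sales")]
depts = [("engineering", "bldg1"), ("sales", "bldg2")]

by_age = group_by(people, 1)
over_40 = filter_by(people, lambda row: row[1] > 40)
joined = inner_join(people, 2, depts, 0)
```

Note that the joined tuples keep the key from both sides, which matches the way a Pig JOIN carries all fields of both inputs into its output.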

• first_ten = LIMIT A 10;

• sampled = SAMPLE A 0.1;

• grouped = GROUP A BY $1 PARALLEL 10;

• User Defined Functions and Piggybank
  register 'your_path_to_piggybank/piggybank.jar';
  divs = load 'NYSE_dividends';
  backwards = foreach divs generate
  org.apache.pig.piggybank.evaluation.string.Reverse($1);
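The effect of LIMIT, SAMPLE and a Reverse-style UDF can be sketched in plain Python as follows. This is a model under simplifying assumptions, not the Piggybank implementation; the sample rows and the fixed seed are hypothetical and exist only to make the sketch repeatable.

```python
# Plain-Python model of LIMIT, SAMPLE and a string-reversing UDF (not Pig code).
import random


def limit(relation, n):
    """LIMIT: keep at most n tuples (this sketch simply takes the first n)."""
    return relation[:n]


def sample(relation, fraction, seed=0):
    """SAMPLE: keep each tuple independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [row for row in relation if rng.random() < fraction]


def reverse(text):
    """Model of a Reverse-style UDF: return the characters in reverse order."""
    return None if text is None else text[::-1]


# Hypothetical rows: (exchange, symbol).
divs = [("NYSE", "CPO"), ("NYSE", "CCS")]
backwards = [(reverse(row[1]),) for row in divs]
```

Because SAMPLE is probabilistic, a fraction of 0.1 yields roughly, not exactly, ten percent of the input.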




• Invoking static Java methods

• FLATTEN

• TOKENIZE
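TOKENIZE and FLATTEN are commonly combined for word counting. The sketch below models that combination in plain Python; it is not Pig code, it splits on whitespace only as a simplification, and the input lines and function names are illustrative assumptions.

```python
# Plain-Python model of TOKENIZE followed by FLATTEN (not Pig code).
from collections import Counter


def tokenize(text):
    """Model of TOKENIZE: split a line into a bag of one-field tuples.

    Simplification: only whitespace is treated as a delimiter here.
    """
    return [(word,) for word in text.split()]


def flatten_bags(bags):
    """Model of FLATTEN on a bag field: emit one output tuple per bag element."""
    return [element for bag in bags for element in bag]


lines = ["pigs eat anything", "pigs can fly"]
bags = [tokenize(line) for line in lines]  # one bag of words per input line
words = flatten_bags(bags)                 # one (word,) tuple per word
counts = Counter(word for (word,) in words)
```

Without the flatten step each input line would still carry a nested bag of words, so there would be nothing flat to group and count.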




0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
  – Boolean type
  – OTHERWISE branch in SPLIT
  – Maps, Bags and Tuples can be generated without UDFs
  – Register collection of jars
• Performance Improvements




Current work in progress
• DateTime data type
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint




References
• Labs are from
  – https://github.com/alanfgates/programmingpig
  – https://github.com/michiard/CLOUDS-LAB


• 0.10.0 Features and current WIP
  – http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan Gates




Hortonworks Training
The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team.
  – The “right” course, with the most extensive and realistic hands-on materials
  – Provide an immersive experience into real-world Hadoop scenarios
  – Public and Private courses available



Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable Apache Hadoop expert




Thank You!
Questions & Answers
   Ravi Mutyala
   Systems Architect
   Hortonworks
   Twitter: @rmutyala
   www.hortonworks.com




                               Page 20
     © Hortonworks Inc. 2012

More Related Content

What's hot

YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014Hortonworks
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceJ Singh
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnBDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnJerry Wen
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsDataWorks Summit
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...DataWorks Summit
 
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...Yahoo Developer Network
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopYifeng Jiang
 
Hdp developer apache spark using python (lab guide) by hortonworks university...
Hdp developer apache spark using python (lab guide) by hortonworks university...Hdp developer apache spark using python (lab guide) by hortonworks university...
Hdp developer apache spark using python (lab guide) by hortonworks university...ssusercda69b
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Practical Kerberos with Apache HBase
Practical Kerberos with Apache HBasePractical Kerberos with Apache HBase
Practical Kerberos with Apache HBaseJosh Elser
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemBryan Bende
 
Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache AmbariHortonworks
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataWorks Summit
 
De-Mystifying the Apache Phoenix QueryServer
De-Mystifying the Apache Phoenix QueryServerDe-Mystifying the Apache Phoenix QueryServer
De-Mystifying the Apache Phoenix QueryServerJosh Elser
 

What's hot (20)

YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/Reduce
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnBDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, Solutions
 
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
 
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...
November 2014 HUG: Apache Tez - A Performance View into Large Scale Data-proc...
 
Hadoop pycon2011uk
Hadoop pycon2011ukHadoop pycon2011uk
Hadoop pycon2011uk
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Hdp developer apache spark using python (lab guide) by hortonworks university...
Hdp developer apache spark using python (lab guide) by hortonworks university...Hdp developer apache spark using python (lab guide) by hortonworks university...
Hdp developer apache spark using python (lab guide) by hortonworks university...
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Practical Kerberos with Apache HBase
Practical Kerberos with Apache HBasePractical Kerberos with Apache HBase
Practical Kerberos with Apache HBase
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache Ambari
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
De-Mystifying the Apache Phoenix QueryServer
De-Mystifying the Apache Phoenix QueryServerDe-Mystifying the Apache Phoenix QueryServer
De-Mystifying the Apache Phoenix QueryServer
 

Viewers also liked

Porting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdpPorting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdpMark Kerzner
 
Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS Mark Kerzner
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupHadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupMark Kerzner
 
Oil and gas big data edition
Oil and gas  big data editionOil and gas  big data edition
Oil and gas big data editionMark Kerzner
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big DataSujee Maniyam
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Joe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiJoe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiMark Kerzner
 

Viewers also liked (12)

Porting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdpPorting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdp
 
Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS
 
Zeta architecture -2015
Zeta architecture -2015Zeta architecture -2015
Zeta architecture -2015
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupHadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
 
Oil and gas big data edition
Oil and gas  big data editionOil and gas  big data edition
Oil and gas big data edition
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
SHMcloud vision
SHMcloud visionSHMcloud vision
SHMcloud vision
 
Joe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiJoe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFi
 

Similar to Introduction to pig

Mrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataMrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataPatrickCrompton
 
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisProcess and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisHortonworks
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'Hortonworks
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to HadoopHortonworks
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2TarjeiRomtveit
 
Containerdays Intro to Habitat
Containerdays Intro to HabitatContainerdays Intro to Habitat
Containerdays Intro to HabitatMandi Walls
 
Orange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPOrange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPHortonworks
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPHortonworks
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Mac Moore
 

Similar to Introduction to pig (20)

Inside hadoop-dev
Inside hadoop-devInside hadoop-dev
Inside hadoop-dev
 
Mrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataMrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big Data
 
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisProcess and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Munich HUG 21.11.2013
Munich HUG 21.11.2013Munich HUG 21.11.2013
Munich HUG 21.11.2013
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'Don't Let Security Be The 'Elephant in the Room'
Don't Let Security Be The 'Elephant in the Room'
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
 
Hadoop In Action
Hadoop In ActionHadoop In Action
Hadoop In Action
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
 
Containerdays Intro to Habitat
Containerdays Intro to HabitatContainerdays Intro to Habitat
Containerdays Intro to Habitat
 
Orange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPOrange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDP
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDP
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Introduction to pig

  • 1. Apache Pig – Introduction and Hands-on
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala
      © Hortonworks Inc. 2012
  • 2. Big Data Platforms
      [Figure: Cost per TB, Adoption. Size of bubble = cost effectiveness of solution.]
  • 3. Topics
      • What is Pig?
      • Why Pig?
      • Language Features
      • Labs
      • 0.10.0 Features
      • Features in the pipeline
      • Q&A
  • 4. What is Pig?
      • System for processing large unstructured data
      • Uses HDFS and MapReduce
      • Data flow language
      • Scripts are compiled into a Directed Acyclic Graph (DAG)
      • Started at Yahoo! Research
      • Joined the Apache Incubator in 2007
      • Graduated to a subproject of Hadoop in 2008
      • Top-level Apache project since 2010
  • 5. Pig Philosophy
      • Pigs eat anything
      • Pigs live anywhere
      • Pigs are domesticated animals
      • Pigs can fly
  • 6. Components
      • Pig Engine – Parser, Optimizer and distributed query execution
      • Grunt – CLI shell
      • Pig Latin – Procedural Language
  • 7. Why Pig?
      • High-level language that increases programmer productivity
      • Designed for parallel data flow
      • Reduces complexity by abstracting away low-level Map and Reduce jobs and the chaining of MapReduce jobs
      • Can be run on a client/gateway machine with no configuration on the cluster
      • Multiple versions of Pig can co-exist as long as they are compatible with the Hadoop version
  • 8. Running Pig
      A Pig Latin script can be executed in 3 modes:
      • MapReduce: Code executes as MapReduce jobs on a Hadoop cluster
            $ pig myscript.pig
      • Local: Code executes locally in a single JVM using local data
            $ pig -x local myscript.pig
      • Interactive: Running pig with no script starts the Grunt shell, where commands can be run interactively
  • 9. Grunt shell
      • fs -ls
      • fs -cat filename
      • fs -copyFromLocal localfile hdfsfile
  • 10. Data Types
      • Scalar Types
          – int, long, float, double, chararray, bytearray, boolean, datetime
      • Complex Types
          – Map: collection of key-value pairs, e.g. [name#alan, age#30]
          – Tuple: ordered set of values, e.g. (alan,40,engineering)
          – Bag: unordered collection of tuples, e.g. {(alan,40,engineering),(bob,45,sales)}
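To make the types on this slide concrete, here is a minimal sketch of a schema that mixes scalar and complex fields. The file name `employees.txt` and all field names are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input file and field names, for illustration only.
emp = LOAD 'employees.txt' AS (
    name:chararray,                                 -- scalar
    age:int,                                        -- scalar
    details:map[],                                  -- map, e.g. [dept#engineering]
    address:tuple(city:chararray, zip:chararray),   -- tuple
    projects:bag{t:tuple(title:chararray)}          -- bag of tuples
);

-- Map values are looked up with '#', tuple fields with '.'.
summary = FOREACH emp GENERATE name, details#'dept', address.city;
DESCRIBE summary;
```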
  • 11.
      • Relations, and a set of operations that work on relations
      • A schema for a relation is optional
      • $0 … $n can be used to refer to fields by position
      • null means the data is undefined
      • Any missing or invalid fields are loaded as null
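A short sketch of what these points can look like in practice, assuming a hypothetical tab-delimited file `input.txt` loaded without a schema:

```pig
-- No AS clause: fields are unnamed and default to bytearray.
raw = LOAD 'input.txt';

-- Positional references: $0 is the first field, $1 the second.
pairs = FOREACH raw GENERATE $0, $1;

-- Missing or invalid fields load as null, so filter them out explicitly.
clean = FILTER pairs BY $1 IS NOT NULL;
DUMP clean;
```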
  • 12. Input and Output
      • A = LOAD 'file' USING PigStorage(',') AS (data1:datatype1, data2:datatype2, ...);
      • STORE A INTO 'file2' USING PigStorage(',');
      • DUMP A;
      • DESCRIBE A;
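The statements above are templates. A filled-in version might look like the sketch below; the file `dividends.csv`, its three columns, and the output directory name are assumptions made for this example.

```pig
-- Hypothetical comma-separated input: symbol,date,dividend
divs = LOAD 'dividends.csv' USING PigStorage(',')
       AS (symbol:chararray, date:chararray, dividend:float);

DESCRIBE divs;   -- prints the schema of divs
DUMP divs;       -- runs the job and prints the tuples to the console

-- Writes tab-separated output; the target directory must not already exist.
STORE divs INTO 'dividends_out' USING PigStorage('\t');
```

DUMP is mostly useful for debugging on small data, while STORE is the usual way to persist results.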
  • 13. Relational Operations
      (each statement is normally assigned to an alias, e.g. B = GROUP A BY age;)
      • GROUP A BY age;
      • FOREACH A GENERATE $1 - $3;
      • FILTER A BY $1 > 10;
      • ORDER A BY $1 DESC, $2;
      • JOIN A BY $1, B BY $5;
      • JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
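The operators on this slide are typically chained, with each step assigned to a new alias. The following is a minimal sketch of such a pipeline; the input files and field names are hypothetical.

```pig
-- Hypothetical inputs and field names.
users  = LOAD 'users.csv'  USING PigStorage(',')
         AS (id:int, name:chararray, age:int);
orders = LOAD 'orders.csv' USING PigStorage(',')
         AS (order_id:int, user_id:int, amount:double);

adults  = FILTER users BY age >= 18;

-- Keep every adult, even those without orders.
joined  = JOIN adults BY id LEFT OUTER, orders BY user_id;

-- Field names are unambiguous after the join, so no adults:: or orders:: prefix is needed.
by_user = GROUP joined BY name;
totals  = FOREACH by_user GENERATE group AS name, SUM(joined.amount) AS total;

sorted  = ORDER totals BY total DESC, name;
DUMP sorted;
```

Nothing runs until DUMP (or STORE) is reached, which lets Pig plan the whole data flow before launching jobs.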
  • 14.
      • LIMIT A 10;
      • SAMPLE A 0.1;
      • GROUP A BY $1 PARALLEL 10;
      • User Defined Functions (UDFs) and Piggybank
            register 'your_path_to_piggybank/piggybank.jar';
            divs = load 'NYSE_dividends';
            backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);
  • 15.
      • Invoking static Java methods
      • FLATTEN
      • TOKENIZE
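A word count is a common way to see TOKENIZE and FLATTEN working together, and Pig's dynamic invokers (such as InvokeForString) are one way to call a static Java method without writing a UDF. The sketch below assumes hypothetical input files `input.txt` and `encoded_strings.txt`.

```pig
-- Word count: TOKENIZE turns a line into a bag of words,
-- FLATTEN un-nests that bag into one record per word.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;

-- Invoking a static Java method (java.net.URLDecoder.decode) through a dynamic invoker.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded = LOAD 'encoded_strings.txt' AS (s:chararray);
decoded = FOREACH encoded GENERATE UrlDecode(s, 'UTF-8');
DUMP decoded;
```

Dynamic invokers rely on reflection, so for heavy use a dedicated UDF is usually the faster choice.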
  • 16. 0.10.0 Features
      • Ruby UDFs
      • PigStorage with schemas
      • Additional UDF improvements
      • Language Improvements
          – Boolean type
          – OTHERWISE (default branch for SPLIT)
          – Maps, Bags and Tuples can be generated without UDFs
          – Register a collection of jars
      • Performance Improvements
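As a rough sketch of two of the language improvements listed here, assuming Pig 0.10.0 or later and a hypothetical input file `numbers.txt`:

```pig
-- Boolean is a first-class scalar type and can appear in a schema.
nums = LOAD 'numbers.txt' AS (n:int, flagged:boolean);

-- OTHERWISE catches every record that matched none of the IF conditions.
SPLIT nums INTO small IF n < 10, large OTHERWISE;

DUMP small;
DUMP large;
```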
  • 17. Current work in progress
      • DateTime datatype
      • CUBE, ROLLUP and RANK operators
      • Native support for Windows
      • Lower memory footprint
  • 18. References
      • Labs are from
          – https://github.com/alanfgates/programmingpig
          – https://github.com/michiard/CLOUDS-LAB
      • 0.10.0 Features and current WIP
          – http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan Gates
  • 19. Hortonworks Training
      The expert source for Apache Hadoop training & certification
      • Role-based Developer and Administration training
          – Coursework built and maintained by the core Apache Hadoop development team
          – The "right" course, with the most extensive and realistic hands-on materials
          – Provide an immersive experience into real-world Hadoop scenarios
          – Public and private courses available
      • Comprehensive Apache Hadoop Certification
          – Become a trusted and valuable Apache Hadoop expert
  • 20. Thank You! Questions & Answers
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala
      www.hortonworks.com