Apache Pig – Introduction and
Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala




© Hortonworks Inc. 2012
Big Data Platforms
Cost per TB, Adoption
[Figure: bubble chart; size of bubble = cost effectiveness of solution]

Topics
• What is Pig?
• Why Pig ?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
• Q&A




What is Pig?
• System for processing large unstructured Data
• Uses HDFS and MapReduce
• Data flow Language
• Directed Acyclic Graph (DAG)
• Started at Yahoo! Research
• Joined Apache incubator in 2007
• Graduated to Subproject of Hadoop in 2008
• Top level project in Apache since 2010




Pig Philosophy 
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly




Components
• Pig Engine – Parser, Optimizer and distributed query
  execution
• Grunt – CLI shell
• Pig Latin – Procedural Language




Why Pig ?
• High level language that increases programmer
  productivity.
• Designed for Parallel Data flow.
• Reduces complexity by abstracting low level Map and
  Reduce jobs and Map Reduce job chaining
• Can be run on a client/gateway machine with no
  configuration on the cluster
• Multiple versions of Pig can co-exist as long as they
  are compatible with the cluster's Hadoop version.




Running Pig
Pig Latin script executes in 3 modes
• MapReduce: Code executes as MapReduce on a
  Hadoop Cluster
       $ pig myscript.pig
• Local: Code executes locally in a single JVM using
  local data
       $ pig -x local myscript.pig


• Interactive: pig with no script starts the grunt shell
  where commands can be run interactively




GRUNT shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile




Data Types
• Scalar Types
  – int, long, float, double, chararray, bytearray, boolean, datetime
• Complex Types
  – Map. Collection of key value pairs
       – [name#alan, age#30]
  – Tuple. Ordered set of values
       – (alan,40,engineering)
  – Bag. Unordered collection of tuples
       – {(alan,40,engineering),(bob,45,sales)}
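
As a hedged sketch of how these types show up in practice, the Pig Latin below declares scalar and complex fields in a LOAD schema and then reads values out of them. The file name 'employees.txt' and all relation and field names are assumptions made for illustration; they do not come from the original slides.

```pig
-- Hypothetical tab-separated input with one scalar pair, a map,
-- a tuple, and a bag of tuples per line, for example:
-- alan  40  [dept#sales,city#austin]  (bob,45)  {(carl,30),(dana,28)}
emp = LOAD 'employees.txt' AS (
    name:chararray,
    age:int,
    props:map[],
    manager:tuple(mname:chararray, mage:int),
    reports:bag{t:tuple(rname:chararray, rage:int)}
);

-- '#' looks up a map value by key; '.' projects a field out of a
-- tuple, or a column out of every tuple in a bag.
summary = FOREACH emp GENERATE name,
                               props#'dept' AS dept,
                               manager.mname,
                               reports.rname;

-- Print the resulting schema without running a job.
DESCRIBE summary;
```

The nested field names are deliberately distinct from the top-level ones so the projected columns do not collide in the output schema.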




• Relations and a set of operations that work on
  relations
• Schema for relations is optional
• $0… $n can be used for fields in relations
• null means the data is undefined.
• Any missing or invalid fields are loaded as null
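
The points above can be sketched roughly as follows: a relation is loaded without a schema, fields are addressed by position, and values that fail conversion surface as null. The file name 'people.txt' and the aliases are hypothetical.

```pig
-- No schema is declared, so every field defaults to bytearray
-- and is addressed by position.
raw = LOAD 'people.txt' USING PigStorage(',');

-- $0 is the first field, $1 the second. The cast asks for an int;
-- a value that cannot be converted becomes null.
typed = FOREACH raw GENERATE $0 AS name, (int)$1 AS age;

-- Separate usable rows from rows whose age is missing or invalid.
valid = FILTER typed BY age IS NOT NULL;
bad   = FILTER typed BY age IS NULL;

DUMP valid;
```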




Input and Output
• A = LOAD 'file' USING PigStorage(',') AS
  (data1:datatype1, data2:datatype2, ...);

• STORE A INTO 'file2' USING PigStorage(',');

• DUMP A;

• DESCRIBE A;
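
A minimal end-to-end sketch of the four statements above, assuming a hypothetical comma-separated file 'students.csv' with three columns; the file, directory, and field names are illustrative only.

```pig
-- Load with an explicit delimiter and schema.
A = LOAD 'students.csv' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float);

-- Show the schema of A; this does not launch a job.
DESCRIBE A;

-- Execute the pipeline and print the tuples to the console.
-- Best kept for small data sets.
DUMP A;

-- Execute the pipeline and write the tuples to an output directory.
-- The directory must not already exist.
STORE A INTO 'students_out' USING PigStorage(',');
```

Pig builds the plan lazily: nothing is read until a DUMP or STORE asks for output.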




Relational Operations
• GROUP A BY age;

• FOREACH B GENERATE $1 - $3;

• FILTER A BY $1 > 10;

• ORDER A BY $1 DESC, $2;

• JOIN A BY $1, B BY $5;
• JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
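
To show how these operators chain into a data flow, here is a hedged sketch that filters, groups, aggregates, joins, and sorts. The two input files and every relation and field name are assumptions for illustration, not taken from the labs.

```pig
-- Hypothetical inputs.
emp   = LOAD 'employees.csv' USING PigStorage(',')
        AS (name:chararray, age:int, dept_id:int, salary:double);
depts = LOAD 'depts.csv' USING PigStorage(',')
        AS (dept_id:int, dept_name:chararray);

-- FILTER keeps the tuples that satisfy the condition.
adults = FILTER emp BY age > 30;

-- GROUP yields one tuple per key: (group, {bag of matching tuples}).
-- The bag is named after the input relation, here 'adults'.
by_dept = GROUP adults BY dept_id;

-- FOREACH ... GENERATE emits one tuple per group with aggregates.
stats = FOREACH by_dept GENERATE group AS dept_id,
                                 COUNT(adults) AS headcount,
                                 AVG(adults.salary) AS avg_salary;

-- LEFT OUTER keeps rows of stats that have no match in depts.
joined = JOIN stats BY dept_id LEFT OUTER, depts BY dept_id;

-- After a join, '::' disambiguates fields present in both inputs.
named = FOREACH joined GENERATE stats::dept_id AS dept_id,
                                depts::dept_name AS dept_name,
                                stats::headcount AS headcount,
                                stats::avg_salary AS avg_salary;

-- ORDER sorts by one or more fields.
ranked = ORDER named BY avg_salary DESC, dept_id;
DUMP ranked;
```

Inside an operator, fields are referenced by name or position relative to that operator's input, which is why the BY clauses use plain field names.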

• LIMIT A 10;

• SAMPLE A 0.1;

• GROUP A BY $1 PARALLEL 10;

• User Defined Functions (UDFs) and Piggybank
  register 'your_path_to_piggybank/piggybank.jar';
  divs = load 'NYSE_dividends';
  backwards = foreach divs generate
  org.apache.pig.piggybank.evaluation.string.Reverse($1);
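
The statements on this slide could be combined along these lines. This is a sketch only: the schema given to 'NYSE_dividends' and the aliases are assumptions, and the jar path is the placeholder from the slide, so it must be replaced with a real location before the script will run.

```pig
-- Placeholder path from the slide; point it at an actual piggybank.jar.
REGISTER 'your_path_to_piggybank/piggybank.jar';

-- DEFINE gives the fully qualified UDF a short alias.
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Assumed schema, for illustration only.
divs = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray,
                                 pay_date:chararray, dividend:float);

-- SAMPLE keeps each tuple with the given probability (about 10% here).
sampled = SAMPLE divs 0.1;

-- LIMIT keeps at most 10 tuples.
first10 = LIMIT divs 10;

-- PARALLEL requests 10 reducers for this operator.
by_symbol = GROUP divs BY symbol PARALLEL 10;
totals    = FOREACH by_symbol GENERATE group AS symbol,
                                       SUM(divs.dividend) AS total;

-- Call the Piggybank UDF through its alias.
backwards = FOREACH totals GENERATE Reverse(symbol) AS reversed, total;
DUMP backwards;
```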




• Invoking static Java methods

• FLATTEN

• TOKENIZE
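
A word-count sketch illustrates the three items above together. The input file 'input.txt' and the aliases are hypothetical; InvokeForString is one of Pig's built-in dynamic invokers, used here to wrap the static method java.net.URLDecoder.decode(String, String).

```pig
-- Hypothetical input: one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits a chararray into a bag of words;
-- FLATTEN un-nests the bag so each word becomes its own tuple.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Wrap a static Java method so it can be called like a UDF.
-- The second argument lists the Java parameter types.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
decoded = FOREACH counts GENERATE UrlDecode(word, 'UTF-8') AS decoded_word, n;
DUMP decoded;
```

Dynamic invokers rely on reflection, so a dedicated UDF is usually the better choice when the call sits on a hot path.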




0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
  – Boolean type
  – OTHERWISE clause for SPLIT
  – Maps, Bags and Tuples can be generated without UDFs
  – Register collection of jars
• Performance Improvements
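
A few of the language improvements listed above might look like this in a script. Treat it as a sketch under assumptions: the input file, the schema, and the aliases are invented for illustration.

```pig
-- Hypothetical input with a boolean column.
A = LOAD 'students.csv' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float, enrolled:boolean);

-- Boolean fields can be compared against the constants true and false.
active = FILTER A BY enrolled == true;

-- SPLIT routes tuples by condition; OTHERWISE names a default relation
-- for the remaining tuples. A tuple that satisfies several conditions
-- is sent to each matching relation.
SPLIT active INTO honors IF gpa >= 3.5, passing IF gpa >= 2.0, rest OTHERWISE;

-- Tuples, bags, and maps built directly in GENERATE, without UDFs:
-- ( ) constructs a tuple, { } a bag, [ ] a map of key and value.
built = FOREACH honors GENERATE (name, age), {(name, age)}, [name, gpa];
DUMP built;
```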




Current work in progress
• DateTime datatype
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint




References
• Labs are from
  – https://github.com/alanfgates/programmingpig
  – https://github.com/michiard/CLOUDS-LAB


• 0.10.0 Features and current WIP
  – http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan
    Gates




Hortonworks Training
The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team.
  – The “right” course, with the most extensive and realistic hands-on materials
  – Provide an immersive experience into real-world Hadoop scenarios
  – Public and Private courses available



Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable
    Apache Hadoop expert




Thank You!
Questions & Answers
   Ravi Mutyala
   Systems Architect
   Hortonworks
   Twitter: @rmutyala
   www.hortonworks.com





 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

Introduction to Pig

  • 1. Apache Pig – Introduction and Hands-on
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala
      © Hortonworks Inc. 2012
  • 2. Big Data Platforms [figure; axes: Cost per TB, Adoption; size of bubble = cost effectiveness of solution]
  • 3. Topics
      • What is Pig?
      • Why Pig?
      • Language Features
      • Labs
      • 0.10.0 Features
      • Features in the pipeline
      • Q&A
  • 4. What is Pig?
      • System for processing large unstructured data
      • Uses HDFS and MapReduce
      • Data flow language
      • Scripts describe a Directed Acyclic Graph (DAG) of operations
      • Started at Yahoo! Research
      • Joined the Apache Incubator in 2007
      • Graduated to a subproject of Hadoop in 2008
      • Top-level Apache project since 2010
  • 5. Pig Philosophy
      • Pigs eat anything
      • Pigs live anywhere
      • Pigs are domesticated animals
      • Pigs can fly
  • 6. Components
      • Pig Engine – parser, optimizer and distributed query execution
      • Grunt – CLI shell
      • Pig Latin – procedural language
  • 7. Why Pig?
      • High-level language that increases programmer productivity
      • Designed for parallel data flow
      • Reduces complexity by abstracting low-level Map and Reduce jobs and the chaining of MapReduce jobs
      • Can be run from a client/gateway machine with no configuration on the cluster
      • Multiple versions of Pig can co-exist as long as they are compatible with the cluster's Hadoop version
  • 8. Running Pig
      Pig Latin scripts execute in 3 modes:
      • MapReduce: code executes as MapReduce jobs on a Hadoop cluster
          $ pig myscript.pig
      • Local: code executes locally in a single JVM using local data
          $ pig -x local myscript.pig
      • Interactive: running pig with no script starts the Grunt shell, where commands can be run interactively
  • 9. Grunt shell
      • fs -ls
      • fs -cat filename
      • fs -copyFromLocal localfile hdfsfile
  • 10. Data Types
      • Scalar types
          – int, long, float, double, chararray, bytearray, boolean, datetime
      • Complex types
          – Map. Collection of key-value pairs: [name#alan, age#30]
          – Tuple. Ordered set of values: (alan,40,engineering)
          – Bag. Unordered collection of tuples: {(alan,40,engineering),(bob,45,sales)}
  • 11.
      • A script is built from relations and a set of operations that work on relations
      • A schema for a relation is optional
      • $0 … $n can be used to refer to fields in a relation by position
      • null means the data is undefined
      • Any missing or invalid fields are loaded as null
  • 12. Input and Output
      • A = LOAD 'file' USING PigStorage(',') AS (data1:datatype1, data2:datatype2, ...);
      • STORE A INTO 'file2' USING PigStorage(',');
      • DUMP A;
      • DESCRIBE A;
  • 13. Relational Operations
      • GROUP A BY age;
      • FOREACH A GENERATE $1 - $3;
      • FILTER A BY $1 > 10;
      • ORDER A BY $1 DESC, $2;
      • JOIN A BY $1, B BY $5;
      • JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
      Fields are referenced by name or by position ($n) without a relation prefix; after a JOIN, A::field disambiguates fields that share a name.
  • 14.
      • LIMIT A 10;
      • SAMPLE A 0.1;
      • GROUP A BY $1 PARALLEL 10;
      • User Defined Functions (UDFs) and Piggybank
          register 'your_path_to_piggybank/piggybank.jar';
          divs = load 'NYSE_dividends';
          backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);
  • 15.
      • Invoking static Java methods
      • FLATTEN
      • TOKENIZE
  • 16. 0.10.0 Features
      • Ruby UDFs
      • PigStorage with schemas
      • Additional UDF improvements
      • Language improvements
          – Boolean type
          – otherwise
          – Maps, bags and tuples can be generated without UDFs
          – Register a collection of jars
      • Performance improvements
  • 17. Current work in progress
      • DateTime datatype
      • CUBE, ROLLUP and RANK operators
      • Native support for Windows
      • Lower memory footprint
  • 18. References
      • Labs are from
          – https://github.com/alanfgates/programmingpig
          – https://github.com/michiard/CLOUDS-LAB
      • 0.10.0 features and current work in progress
          – http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan Gates
  • 19. Hortonworks Training
      The expert source for Apache Hadoop training & certification
      • Role-based Developer and Administration training
          – Coursework built and maintained by the core Apache Hadoop development team
          – The “right” course, with the most extensive and realistic hands-on materials
          – Provides an immersive experience in real-world Hadoop scenarios
          – Public and private courses available
      • Comprehensive Apache Hadoop Certification
          – Become a trusted and valuable Apache Hadoop expert
  • 20. Thank You! Questions & Answers
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala
      www.hortonworks.com
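
The operators listed on slides 12 to 15 can be combined into one short script. What follows is a minimal sketch of a word count, not something taken from the slides: the paths input.txt and wordcount_out and every alias name are hypothetical, while TextLoader, TOKENIZE, FLATTEN, COUNT and PigStorage are Pig built-ins.

```pig
-- Word count sketch. 'input.txt' and 'wordcount_out' are hypothetical paths.

-- Load each line of the input as a single chararray field.
lines   = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE splits a line into a bag of words;
-- FLATTEN un-nests that bag into one row per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- GROUP yields rows of the form (group, {bag of matching tuples}).
grouped = GROUP words BY word;

-- COUNT returns the size of each group's bag.
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Most frequent words first; ties broken alphabetically.
ordered = ORDER counts BY n DESC, word;
top10   = LIMIT ordered 10;

STORE top10 INTO 'wordcount_out' USING PigStorage(',');
```

Saved under a hypothetical name such as wordcount.pig, the script could be tried in local mode with `pig -x local wordcount.pig`, the invocation described on slide 8. Replacing the STORE line with `DUMP top10;` prints the result to the console instead, which is usually more convenient in the Grunt shell.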