SlideShare a Scribd company logo
1 of 1
PIG Scripting Making Pig Turing-complete through embedding in a scripting language Julien Le Dem - Yahoo Overview Pig execution model  - Map/Reduce: a solution to process big data.   - Pig: makes it easier to manipulate data on the grid.   - Pig scripting: makes Pig easier with iterative algorithms and User Defined Functions. Jiras: PIG-1479, PIG-1794 (in 0.9), PIG-928 (in 0.8) Example: Transitive closure  - Iterative process: requires a loop and a termination condition  - Requires multiple join/group by: typical Pig usage  - Requires User Defined Functions Solution 2: using Pig scripting Solution 1: plain Pig 1 file required: 7 files required: UDFs take the elements of the tuple as parameters, not a tuple. The output schema is specified using a decorator (it can be a function if you need to manipulate the input schema) Modification flow: Modification flow: UDFs use standard Python constructs (tuple/list/dictionary) automatically converted to Pig. Python functions are automatically available as UDFs Embedded Pig calls in Python Python variables can be used in the pig scripts $n, $i, ... Unfortunately this solution does not fit here. It requires 7 artifacts: ,[object Object]

More Related Content

What's hot

What's hot (9)

Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
 
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
 
Programming with Python - Adv.
Programming with Python - Adv.Programming with Python - Adv.
Programming with Python - Adv.
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
오픈소스 라이브러리 개발기
오픈소스 라이브러리 개발기오픈소스 라이브러리 개발기
오픈소스 라이브러리 개발기
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
The Benefits of Type Hints
The Benefits of Type HintsThe Benefits of Type Hints
The Benefits of Type Hints
 
Using MPI
Using MPIUsing MPI
Using MPI
 
Open source projects with python
Open source projects with pythonOpen source projects with python
Open source projects with python
 

Viewers also liked

Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 

Viewers also liked (8)

Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 

Similar to Poster Hadoop summit 2011: pig embedding in scripting languages

Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Viswanath Gangavaram
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
Viswanath Gangavaram
 

Similar to Poster Hadoop summit 2011: pig embedding in scripting languages (20)

Multithreaded_Programming_in_Python.pdf
Multithreaded_Programming_in_Python.pdfMultithreaded_Programming_in_Python.pdf
Multithreaded_Programming_in_Python.pdf
 
Multithreading by rj
Multithreading by rjMultithreading by rj
Multithreading by rj
 
Introduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep LearningIntroduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep Learning
 
concurrency
concurrencyconcurrency
concurrency
 
pythontraining-201jn026043638.pptx
pythontraining-201jn026043638.pptxpythontraining-201jn026043638.pptx
pythontraining-201jn026043638.pptx
 
Python training
Python trainingPython training
Python training
 
Multicore
MulticoreMulticore
Multicore
 
Pacman game computer investigatory project
Pacman game computer investigatory projectPacman game computer investigatory project
Pacman game computer investigatory project
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!
 
PHYTON-REPORT.pdf
PHYTON-REPORT.pdfPHYTON-REPORT.pdf
PHYTON-REPORT.pdf
 
Debugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBDebugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDB
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdf
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
 
pythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docxpythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docx
 

More from Julien Le Dem

More from Julien Le Dem (19)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 

Poster Hadoop summit 2011: pig embedding in scripting languages

  • 1.
  • 2. 3 Pig scripts in 3 separate files: init, main loop content, finalization. The average Pig script size is 5 to 10 lines.
  • 3.
  • 4. 3 Pig queries: init, main loop content, finalization. Each Pig query adds an average 5 to 10 lines.
  • 5. Main function: executes and coordinates the Pig scripts. It adds about 10 lines to the script.Access to the output of pig scripts