SlideShare a Scribd company logo
Developing Pig on TezDeveloping Pig on Tez
Mark WagnerMark Wagner
Committer, Apache PigCommitter, Apache Pig
LinkedInLinkedIn
Cheolsoo ParkCheolsoo Park
VP, Apache PigVP, Apache Pig
NetflixNetflix
What is Pig
●
Apache project since 2008
●
Higher level language for Hadoop that provides a dataflow language
with a MapReduce based execution engine
A = LOAD 'input.txt';
B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
Pig Concepts
●
LOAD
●
STORE
●
FOREACH ___ GENERATE ___
●
FILTER ___ BY ___
Pig Concepts
GROUP ___ BY ___
●
'Blocking' operator
●
Translates to a MapReduce shuffle
Pig Concepts
Joins:
●
Hash Join
●
Replicated Join
●
Skewed Join
Pig Latin
A = LOAD 'input.txt';
B = FOREACH A GENERATE
flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
Logical Plan
Physical Plan
Map Reduce Plan Map
Reduce
What's the problem
●
Extra intermediate output
●
Artificial synchronization barriers
●
Inefficient use of resources
●
Multiquery Optimizer
●
Alleviates some problems
●
Has its own
Apache Tez
●
Incubating project
●
Express data processing as a directed acyclic graph
●
Runs on YARN
●
Aims for lower latency and higher throughput than Map Reduce
Tez Concepts
●
Job expressed as directed acyclic graph (DAG)
●
Processing done at vertices
●
Data flows along edges
Mapper
Reducer
Processor Processor
Processor
Processor
Benefits & Optimizations
●
Fewer synchronization barriers
●
Container Reuse
●
Object caches at the vertices
●
Dynamic parallelism estimation
●
Custom data transfer between processors
What we've done for Pig
●
New execution engine based on Tez
●
Physical Plan translated to Tez Plan instead of Map Reduce Plan
●
Same Physical Plan and operators
●
Custom processors run the execution plan on Tez
Along the way
●
New pluggable execution backend
●
Made operator set more generic
●
Motivated Tez improvements
Group By
LOAD
GROUP BY, SUM
Identity
GROUP BY
HDFS
LOAD
GROUP BY, STORE
GROUP BY, SUM
f = LOAD ‘foo’
AS (x:int, y:int);
g = GROUP f BY x;
h = FOREACH g GENERATE
group AS r,
SUM(f.y) as s;
i = GROUP h BY s;
Join
LOAD l, r
JOIN, STORE
LOAD r
JOIN, STORE
LOAD l
l = LOAD ‘left’ AS (x, y);
r = LOAD ‘right’ AS (x, z);
j = JOIN l BY x, r BY x;
Group By
LOAD
GROUP f BY x,
GROUP f BY y
LOAD g, h
JOIN
HDFS
LOAD
JOIN
GROUP BYGROUP BY
f = LOAD ‘foo’
AS (x:int, y:int);
g = GROUP f BY x;
h = GROUP f BY y;
i = JOIN g BY group,
h BY group;
Order By
SAMPLE
AGGREGATE
PARTITION
SORT
HDFS
LOAD, SAMPLE
SORT
PARTITION
AGGREGATE
f = LOAD ‘foo’ AS (x, y);
o = ORDER f BY x;
Performance Comparison
Replicated Join
(2.8x)
Join + Group By
(1.5x)
Join + Group By +
Order By (1.5x)
3 way Split + Join
+ Group By (2.6x)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
Map Reduce
Tez
How it started
Shared interests across organizations
• Similar data platform architecture.
• Pig for ETL jobs
• Hive for ad-hoc queries
How it started
Shared interests across organizations
• Hortonworks wants Tez to succeed.
Community meet-ups helped
• Twitter presented summer intern’s POC work at Tez meet-up.
• Pig devs exchanged interests.
Organizing team
Organizing team
Community meet-ups helped
• Tez team hosted tutorial sessions for Pig devs.
• Pig team got together to brainstorm implementation design.
Companies showed commitment to the project
• Hortonworks: Daniel Dai
• LinkedIn: Alex Bain, Mark Wagner
• Netflix: Cheolsoo Park
• Yahoo: Olga Natkovich, Rohini Palaniswamy
Building trust
Make Pig 2x faster within 6 months
• Hive-on-Tez showed 2x performance gain.
• Rewriting the Pig backend within 6 months seemed reasonable.
Setting goals
Acting as team
Sprint
• Monthly planning meetings
• Twice-a-week stand-up conference calls
Issues / discussions
• PIG-3446 umbrella jira for Pig on Tez
• Whiteboard discussions at meetings
• Pig old timer Daniel Dai acted as mentor.
• Everyone got to work on core functionalities.
• Everyone became an expert on the Pig backend.
Knowledge transfer
Sharing credit
• Elected as a new committer and PMC chair.
• Gave talks at Hadoop User Group and Pig User Group meet-ups.
• Speaking at ApacheCon and upcoming Hadoop Summit.
Further collaborations
Looking for more collaborations
• Parquet Hive SerDe improvements.
• Sharing experiences with SQL-on-Hadoop solutions.
Mind shift
“If we can’t hire all these good people, why don’t we use them in a
collaboration?”
• Collaboration instead of competition.
Mind shift
“Why do we reinvent the wheel?”
• Share the same technologies while creating different services.
Believe in the Apache way

More Related Content

Similar to Developing Pig on Tez (ApacheCon 2014)

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
knowbigdata
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
Subhas Kumar Ghosh
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Nick Dimiduk
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
Wei-Yu Chen
 
Apache PIG
Apache PIGApache PIG
Apache PIG
Anuja Gunale
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gates
trihug
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
Nathan Bijnens
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)
ch koti
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
Yahoo Developer Network
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
David Morin
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
Adam Doyle
 
Apache Pig
Apache PigApache Pig
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
Hadoop User Group
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
Hortonworks
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
When JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft productsWhen JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft products
Anthony Viard
 

Similar to Developing Pig on Tez (ApacheCon 2014) (20)

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gates
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
When JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft productsWhen JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft products
 

Recently uploaded

Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
NazakatAliKhoso2
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 

Recently uploaded (20)

Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 

Developing Pig on Tez (ApacheCon 2014)

  • 1. Developing Pig on TezDeveloping Pig on Tez Mark WagnerMark Wagner Committer, Apache PigCommitter, Apache Pig LinkedInLinkedIn Cheolsoo ParkCheolsoo Park VP, Apache PigVP, Apache Pig NetflixNetflix
  • 2. What is Pig ● Apache project since 2008 ● Higher level language for Hadoop that provides a dataflow language with a MapReduce based execution engine A = LOAD 'input.txt'; B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO './output.txt';
  • 3. Pig Concepts ● LOAD ● STORE ● FOREACH ___ GENERATE ___ ● FILTER ___ BY ___
  • 4. Pig Concepts GROUP ___ BY ___ ● 'Blocking' operator ● Translates to a MapReduce shuffle
  • 6. Pig Latin A = LOAD 'input.txt'; B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO './output.txt';
  • 9. Map Reduce Plan Map Reduce
  • 10. What's the problem ● Extra intermediate output ● Artificial synchronization barriers ● Inefficient use of resources ● Multiquery Optimizer ● Alleviates some problems ● Has its own
  • 11. Apache Tez ● Incubating project ● Express data processing as a directed acyclic graph ● Runs on YARN ● Aims for lower latency and higher throughput than Map Reduce
  • 12. Tez Concepts ● Job expressed as directed acyclic graph (DAG) ● Processing done at vertices ● Data flows along edges Mapper Reducer Processor Processor Processor Processor
  • 13. Benefits & Optimizations ● Fewer synchronization barriers ● Container Reuse ● Object caches at the vertices ● Dynamic parallelism estimation ● Custom data transfer between processors
  • 14. What we've done for Pig ● New execution engine based on Tez ● Physical Plan translated to Tez Plan instead of Map Reduce Plan ● Same Physical Plan and operators ● Custom processors run the execution plan on Tez
  • 15. Along the way ● New pluggable execution backend ● Made operator set more generic ● Motivated Tez improvements
  • 16. Group By LOAD GROUP BY, SUM Identity GROUP BY HDFS LOAD GROUP BY, STORE GROUP BY, SUM f = LOAD ‘foo’ AS (x:int, y:int); g = GROUP f BY x; h = FOREACH g GENERATE group AS r, SUM(f.y) as s; i = GROUP h BY s;
  • 17. Join LOAD l, r JOIN, STORE LOAD r JOIN, STORE LOAD l l = LOAD ‘left’ AS (x, y); r = LOAD ‘right’ AS (x, z); j = JOIN l BY x, r BY x;
  • 18. Group By LOAD GROUP f BY x, GROUP f BY y LOAD g, h JOIN HDFS LOAD JOIN GROUP BYGROUP BY f = LOAD ‘foo’ AS (x:int, y:int); g = GROUP f BY x; h = GROUP f BY y; i = JOIN g BY group, h BY group;
  • 20. Performance Comparison Replicated Join (2.8x) Join + Group By (1.5x) Join + Group By + Order By (1.5x) 3 way Split + Join + Group By (2.6x) 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Map Reduce Tez
  • 21. How it started Shared interests across organizations • Similar data platform architecture. • Pig for ETL jobs • Hive for ad-hoc queries
  • 22. How it started Shared interests across organizations • Hortonworks wants Tez to succeed.
  • 23. Community meet-ups helped • Twitter presented summer intern’s POC work at Tez meet-up. • Pig devs exchanged interests. Organizing team
  • 24. Organizing team Community meet-ups helped • Tez team hosted tutorial sessions for Pig devs. • Pig team got together to brainstorm implementation design.
  • 25. Companies showed commitment to the project • Hortonworks: Daniel Dai • LinkedIn: Alex Bain, Mark Wagner • Netflix: Cheolsoo Park • Yahoo: Olga Natkovich, Rohini Palaniswamy Building trust
  • 26. Make Pig 2x faster within 6 months • Hive-on-Tez showed 2x performance gain. • Rewriting the Pig backend within 6 months seemed reasonable. Setting goals
  • 27. Acting as team Sprint • Monthly planning meetings • Twice-a-week stand-up conference calls Issues / discussions • PIG-3446 umbrella jira for Pig on Tez • Whiteboard discussions at meetings
  • 28. • Pig old timer Daniel Dai acted as mentor. • Everyone got to work on core functionalities. • Everyone became an expert on the Pig backend. Knowledge transfer
  • 29. Sharing credit • Elected as a new committer and PMC chair. • Gave talks at Hadoop User Group and Pig User Group meet-ups. • Speaking at ApacheCon and upcoming Hadoop Summit.
  • 30. Further collaborations Looking for more collaborations • Parquet Hive SerDe improvements. • Sharing experiences with SQL-on-Hadoop solutions.
  • 31. Mind shift “If we can’t hire all these good people, why don’t we use them in a collaboration?” • Collaboration instead of competition.
  • 32. Mind shift “Why do we reinvent the wheel?” • Share the same technologies while creating different services.
  • 33. Believe in the Apache way