How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations
http://pig.apache.org
Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member
What is Pig? Pig Latin, a high-level data processing language, and an engine that executes Pig Latin locally or on a Hadoop cluster. (Pig-latin-cup picture from http://www.flickr.com/photos/frippy/2507970530/)
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Comparison with MR in Java: 1/20 the lines of code, 1/16 the development time. What about performance?
Pig Compared to Map Reduce
And, You Don't Lose Power
Pig performance
Pig optimization principles
Logical Optimizations
Diagram: Script (A = load; B = foreach; C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized Logical Plan (A -> C -> B)
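The reordering in the diagram (the filter C moving ahead of the foreach B) can be sketched roughly as follows. This is a minimal illustration, not Pig's optimizer: the `Op` record, the `reads`/`creates` field sets, and the `push_filters_up` function are all hypothetical names, and the plan is assumed to be a simple linear list of operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operator in a linear logical plan (hypothetical representation)."""
    name: str                          # alias, e.g. "A", "B", "C"
    kind: str                          # "load", "foreach" or "filter"
    reads: frozenset = frozenset()     # fields the operator looks at
    creates: frozenset = frozenset()   # fields the operator computes


def push_filters_up(plan):
    """Move each filter ahead of a preceding foreach when that is safe.

    The swap is treated as safe only if the filter does not read a field
    that the foreach computes.
    """
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            if (cur.kind == "filter" and prev.kind == "foreach"
                    and not (cur.reads & prev.creates)):
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


# Assumed example script: the filter only reads a loaded field.
script = [
    Op("A", "load", creates=frozenset({"name", "age"})),
    Op("B", "foreach", reads=frozenset({"name"}),
       creates=frozenset({"upper_name"})),
    Op("C", "filter", reads=frozenset({"age"})),
]
optimized = push_filters_up(script)
print(" -> ".join(op.name for op in optimized))  # A -> C -> B
```

Filtering earlier means the foreach processes fewer records; a filter that reads a field computed by the foreach stays where it is.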
Physical Optimizations
Diagram: Optimized Logical Plan (X -> Y -> Z) -> Translator -> Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z) -> Optimizer -> Optimized Phy/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z), where M = map, C = combine, R = reduce
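The optimized plan in the diagram adds a combine stage C between map and reduce. The effect of a combiner can be sketched with a simple count; this is an illustration under assumed data, and the function names (`map_phase`, `combine`, `reduce_phase`) are hypothetical rather than Pig or Hadoop APIs.

```python
from collections import defaultdict


def map_phase(block):
    """Emit (key, 1) for every record in one input block."""
    return [(key, 1) for key in block]


def combine(pairs):
    """Sum the counts per key on the map side, before the shuffle."""
    partial = defaultdict(int)
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())


def reduce_phase(all_pairs):
    """Sum the (possibly pre-aggregated) counts per key."""
    totals = defaultdict(int)
    for key, count in all_pairs:
        totals[key] += count
    return dict(totals)


# Two assumed input blocks, one per map task.
blocks = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]

plain = [pair for b in blocks for pair in map_phase(b)]
combined = [pair for b in blocks for pair in combine(map_phase(b))]

print(len(plain), len(combined))                       # 6 4
print(reduce_phase(plain) == reduce_phase(combined))   # True
```

The reducer sees the same totals either way, but fewer pairs cross the shuffle when the combiner runs. This works because summing counts is associative and commutative.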
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred) (2, fred) (2, fred); Reducer 2 receives (1, jane) (2, jane) (2, jane).
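The tagging shown in the diagram, where each map output carries the index of the input it came from, can be sketched as a reduce-side join. This is a single-process illustration under assumed data; `map_tag`, `shuffle`, and `reduce_join` are hypothetical names, not Pig internals.

```python
from collections import defaultdict
from itertools import product


def map_tag(records, tag, key_index):
    """Emit (join_key, (tag, record)) so the reducer can tell inputs apart."""
    return [(rec[key_index], (tag, rec)) for rec in records]


def shuffle(pairs):
    """Group map output by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(groups):
    """For each key, cross the records tagged 1 with the records tagged 2."""
    joined = []
    for key in sorted(groups):
        left = [rec for tag, rec in groups[key] if tag == 1]
        right = [rec for tag, rec in groups[key] if tag == 2]
        joined.extend(l + r for l, r in product(left, right))
    return joined


# Assumed sample data: tag 1 = Pages, tag 2 = Users, as in the diagram.
pages = [("fred", "p1"), ("jane", "p2"), ("fred", "p3")]
users = [("fred", 25), ("jane", 31), ("zach", 40)]

pairs = map_tag(pages, 1, 0) + map_tag(users, 2, 0)
result = reduce_join(shuffle(pairs))
for row in result:
    print(row)
```

All records for one key meet at the same reducer, which is what makes a heavily repeated key a problem for this strategy.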
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name); two SP markers appear in the diagram. Reducer 1 receives (1, fred, p1) (1, fred, p2) (2, fred); Reducer 2 receives (1, fred, p3) (1, fred, p4) (2, fred).
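In the diagram, the Pages records for the hot key fred are split across two reducers, and the matching Users record is sent to both. A rough sketch of that partitioning follows. It assumes the hot keys are already known (for example from sampling) and uses a simple round-robin split; `partition_skewed`, `join_reducer`, `hot_keys`, and `splits` are hypothetical names, and the exact assignment of records to reducers differs from the slide.

```python
from collections import defaultdict
from itertools import count


def partition_skewed(pages, users, hot_keys, splits):
    """Assign records to reducers, spreading each hot key over `splits` slots.

    pages: the skewed input, records of (user, url)
    users: the other input, records of (name, age)
    hot_keys: keys assumed to have been identified beforehand
    """
    reducers = defaultdict(lambda: {"pages": [], "users": []})
    round_robin = defaultdict(count)   # one counter per hot key

    for rec in pages:
        key = rec[0]
        slot = next(round_robin[key]) % splits if key in hot_keys else 0
        reducers[(key, slot)]["pages"].append(rec)

    for rec in users:
        key = rec[0]
        # Replicate the other side to every slot that holds the hot key.
        slots = range(splits) if key in hot_keys else [0]
        for slot in slots:
            reducers[(key, slot)]["users"].append(rec)
    return reducers


def join_reducer(bucket):
    """Join the two inputs that landed in one reducer slot."""
    return [p + u for p in bucket["pages"] for u in bucket["users"]]


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"),
         ("jane", "p5")]
users = [("fred", 25), ("jane", 31)]

reducers = partition_skewed(pages, users, hot_keys={"fred"}, splits=2)
joined = [row for bucket in reducers.values() for row in join_reducer(bucket)]
print(len(joined))
```

Each slot for fred holds half of the Pages records plus a copy of the Users record, so no single reducer has to process the whole hot key.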
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
Diagram: Pages and Users are both sorted by the join key (aaron ... zach). Map 1 reads Pages aaron…amr together with Users starting at aaron; Map 2 reads Pages amy…barb together with Users starting at amy.
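Because both inputs are already sorted on the join key, each map task can walk the two inputs in step. The core of that idea is a classic sort-merge join, sketched here on in-memory lists; `merge_join` and the sample data are assumptions for illustration, not Pig's implementation.

```python
def merge_join(left, right):
    """Join two lists of (key, value) records that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left record in the run with every right record.
            while i < len(left) and left[i][0] == lkey:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


pages = [("aaron", "p1"), ("aaron", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 22), ("amr", 30), ("amy", 27)]

result = merge_join(pages, users)
for row in result:
    print(row)
```

Neither input is shuffled or held fully in memory; each side is read once in order.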
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
Diagram: Pages is the large input (aaron ... zach); Users is small (aaron ... zach). Map 1 reads Pages aaron…amr plus a full copy of Users; Map 2 reads Pages amy…barb plus a full copy of Users.
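With a full copy of the small input available to every map task, the join can finish on the map side. A minimal sketch, assuming Users fits in memory; `build_lookup` and `map_join` are hypothetical names for illustration.

```python
from collections import defaultdict


def build_lookup(small_input):
    """Load the whole small input into an in-memory table keyed by join key."""
    table = defaultdict(list)
    for rec in small_input:
        table[rec[0]].append(rec)
    return table


def map_join(pages_block, users_by_name):
    """Join one block of the large input inside the map task; no reduce phase."""
    return [page + user
            for page in pages_block
            for user in users_by_name.get(page[0], [])]


users = [("aaron", 22), ("zach", 40)]
page_blocks = [
    [("aaron", "p1"), ("amr", "p2")],   # block read by Map 1
    [("amy", "p3"), ("zach", "p4")],    # block read by Map 2
]

lookup = build_lookup(users)
outputs = [map_join(block, lookup) for block in page_blocks]
print(outputs)
```

Each map task probes the same lookup table independently, so the large input is never shuffled.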
Group/cogroup optimizations
Diagram: Pages sorted by key (aaron, aaron, barney, carol, ..., zach). Map 1 reads aaron, aaron, barney; Map 2 reads carol, ...
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
Diagram: A: load -> B: filter, which then splits into two branches: C1: group -> eval udf -> store into 'bydemo', and C2: group -> eval udf -> store into 'bystate'.
Multi-Store Map-Reduce Plan
Map: filter -> split -> local rearrange (one per branch)
Reduce: multiplex -> package -> foreach (one per branch)
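The point of the multi-store plan is that the input is loaded and filtered once, with both group-by branches fed from the same pass. A single-process sketch of that sharing, assuming records shaped like the (name, age, gender, city, state) schema in the multi-store script; `run_multi_store` is a hypothetical name and this does not model the actual map and reduce stages.

```python
from collections import Counter


def run_multi_store(records):
    """Scan and filter the input once, feeding both group-by pipelines."""
    by_demo = Counter()
    by_state = Counter()
    for name, age, gender, city, state in records:
        if name is None:                 # filter: name is not null
            continue
        by_demo[(age, gender)] += 1      # group by age, gender; COUNT
        by_state[state] += 1             # group by state; COUNT
    return dict(by_demo), dict(by_state)


# Assumed sample records.
records = [
    ("fred", 25, "m", "Austin", "TX"),
    ("jane", 25, "f", "Dallas", "TX"),
    (None, 30, "m", "Reno", "NV"),
    ("zach", 25, "m", "Fresno", "CA"),
]

bydemo, bystate = run_multi_store(records)
print(bydemo)
print(bystate)
```

Running the two stores as separate jobs would read and filter the same input twice; sharing the scan avoids that.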
Memory Management
Other optimizations
Future optimization work
Pig - fast and flexible. Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Further reading

Apache Pig performance optimizations talk at ApacheCon 2010

Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
 
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
College Call Girls Pune Mira 9907093804 Short 1500 Night 6000 Best call girls...
 
Bangalore Call Girls Hebbal Kempapura Number 7001035870 Meetin With Bangalor...
Bangalore Call Girls Hebbal Kempapura Number 7001035870  Meetin With Bangalor...Bangalore Call Girls Hebbal Kempapura Number 7001035870  Meetin With Bangalor...
Bangalore Call Girls Hebbal Kempapura Number 7001035870 Meetin With Bangalor...
 

Apache Pig performance optimizations talk at ApacheCon 2010

  • 1. How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations http://pig.apache.org Thejas Nair pig team @ Yahoo! Apache pig PMC member
  • 2. What is Pig? Pig Latin, a high level data processing language. An engine that executes Pig Latin locally or on a Hadoop cluster. Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
  • 3. Pig Latin example Users = load 'users' as (name, age); Fltrd = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Jnd = join Fltrd by name, Pages by user;
  • 4. Comparison with MR in Java 1/20 the lines of code 1/16 the development time What about Performance ?
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Hash Join Pages Users Users = load 'users' as (name, age); Pages = load 'pages' as (user, url); Jnd = join Users by name, Pages by user; Map 1 Pages block n Map 2 Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred) (2, fred) (2, fred) (1, jane) (2, jane) (2, jane)
  • 12. Skew Join Pages Users Users = load 'users' as (name, age); Pages = load 'pages' as (user, url); Jnd = join Pages by user, Users by name using 'skewed'; Map 1 Pages block n Map 2 Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred, p1) (1, fred, p2) (2, fred) (1, fred, p3) (1, fred, p4) (2, fred) SP SP
  • 13. Merge Join Pages Users aaron . . . . . . . . zach aaron . . . . . . zach Users = load 'users' as (name, age); Pages = load 'pages' as (user, url); Jnd = join Pages by user, Users by name using 'merge'; Map 1 Map 2 Users Users Pages Pages aaron… amr aaron … amy… barb amy …
  • 14. Replicated Join Pages Users aaron aaron . . . . . . . zach aaron . zach Users = load 'users' as (name, age); Pages = load 'pages' as (user, url); Jnd = join Pages by user, Users by name using 'replicated'; Map 1 Map 2 Users Pages Pages aaron… amr aaron . zach amy… barb Users aaron . zach
  • 15.
  • 16. Multi-store script A = load 'users' as (name, age, gender, city, state); B = filter A by name is not null; C1 = group B by age, gender; D1 = foreach C1 generate group, COUNT(B); store D1 into 'bydemo'; C2 = group B by state; D2 = foreach C2 generate group, COUNT(B); store D2 into 'bystate'; A: load B: filter C2: group C1: group C3: eval udf C2: eval udf store into 'bystate' store into 'bydemo'
  • 17. Multi-Store Map-Reduce Plan map filter local rearrange split local rearrange reduce multiplex package package foreach foreach
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.

Editor's Notes

  1. Pig performance has been improving because of the optimizations that keep getting added. These optimizations can be applied to other map-reduce programs as well. We will begin with a very brief introduction of pig, and then discuss query optimization strategies and techniques used in pig.
  2. There are two aspects of pig - pig-latin the language, and the execution engine.
  3. This is an example of what a pig script looks like. Each statement defines a relation, and the left hand side of the statement gives the name assigned to that relation. The first statement loads the user information, which can be a file on hdfs, and names the first two columns name and age. The second statement filters the user information based on age. The third statement loads the pages data, where the first two columns are user and url. The last statement joins the filtered user data and the page data on the user name.
  4. But why pig and pig-latin? Why not just use java MR? This is what we found: for a query, writing the problem in pig-latin meant that the code had 1/20 the number of lines and took only 1/16 the development time. But there must be something to show for all the hard work that goes into writing the java MapReduce code. What about performance? There is some overhead from the pipeline of operators and the function calls in the MR plan generated from pig-latin, but the runtime is usually within 20% of the runtime of the map-reduce code. And if the task involves more complex operations, such as a join on skewed data, the chances are high that the pig query will beat the MapReduce job runtime by a large margin.
  5. Data flow: You can write your data flows in a high level language (Pig Latin) instead of a low level language (java) that is really meant for logic flow. Standard operations: Much less code to write. No need to maintain libraries of your own relational operations. Managing details of MR: No need to worry about how many map reduce jobs to decompose your work into. No need to manage data flow, fault tolerance, etc. across that set of map reduce jobs.
  6. UDFs: User Defined Functions. Metadata: Metadata is not required, but it is supported and used when available. This means there is no need to create tables, define schemas, etc. Any file on HDFS can be read. Data model: Pig does not impose a data model on you. It works with structured or unstructured data, flat or nested data. An example of unstructured data is web pages; an example of structured data is database records. Nested data: Scalar and nested data types are supported. Nested data might be a list of maps or a list of records inside another record. Procedural: Fine grained control; one line equals one action. No need to depend on an optimizer to choose actions in the (hopefully) best order for you. A Pig program describes a data flow graph.
  7. Where does pig stand compared to java MR in terms of performance? We have what we call Pigmix, which is a set of queries used to test pig performance from release to release. It compares the performance gap between direct use of map-reduce and using pig. Performance has steadily improved across releases, and we have had 7 releases in around the last two years, since it became part of apache. In the next version, 0.8, which will be out in a few days, the ratio is around 0.9. The map-reduce queries in pigmix don’t have all the optimizations that are present in pig, because implementing them involves a lot of effort. Not all pig optimizations are tested in pigmix. One example is skew-join in pig, which enables joining of tables where there are a large number of records for some values of the join key. The naïve implementation of join in map-reduce will run out of memory. So pigmix tells only part of the story. http://wiki.apache.org/pig/PigMix
  8. Relational databases have a lot of optimizations for improving the query execution strategy. What makes pig different? A traditional DBMS searches for an optimal execution plan over models of the data, operators and execution environment. But systems such as pig are used in environments where accurate models are not available a priori. The data is usually in files, for ease of interoperability with other tools. Operator costs can vary based on user defined functions and custom binaries/map-reduce jobs. Large clusters can have unreliable machines, can be made of heterogeneous machines, and can have different loads. Use available information such as file sizes (e.g. consolidate small files into larger ones). Trust the user to know data properties: since pig can operate in the absence of meta-data, the user tells pig if it should use optimizations that work on sorted data. Use rules that should help in most cases, e.g. pushing a filter up early in the plan is likely to reduce data. Runtime information is used in the query plan: data is sampled for order-by queries and some joins. There is potential to use information from intermediate data processing steps. Olston et al, “Automatic Optimization of Parallel Dataflow Programs” http://infolab.stanford.edu/~olston/publications/usenix08.pdf
  9. There are two stages of optimization: logical and physical. During the logical optimization stage, the graph of dataflow operations specified through the pig query is restructured. Filtering and projecting ahead of more expensive operations is likely to reduce cost. Multiple foreach and filter statements can be combined together. Some operators can potentially be re-written, e.g. cross + filter can be converted to a join in some cases.
  10. The logical plan is compiled into a physical plan, which consists of a sequence of map-reduce jobs that contain physical operators. Some of the optimizations are chosen using rules within pig, such as the use of the combiner to reduce the data size of the map output, based on whether the user defined functions are distributive and algebraic. Other optimizations are chosen by the user; for example, the user can specify the join algorithm to be used.
  11. As your website grows, the number of unique users grows beyond what you can keep in memory. A given map only gets input from a given input source. It can therefore annotate tuples from that source with information on which source they came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows pig to buffer the tuples from one side of the join in memory for each key, and then use them as a probe table as the tuples from the other input stream by.
  12. As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited only by a few users. First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have so many values that we estimate we cannot hold all of them in memory. It’s about holding the values in memory, not the key. Then, at partitioning time, those keys are handled specially. All other keys are treated as in the regular join. The selected keys from input1 are split across multiple reducers. For input2, they are replicated to each of the reducers that received part of the split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
  13. Now let’s say that for some reason you start keeping both your page view data and your user data sorted by user. Note that one way to do this is to make sure that pages and users are partitioned the same way. But this leads to a big problem. In order to make sure you can join all your data sets, you end up using the same hash function to join them all. But rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig’s implementation doesn’t depend on how the data is split. Pig does this by sampling one of the inputs and then building an index from that sample that indicates the key of the first record in every split. The other input is used as the standard input file for Hadoop and is split across the maps as normal. When a map begins processing this file and encounters the first key in it, it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins doing a join on the two data sources.
  14. Now let’s say that one of the inputs, users in this case, is small enough to fit into the memory available to your map tasks. In that case, replicated join can be used to do the join in the map itself. The large input will be used as the hadoop input to the map-reduce job, and the smaller input will be loaded into memory to do the join.
  15. Very often, queries perform the same set of initial operations. In such cases, the initial steps can be shared. Scan and de-serialization time can dominate the runtime in group-by queries, so sharing the initial operations can result in a nearly linear speed up of queries.
  16. In this case multiple pipelines are needed in the Map and Reduce phases. Due to our pull based execution model, the split and multiplex operators embed the pipelines within themselves. Records are tagged with the pipeline number in the map stage. Grouping is done by Hadoop using a union of the keys. The multiplex operator on the reducer places incoming records in the correct pipeline.
  17. Pig supports bags of objects. Group and cogroup produce bags, as do some other cases such as distinct, or udfs that want to be able to access the bag as a whole (if they don’t use the accumulate interface). Managing memory in java is hard. First, we created a MemoryManager that each large bag would register with, and the memory manager would register with the jvm for low memory notification. When memory is low, the memory manager would spill the large bags to disk. But sometimes the notification was too late. Now we use bags that spill to disk every time their estimated size hits a configurable limit. The spill mechanism is different for distinct-bags: it involves sorting before writing to disk.
  18. A list of some of the optimizations that are currently being worked on, and some ideas for the future. With the self-limiting bags, we are seeing fewer memory problems. But multiple bags in a query don’t have a shared limit.
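
Note 10 says the combiner can be used when the user defined functions are algebraic. The following is a minimal Python sketch of that idea for an average, not Pig's actual code. The three function names (`initial`, `intermediate`, `final`) and the in-process `run` driver are assumptions made for illustration.

```python
from collections import defaultdict

def initial(value):
    # Map side: turn one raw value into a partial (sum, count) pair.
    return (value, 1)

def intermediate(partials):
    # Combiner: merge partial pairs into one pair. It is safe to apply
    # this zero or more times, which is what makes the function algebraic.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def final(partials):
    # Reduce side: merge whatever partials are left and emit the average.
    total, count = intermediate(partials)
    return total / count

def run(map_outputs):
    """map_outputs holds one list of (key, value) pairs per simulated map task."""
    shuffled = defaultdict(list)
    for records in map_outputs:
        per_key = defaultdict(list)
        for key, value in records:
            per_key[key].append(initial(value))
        # The combiner shrinks each map's output to one pair per key.
        for key, partials in per_key.items():
            shuffled[key].append(intermediate(partials))
    return {key: final(partials) for key, partials in shuffled.items()}

averages = run([[("a", 10), ("a", 20), ("b", 5)], [("a", 30), ("b", 15)]])
```

Only one pair per key leaves each simulated map task, which is the data reduction the note refers to.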
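
The tagging scheme in note 11 can be sketched in a few lines of Python. This is an illustrative, single-process simulation under stated assumptions, not Pig's implementation: the helper names (`map_tag`, `shuffle`, `reduce_join`) are hypothetical, and source id 1 is assumed to be the side that gets buffered.

```python
from collections import defaultdict

def map_tag(source_id, records, key_index):
    # Each map reads a single input, so it can tag every tuple with its source.
    for rec in records:
        yield (rec[key_index], source_id), rec

def shuffle(tagged, num_reducers):
    # Partition on the join key only, then sort on (join key, source id).
    partitions = defaultdict(list)
    for (key, source_id), rec in tagged:
        partitions[hash(key) % num_reducers].append(((key, source_id), rec))
    for part in partitions.values():
        part.sort(key=lambda kv: kv[0])
        yield part

def reduce_join(partition):
    # For each key, source 1 arrives first: buffer it, then let the
    # tuples of source 2 stream by and probe the buffer.
    current_key, buffered = None, []
    for (key, source_id), rec in partition:
        if key != current_key:
            current_key, buffered = key, []
        if source_id == 1:
            buffered.append(rec)
        else:
            for left in buffered:
                yield left + rec

users = [("fred", 22), ("jane", 19)]
pages = [("fred", "/a"), ("jane", "/b"), ("fred", "/c")]
tagged = list(map_tag(1, users, 0)) + list(map_tag(2, pages, 0))
joined = sorted(row for part in shuffle(tagged, 2) for row in reduce_join(part))
```

Because the sort key includes the source id, the reducer never has to hold both sides of a key in memory at once.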
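
The partitioning step in note 12 might look roughly like the Python sketch below. It simplifies in ways the talk does not specify: the whole first input is used as the "sample", a fixed count threshold decides which keys are skewed, and skewed keys are spread over all reducers rather than a chosen subset. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import count

def find_skewed_keys(sample, threshold):
    # Sampling pass: keys seen more than `threshold` times are treated as skewed.
    counts = Counter(key for key, _ in sample)
    return {key for key, n in counts.items() if n > threshold}

def partition(input1, input2, skewed, num_reducers):
    reducers = [{"left": defaultdict(list), "right": defaultdict(list)}
                for _ in range(num_reducers)]
    round_robin = count()
    for key, value in input1:
        if key in skewed:
            target = next(round_robin) % num_reducers  # split across reducers
        else:
            target = hash(key) % num_reducers          # regular join
        reducers[target]["left"][key].append(value)
    for key, value in input2:
        if key in skewed:
            targets = range(num_reducers)              # replicate to every split
        else:
            targets = [hash(key) % num_reducers]
        for target in targets:
            reducers[target]["right"][key].append(value)
    return reducers

def join(reducers):
    # Each reducer joins only the tuples it received.
    for reducer in reducers:
        for key, lefts in reducer["left"].items():
            for left in lefts:
                for right in reducer["right"].get(key, []):
                    yield (key, left, right)

pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 22), ("jane", 19)]
skewed = find_skewed_keys(pages, threshold=2)
reducers = partition(pages, users, skewed, num_reducers=2)
rows = sorted(join(reducers))
```

Every `fred` page meets the `fred` user exactly once, even though the pages are spread over two reducers.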
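
The index lookup and forward seek in note 13 can be illustrated with a small Python sketch. It assumes both inputs are already sorted by the join key, models each file split as an in-memory list, and assumes every split is non-empty; `build_index` and `merge_join` are hypothetical names, not Pig operators.

```python
from bisect import bisect_left

def build_index(splits):
    # Record the key of the first record in every split of the indexed input.
    return [split[0][0] for split in splits]

def merge_join(left_split, right_splits, index):
    """Join one sorted split of the streamed input against the indexed input."""
    if not left_split:
        return
    first_key = left_split[0][0]
    # Open the last split whose first key is strictly smaller than first_key,
    # so matches that straddle a split boundary are not missed.
    start = max(bisect_left(index, first_key) - 1, 0)
    right = (rec for split in right_splits[start:] for rec in split)
    right_rec = next(right, None)
    i = 0
    while i < len(left_split) and right_rec is not None:
        key = left_split[i][0]
        if right_rec[0] < key:
            right_rec = next(right, None)  # seek forward
        elif right_rec[0] > key:
            i += 1
        else:
            matches = []
            while right_rec is not None and right_rec[0] == key:
                matches.append(right_rec)
                right_rec = next(right, None)
            while i < len(left_split) and left_split[i][0] == key:
                for match in matches:
                    yield left_split[i] + match
                i += 1

user_splits = [[("aaron", 30), ("amr", 25)],
               [("amy", 41), ("barb", 37)],
               [("carl", 28), ("zach", 19)]]
index = build_index(user_splits)
page_split = [("amy", "/x"), ("barb", "/y"), ("barb", "/z")]
merged = list(merge_join(page_split, user_splits, index))
```

The map task never reads the third split of the indexed input, because the streamed split is exhausted before the merge reaches it.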
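
The map-side join in note 14 reduces to a hash table lookup. A minimal Python sketch, assuming the small input fits in memory; the function names are illustrative only:

```python
from collections import defaultdict

def load_small_input(records, key_index):
    # Done once per map task: build an in-memory table from the small input.
    table = defaultdict(list)
    for rec in records:
        table[rec[key_index]].append(rec)
    return table

def map_join(large_split, small_table, key_index):
    # Each record of the large input probes the table; no reduce phase is needed.
    for rec in large_split:
        for match in small_table.get(rec[key_index], []):
            yield rec + match

users = [("aaron", 30), ("zach", 19)]
page_split = [("aaron", "/a"), ("amr", "/b"), ("zach", "/c"), ("aaron", "/d")]
user_table = load_small_input(users, key_index=0)
replicated = list(map_join(page_split, user_table, key_index=0))
```

Records with no match in the small input (`amr` here) are simply dropped, as in an inner join.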
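
The tagging and multiplexing described in note 16 can be sketched for the two group-by pipelines of the multi-store script. This is a simplified single-process Python illustration, not Pig's split and multiplex operators; the pipeline numbers 0 and 1 and the function names are assumptions.

```python
from collections import defaultdict

def map_phase(records):
    # Shared load and filter, followed by a split into two tagged pipelines.
    for name, age, gender, city, state in records:
        if name is None:
            continue
        yield (0, (age, gender)), 1  # pipeline 0: group by age, gender
        yield (1, state), 1          # pipeline 1: group by state

def reduce_phase(tagged):
    # Grouping happens on the union of keys; the pipeline tag keeps them apart.
    grouped = defaultdict(int)
    for key, one in tagged:
        grouped[key] += one
    # Multiplex: route each group to the output of its own pipeline.
    outputs = {0: {}, 1: {}}
    for (pipeline, group), n in grouped.items():
        outputs[pipeline][group] = n
    return outputs

people = [("fred", 22, "m", "sf", "CA"),
          ("jane", 19, "f", "la", "CA"),
          (None, 30, "m", "ny", "NY"),
          ("amy", 22, "m", "ny", "NY")]
outputs = reduce_phase(map_phase(people))
```

The input is scanned and filtered once, yet both counts (`bydemo` and `bystate` in the script) come out of the same job.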
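
The self-limiting bag from note 17 can be approximated in Python as a container that spills whenever its estimated size reaches a limit. This is a rough sketch under assumptions: `SpillableBag` is a hypothetical class, `sys.getsizeof` stands in for Pig's size estimate, and pickled temporary files stand in for its spill format.

```python
import pickle
import sys
import tempfile

class SpillableBag:
    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.in_memory = []
        self.estimated_size = 0
        self.spill_files = []

    def add(self, item):
        self.in_memory.append(item)
        # Shallow size only; a crude estimate, used here for illustration.
        self.estimated_size += sys.getsizeof(item)
        if self.estimated_size >= self.limit_bytes:
            self.spill()

    def spill(self):
        # Write the in-memory items to a temporary file and free the list.
        spill_file = tempfile.TemporaryFile()
        for item in self.in_memory:
            pickle.dump(item, spill_file)
        self.spill_files.append(spill_file)
        self.in_memory = []
        self.estimated_size = 0

    def __iter__(self):
        # Replay the spilled items first, then the items still in memory.
        for spill_file in self.spill_files:
            spill_file.seek(0)
            while True:
                try:
                    yield pickle.load(spill_file)
                except EOFError:
                    break
        yield from self.in_memory

bag = SpillableBag(limit_bytes=200)
for i in range(50):
    bag.add(("user%d" % i, i))
contents = list(bag)
```

Each bag enforces its own limit, so several bags in one query can still add up to more memory than intended, which is the limitation note 18 mentions.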