Apache Pig performance optimizations talk at ApacheCon 2010


Thejas Nair


What is Pig?
Pig Latin, a high level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;

Comparison with MR in Java
1/20 the lines of code
1/16 the development time
What about performance?

Pig Compared to Map Reduce

And, You Don’t Lose Power

Pig optimization principles

Logical Optimizations
Script: A = load; B = foreach; C = filter
Parser -> Logical Plan: A -> B -> C
Logical Optimizer -> Optimized L. Plan: A -> C -> B

Physical Optimizations
Optimized L. Plan: X -> Y -> Z
Translator -> Phy/MR Plan: M(PX-PYm) R(PYr) -> M(Z)
Optimizer -> Optimized Phy/MR Plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)

Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]

Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name); both maps are labeled SP. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]

Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: Pages and Users both run from aaron to zach. Map 1 reads the Pages split aaron ... amr and reads Users starting at aaron; Map 2 reads the Pages split amy ... barb and reads Users starting at amy.]

Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Pages is the large input and Users the small one, both running from aaron to zach. Map 1 reads the Pages split aaron ... amr and Map 2 reads the Pages split amy ... barb; each map holds a full copy of Users (aaron ... zach).]

Group/cogroup optimizations
[Diagram: Pages holds aaron, aaron, barney, carol, ..., zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol onward.]

Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram: A: load -> B: filter, which then branches into two pipelines. One is C1: group -> eval udf -> store into 'bydemo'; the other is C2: group -> eval udf -> store into 'bystate'.]

Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange, local rearrange
reduce: multiplex -> package -> foreach, package -> foreach

Memory Management

Other optimizations

Future optimization work

Pig - fast and flexible
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Further reading

Title slide: How to make your map-reduce jobs perform as well as Pig: lessons from Pig optimizations. http://pig.apache.org Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member

Editor's Notes

Pig performance has been improving because of the optimizations that keep getting added. These optimizations can be applied to other map-reduce programs as well. We will begin with a very brief introduction to Pig, and then discuss the query optimization strategies and techniques used in Pig.
There are two aspects of Pig: Pig Latin, the language, and the execution engine.
This is an example of what a Pig script looks like. Each statement defines a relation, and the left-hand side of the statement is the name assigned to that relation. The first statement loads the user information, which can be a file on HDFS, and names the first two columns name and age. The second statement filters the user information based on age. The third statement loads the pages data, where the first two columns are user and url. The last statement joins the filtered user data and the page data on the user name.
But why Pig and Pig Latin? Why not just use Java MR? This is what we found: for a query, writing the problem in Pig Latin meant that the code had 1/20 the number of lines and took only 1/16 the development time. But there must be some payoff for all the hard work put into writing the Java MapReduce code. What about performance? There is some overhead from the pipeline of operators and the function calls in the MR plan generated from Pig Latin, but the runtime is usually within 20% of the runtime of the map-reduce code. And if the task involves more complex operations, such as a join on skewed data, chances are high that the Pig query will beat the MapReduce job's runtime by a large margin.
Data flow: You can write your data flows in a high level language (Pig Latin) instead of a low level language (Java) that is really meant for logic flow.
Standard operations: Much less code to write. No need to maintain libraries of your own relational operations.
Managing details of MR: No need to worry about how many map-reduce jobs to decompose your work into. No need to manage data flow, fault tolerance, etc. across that set of map-reduce jobs.
UDFs = User Defined Functions.
Metadata: Metadata is not required, but it is supported and used when available. This means there is no need to create tables, define schemas, etc. Any file on HDFS can be read.
Data model: Pig does not impose a data model on you. It works with structured or unstructured data, flat or nested data. An example of unstructured data is web pages; an example of structured data is database records.
Nested data: Scalar and nested data types are supported. Nested data might be a list of maps or a list of records inside another record.
Procedural: Fine-grained control; one line equals one action. No need to depend on an optimizer to choose actions in the (hopefully) best order for you. A Pig program describes a data flow graph.
Where does Pig stand compared to Java MR in terms of performance? We have what we call PigMix, a set of queries used to test Pig performance from release to release. It measures the performance gap between direct use of map-reduce and using Pig. Performance has steadily improved across releases, and we have had 7 releases in roughly the last two years, since Pig became part of Apache. In the next version, 0.8, which will be out in a few days, the ratio is around 0.9. The map-reduce queries in PigMix don't have all the optimizations that are present in Pig, because implementing them involves a lot of effort. Not all Pig optimizations are tested in PigMix either. One example is skew join, which enables joining tables where there are a large number of records for some values of the join key; a naive implementation of join in map-reduce will run out of memory. So PigMix tells only part of the story. http://wiki.apache.org/pig/PigMix
Relational databases have a lot of optimizations for improving the query execution strategy. What makes Pig different? A traditional DBMS searches for an optimal execution plan over models of the data, the operators and the execution environment. But systems such as Pig are used in environments where accurate models are not available a priori. The data is usually in files, for ease of interoperability with other tools. Operator costs can vary because of user defined functions and custom binaries or map-reduce jobs. Large clusters can have unreliable machines, can be made of heterogeneous machines, and can carry different loads. So Pig uses available information such as file sizes (e.g. to consolidate small files into larger ones). It trusts the user to know data properties: since Pig can operate in the absence of metadata, the user tells Pig whether it should use optimizations that work on sorted data. It uses rules that should help in most cases, e.g. pushing a filter up early in the plan is likely to reduce data. Runtime information is used in the query plan: data is sampled for order-by queries and some joins. There is potential to use information from intermediate data processing steps. Olston et al., "Automatic Optimization of Parallel Dataflow Programs" http://infolab.stanford.edu/~olston/publications/usenix08.pdf
There are two stages of optimization: logical and physical. During the logical optimization stage, the graph of dataflow operations specified through the Pig query is restructured. Filtering and projecting ahead of more expensive operations is likely to reduce cost. Multiple foreach and filter statements can be combined. Some operators can potentially be rewritten, e.g. cross + filter can be converted to a join in some cases.
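To make the filter-pushing rule concrete, here is a minimal Python sketch of one rewrite pass over a toy logical plan. It is not Pig's optimizer: the tuple-based plan representation and the function name push_filters_up are assumptions made for illustration, and the toy foreach only projects columns, which is what makes the swap safe.

```python
# Hypothetical mini logical plan: each operator is a (kind, detail) tuple.
# "foreach" keeps a set of columns; "filter" tests a single column.

def push_filters_up(plan):
    """Move each filter ahead of a preceding foreach when that is safe.

    The swap is safe here only because the toy foreach merely projects
    columns: if it keeps the filtered column, the filter sees the same values
    either way.
    """
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            kind, detail = plan[i]
            prev_kind, prev_detail = plan[i - 1]
            if (kind == "filter" and prev_kind == "foreach"
                    and detail["column"] in prev_detail["keep"]):
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan


script_plan = [
    ("load", {"path": "users"}),             # A = load
    ("foreach", {"keep": {"name", "age"}}),  # B = foreach
    ("filter", {"column": "age"}),           # C = filter
]
optimized = push_filters_up(script_plan)
print([kind for kind, _ in optimized])
```

The result matches the A -> C -> B ordering on the slide: the filter now runs before the projection, so fewer records reach the later operators.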
The logical plan is compiled into a physical plan, which consists of a sequence of map-reduce jobs that contain physical operators. Some of the optimizations are chosen using rules within Pig, such as the use of a combiner to reduce the size of the map output, based on whether the user defined functions are distributive and algebraic. Other optimizations are chosen by the user; for example, the user can specify the join algorithm to be used.
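The combiner rule can be sketched as follows, assuming an average computed in three stages. The function names initial, intermediate and final are hypothetical and only loosely mirror the idea of an algebraic function; the point is that partial (sum, count) pairs can be merged on the map side before the shuffle.

```python
from collections import defaultdict

def initial(value):
    """Map side: turn one record into a partial (sum, count)."""
    return (value, 1)

def intermediate(partials):
    """Combiner: merge partials into one partial; safe to apply repeatedly."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

def final(partials):
    """Reduce side: merge the remaining partials and produce the average."""
    total, count = intermediate(partials)
    return total / count

def run_map_with_combiner(records):
    """One map task: group locally, then emit a single partial per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(initial(value))
    return {key: intermediate(parts) for key, parts in groups.items()}


map_outputs = [
    run_map_with_combiner([("fred", 10), ("fred", 20), ("jane", 5)]),
    run_map_with_combiner([("fred", 30), ("jane", 15)]),
]

# Shuffle: only one partial per key per map crosses the network.
shuffled = defaultdict(list)
for output in map_outputs:
    for key, partial in output.items():
        shuffled[key].append(partial)

averages = {key: final(parts) for key, parts in shuffled.items()}
print(averages)
```

In this toy run five input records shrink to four shuffled partials; with many records per key in each map the reduction is far larger.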
As your website grows, the number of unique users grows beyond what you can keep in memory. A given map only gets input from a given input source, so it can annotate tuples from that source with information on which source they came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows Pig to buffer one side of the join keys in memory and then use that as a probe table as keys from the other input stream by.
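A minimal Python simulation of this tagging scheme is shown below. It is a sketch under assumptions, not Pig's implementation: the source ids, the byte-sum partition function and the choice to buffer the Users side are all picked for illustration.

```python
from collections import defaultdict
from itertools import groupby

USERS, PAGES = 1, 2  # hypothetical source ids; the lower id sorts first

def map_tag(records, source_id):
    """Map side: annotate each (join_key, payload) with its input source."""
    return [((key, source_id), payload) for key, payload in records]

def partition(tagged, num_reducers):
    """Partition on the join key only, so both sources of a key meet."""
    parts = defaultdict(list)
    for (key, source_id), payload in tagged:
        parts[sum(key.encode()) % num_reducers].append(((key, source_id), payload))
    return parts

def reduce_join(tagged):
    """Reduce side: sort by (key, source), buffer Users, stream Pages past."""
    out = []
    tagged = sorted(tagged, key=lambda kv: kv[0])
    for key, group in groupby(tagged, key=lambda kv: kv[0][0]):
        buffered = []  # Users payloads for this key arrive first
        for (_, source_id), payload in group:
            if source_id == USERS:
                buffered.append(payload)
            else:
                out.extend((key, user, payload) for user in buffered)
    return out


users = [("fred", 25), ("jane", 19)]
pages = [("fred", "p1"), ("fred", "p2"), ("jane", "p3"), ("zed", "p9")]
parts = partition(map_tag(users, USERS) + map_tag(pages, PAGES), num_reducers=2)
joined = sorted(row for part in parts.values() for row in reduce_join(part))
print(joined)
```

Because the sort puts the buffered source first within each key, the reducer never has to hold the streamed side in memory.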
As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited by only a few users. First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have enough values that we estimate we cannot hold all of the values in memory. It is about holding the values in memory, not the key. Then at partitioning time, those keys are handled specially. All other keys are treated as in the regular join. The selected keys from input1 are split across multiple reducers. For input2, they are replicated to each of the reducers that received a part of the split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
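The split-and-replicate idea can be sketched in Python as below. This is an illustration under assumptions rather than Pig's partitioner: the whole input stands in for the sample, the threshold max_values_per_key is arbitrary, and hot keys are spread round-robin over all reducers.

```python
from collections import Counter, defaultdict

def find_hot_keys(sample, max_values_per_key):
    """Sampling pass: keys with more values than we assume fit in memory."""
    counts = Counter(key for key, _ in sample)
    return {key for key, n in counts.items() if n > max_values_per_key}

def skew_partition(pages, users, hot_keys, num_reducers):
    """Split hot keys of input1 (pages); replicate input2 (users) for them."""
    reducers = defaultdict(lambda: {"pages": [], "users": []})
    next_slot = Counter()
    for key, url in pages:
        if key in hot_keys:
            r = next_slot[key] % num_reducers  # round-robin split
            next_slot[key] += 1
        else:
            r = sum(key.encode()) % num_reducers
        reducers[r]["pages"].append((key, url))
    for key, age in users:
        if key in hot_keys:
            targets = range(num_reducers)  # replicate to every split
        else:
            targets = [sum(key.encode()) % num_reducers]
        for r in targets:
            reducers[r]["users"].append((key, age))
    return reducers

def local_join(bucket):
    """Per-reducer join of whatever landed in this bucket."""
    users_by_key = defaultdict(list)
    for key, age in bucket["users"]:
        users_by_key[key].append(age)
    return [(key, url, age)
            for key, url in bucket["pages"]
            for age in users_by_key[key]]


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 19)]
hot = find_hot_keys(pages, max_values_per_key=2)
buckets = skew_partition(pages, users, hot, num_reducers=2)
joined = sorted(row for bucket in buckets.values() for row in local_join(bucket))
print(hot, joined)
```

Each fred page view lands on exactly one reducer while the fred user record is present on both, so every match is produced once and the load for the hot key is shared.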
Now let's say that for some reason you start keeping both your page view data and user data sorted by user. Note that one way to do this is to make sure that pages and users are partitioned the same way. But this leads to a big problem. In order to make sure you can join all your data sets, you end up using the same hash function to partition them all. But rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig's implementation doesn't depend on how the data is split. Pig does this by sampling one of the inputs and then building an index from that sample that indicates the key for the first record in every split. The other input is used as the standard input file for Hadoop and is split to the maps as normal. When a map begins processing this file and encounters the first key, it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins doing a join on the two data sources.
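The index lookup and forward seek can be sketched in Python like this. It is a simplified illustration: in-memory lists stand in for files, list offsets stand in for file positions, and the names build_index and merge_join_split are hypothetical.

```python
from bisect import bisect_left

def build_index(sorted_rows, split_size):
    """Record (first key, offset) for every split of the indexed input."""
    return [(sorted_rows[i][0], i) for i in range(0, len(sorted_rows), split_size)]

def merge_join_split(left_split, right_rows, index):
    """Join one sorted split of the left input against the indexed right input."""
    if not left_split or not index:
        return []
    first_keys = [key for key, _ in index]
    # Start at the last split whose first key is below our first key;
    # that split may still contain the key we are looking for.
    pos = bisect_left(first_keys, left_split[0][0])
    j = index[max(pos - 1, 0)][1]
    out = []
    for key, left_val in left_split:
        while j < len(right_rows) and right_rows[j][0] < key:
            j += 1  # seek forward to the first candidate row
        k = j
        while k < len(right_rows) and right_rows[k][0] == key:
            out.append((key, left_val, right_rows[k][1]))
            k += 1
    return out


pages = [("aaron", "p1"), ("amr", "p2"), ("amy", "p3"), ("barb", "p4"), ("zach", "p5")]
users = [("aaron", 30), ("amy", 22), ("barb", 41), ("carl", 35), ("zach", 28)]
index = build_index(users, split_size=2)

# Two "map tasks", each handling one split of pages.
joined = merge_join_split(pages[:2], users, index) + merge_join_split(pages[2:], users, index)
print(joined)
```

Neither input is shuffled or re-sorted; each map only needs the small index plus a sequential read of the relevant part of the other input.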
Now let's say that one of the inputs, users in this case, is small enough to fit into the memory available to your map tasks. In that case, a replicated join can be used to do the join in the map itself. The large input is used as the Hadoop input to the map-reduce job, and the smaller input is loaded into memory to do the join.
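A map-side join of this kind can be sketched in a few lines of Python. The function names are hypothetical and plain lists stand in for HDFS splits; the sketch only shows the shape of the technique.

```python
from collections import defaultdict

def load_small_input(users):
    """Every map task loads the whole small input into a hash table."""
    table = defaultdict(list)
    for name, age in users:
        table[name].append(age)
    return table

def map_task(pages_split, users_table):
    """Stream one split of the large input and probe the in-memory table."""
    for user, url in pages_split:
        for age in users_table.get(user, []):
            yield (user, url, age)


users = [("aaron", 30), ("amy", 22), ("zach", 28)]
pages_splits = [
    [("aaron", "p1"), ("amr", "p2")],  # split read by map 1
    [("amy", "p3"), ("barb", "p4")],   # split read by map 2
]
joined = [row
          for split in pages_splits
          for row in map_task(split, load_small_input(users))]
print(joined)
```

No reduce phase is needed, which is where the saving comes from; the cost is that every map pays to load the small input.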
Very often, queries perform the same set of initial operations. In such cases, the initial steps can be shared. Scan and deserialization time can dominate the runtime in group-by queries, so sharing the initial operations can result in a nearly linear speedup of queries.
In this case multiple pipelines are needed in the map and reduce phases. Due to our pull-based execution model, split and multiplex embed the pipelines within themselves. Records are tagged with the pipeline number in the map stage. Grouping is done by Hadoop using a union of the keys. The multiplex operator on the reducer places incoming records in the correct pipeline.
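The tagging and multiplexing steps can be sketched in Python for the multi-store script from the slides. This is a simplified model, assuming dictionary rows and a COUNT in each pipeline; the pipeline numbers and function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(rows):
    """Shared load and filter, then split into two tagged pipelines."""
    for row in rows:
        if row["name"] is None:  # the filter runs once for both pipelines
            continue
        yield ((1, (row["age"], row["gender"])), 1)  # pipeline 1: by age, gender
        yield ((2, row["state"]), 1)                 # pipeline 2: by state

def shuffle(pairs):
    """Group on the union of keys; the pipeline tag is part of the key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Multiplex: route each group to its pipeline, then apply COUNT."""
    outputs = {1: {}, 2: {}}
    for (pipeline, key), values in grouped.items():
        outputs[pipeline][key] = sum(values)
    return outputs


rows = [
    {"name": "fred", "age": 25, "gender": "m", "state": "CA"},
    {"name": "jane", "age": 25, "gender": "f", "state": "CA"},
    {"name": None, "age": 40, "gender": "m", "state": "NY"},
    {"name": "amy", "age": 25, "gender": "f", "state": "NY"},
]
outputs = reduce_phase(shuffle(map_phase(rows)))
by_demo, by_state = outputs[1], outputs[2]
print(by_demo, by_state)
```

The input is scanned and filtered once, yet both stores get their own result, which is the saving the shared plan is after.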
Pig supports bags of objects. Group and cogroup produce bags, and in some cases, such as distinct or UDFs that want to access the bag as a whole (if they don't use the Accumulator interface), the whole bag has to be available at once. Managing memory in Java is hard. First, we created a MemoryManager that each large bag would register with, and the memory manager would register with the JVM for low memory notification. When memory is low, the memory manager would spill the large bags to disk. But sometimes the notification came too late. Now we use bags that spill to disk every time their estimated size hits a configurable limit. The spill mechanism is different for distinct bags; it involves sorting first before writing to disk.
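The self-limiting bag can be sketched in Python as follows. This is a toy model, not Pig's bag classes: the class name SpillableBag, the use of sys.getsizeof as the size estimate and pickle as the on-disk format are all assumptions made for illustration.

```python
import pickle
import sys
import tempfile

class SpillableBag:
    """A bag that spills to disk when its estimated size hits a limit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_memory = []
        self.estimated_bytes = 0
        self.spill_files = []

    def add(self, item):
        self.in_memory.append(item)
        self.estimated_bytes += sys.getsizeof(item)  # rough estimate only
        if self.estimated_bytes >= self.max_bytes:
            self._spill()

    def _spill(self):
        """Write the in-memory part to a temp file and start a fresh buffer."""
        spill = tempfile.TemporaryFile()
        pickle.dump(self.in_memory, spill)
        self.spill_files.append(spill)
        self.in_memory = []
        self.estimated_bytes = 0

    def __iter__(self):
        """Read spilled chunks back one at a time, then the in-memory tail."""
        for spill in self.spill_files:
            spill.seek(0)
            yield from pickle.load(spill)
        yield from self.in_memory


bag = SpillableBag(max_bytes=200)
for i in range(20):
    bag.add(("fred", i))
print(len(bag.spill_files), len(list(bag)))
```

The bag decides for itself when to spill, so it does not depend on a low-memory notification arriving in time. A distinct bag would sort each chunk before writing it, which this sketch leaves out.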
This is a list of some of the optimizations currently being worked on, and some ideas for the future. With the self-limiting bags, we are seeing fewer memory problems. But multiple bags in a query don't have a shared limit.
