Pig performance keeps improving as new optimizations are added, and many of these optimizations can be applied to other map-reduce programs as well. We will begin with a very brief introduction to Pig, and then discuss the query optimization strategies and techniques used in Pig.
There are two aspects of Pig: Pig Latin, the language, and the execution engine.
This is an example of what a Pig script looks like. Each statement defines a relation, and the left-hand side of the statement is the name assigned to that relation. The first statement loads the user information, which can be a file on HDFS, and names the first two columns name and age. The second statement filters the user information based on age. The third statement loads the pages data, where the first two columns are user and url. The last statement joins the filtered user data and the page data on the user name.
But why Pig and Pig Latin? Why not just use Java MR? This is what we found: for a query, writing the problem in Pig Latin meant that your code had 1/20 the number of lines, and it took only 1/16 the development time. But there must be some payoff for all the hard work that goes into writing the Java MapReduce code. What about performance? There is some overhead from the pipeline of operators and the function calls in the MR plan generated from Pig Latin, but the runtime is usually within 20% of the runtime of the hand-written map-reduce code. And if the task involves more complex operations, such as a join on skewed data, chances are high that the Pig query will beat the MapReduce job runtime by a large margin.
Data flow: You can write your data flows in a high-level language (Pig Latin) instead of a low-level language (Java) that is really meant for logic flow.
Standard operations: Much less code to write. No need to maintain libraries of your own relational operations.
Managing details of MR: No need to worry about how many map-reduce jobs to decompose your work into. No need to manage data flow, fault tolerance, etc. across that set of map-reduce jobs.
UDFs = User Defined Functions.
Metadata: Metadata is not required, but it is supported and used when available. This means there is no need to create tables, define schemas, etc. Any file on HDFS can be read.
Data model: Pig does not impose a data model on you. It works with structured or unstructured data, flat or nested data. An example of unstructured data is web pages; an example of structured data is database records.
Nested data: Scalar and nested data types are supported. Nested data might be a list of maps or a list of records inside another record.
Procedural: Fine-grained control; one line equals one action. No need to depend on an optimizer to choose actions in the (hopefully) best order for you. A Pig program describes a data flow graph.
Where does Pig stand compared to Java MR in terms of performance? We have what we call PigMix, a set of queries used to test Pig performance from release to release. It measures the performance gap between direct use of map-reduce and using Pig. Performance has steadily improved across releases, and we have had 7 releases in roughly the last two years, since Pig became part of Apache. In the next version, 0.8, which will be out in a few days, the ratio is around 0.9. The map-reduce queries in PigMix don't have all the optimizations that are present in Pig, because implementing them involves a lot of effort. Also, not all Pig optimizations are tested in PigMix. One example is the skew join in Pig, which enables joining tables where there is a large number of records for some values of the join key. The naïve implementation of such a join in map-reduce will run out of memory. So PigMix tells only part of the story. http://wiki.apache.org/pig/PigMix
Relational databases have many optimizations for improving the query execution strategy. What makes Pig different? A traditional DBMS searches for an optimal execution plan over models of the data, the operators and the execution environment. Systems such as Pig, however, are used in environments where accurate models are not available a priori. The data is usually in files, for ease of interoperability with other tools. Operator costs can vary because of user defined functions and custom binaries or map-reduce jobs. Large clusters can have unreliable machines, can be made of heterogeneous machines, and can be under different loads.
So Pig follows a few principles. Use available reliable information, such as file sizes (e.g. consolidate small files into larger ones). Trust the user to know the data properties: since Pig can operate in the absence of metadata, the user tells Pig whether it should use optimizations that work on sorted data. Use rules that should help in most cases, e.g. pushing a filter up early in the plan is likely to reduce the data. Use runtime information in the query plan: data is sampled for order-by queries and some joins, and there is potential to use information from intermediate data processing steps.
Olston et al., "Automatic Optimization of Parallel Dataflow Programs" http://infolab.stanford.edu/~olston/publications/usenix08.pdf
There are two stages of optimization: logical and physical. During the logical optimization stage, the graph of dataflow operations specified by the Pig query is restructured. Filtering and projecting ahead of more expensive operations is likely to reduce cost. Multiple foreach and filter statements can be combined. Some operators can potentially be rewritten, e.g. cross + filter can be converted to a join in some cases.
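The "apply filter early" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's optimizer: the `Op` class, its `uses` and `computes` fields, and `push_filters_early` are hypothetical names, and the only safety condition modeled is that a filter may move ahead of a foreach when the foreach does not compute any column the filter reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical logical operator: just enough detail to reorder a linear plan."""
    kind: str                          # "load", "foreach" or "filter"
    name: str                          # relation alias, e.g. "A"
    uses: frozenset = frozenset()      # columns a filter reads
    computes: frozenset = frozenset()  # columns a foreach creates or changes


def push_filters_early(plan):
    """Move each filter ahead of any foreach that does not compute its columns."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            if (first.kind == "foreach" and second.kind == "filter"
                    and not (second.uses & first.computes)):
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


plan = [
    Op("load", "A"),
    Op("foreach", "B", computes=frozenset({"name_upper"})),
    Op("filter", "C", uses=frozenset({"age"})),
]
optimized = push_filters_early(plan)
print(" -> ".join(op.name for op in optimized))  # A -> C -> B
```

A filter that reads a column produced by the foreach stays where it is, which is why the rule is safe to apply without any knowledge of the data.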
The logical plan is compiled into a physical plan, which consists of a sequence of map-reduce jobs that contain physical operators. Some of the optimizations are chosen using rules within Pig, such as the use of the combiner to reduce the size of the map output, based on whether the user defined functions are distributive and algebraic. Other optimizations are chosen by the user; for example, the user can specify the join algorithm to be used.
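Why an algebraic function allows the combiner to be used can be shown with an average computed in three stages. The sketch below is plain Python rather than Pig code; the function names `initial`, `intermed` and `final` are chosen for illustration, and the two in-memory "splits" stand in for map tasks.

```python
from collections import defaultdict


def initial(value):
    """Map side: turn one value into a partial (sum, count)."""
    return (value, 1)


def intermed(partials):
    """Combiner: merge partials into one partial; can be applied any number of times."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))


def final(partials):
    """Reduce side: merge the remaining partials and produce the average."""
    total, count = intermed(partials)
    return total / count


def run_map_with_combiner(records):
    """Simulate one map task whose output is shrunk to one partial per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(initial(value))
    return {key: intermed(parts) for key, parts in groups.items()}


def run_reduce(map_outputs):
    """Simulate the reduce phase over the combined output of all map tasks."""
    groups = defaultdict(list)
    for output in map_outputs:
        for key, partial in output.items():
            groups[key].append(partial)
    return {key: final(parts) for key, parts in groups.items()}


split1 = [("fred", 20), ("fred", 30), ("amy", 40)]
split2 = [("fred", 40), ("amy", 20)]
averages = run_reduce([run_map_with_combiner(split1), run_map_with_combiner(split2)])
print(averages)
```

Each map task ships one (sum, count) pair per key instead of every value, which is the reduction in map output size that the combiner rule is after.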
As your website grows, the number of unique users grows beyond what you can keep in memory. A given map only gets input from a given input source, so it can annotate each tuple with the source it came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows Pig to buffer the records for a join key from one side in memory, and then use them as a probe table as the records with that key from the other input stream by.
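The tagging and sorting described above can be sketched as a single-process simulation. The function names are hypothetical and the shuffle is replaced by an in-memory sort, so this shows only the ordering trick, not how Pig implements it.

```python
from itertools import groupby


def map_phase(users, pages):
    """Tag every record with its input source: 0 = users, 1 = pages."""
    for name, age in users:
        yield (name, 0), (name, age)
    for user, url in pages:
        yield (user, 1), (user, url)


def reduce_side_join(users, pages):
    # Stand-in for the shuffle: sort on (join key, source id), group on join key only.
    shuffled = sorted(map_phase(users, pages), key=lambda kv: kv[0])
    joined = []
    for _, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        buffered = []                        # records from source 0 arrive first
        for (_, tag), record in group:
            if tag == 0:
                buffered.append(record)
            else:                            # source 1 streams by and probes the buffer
                for user_record in buffered:
                    joined.append(user_record + record)
    return joined


users = [("fred", 22), ("amy", 19)]
pages = [("fred", "p1"), ("amy", "p2"), ("fred", "p3"), ("zed", "p9")]
result = reduce_side_join(users, pages)
print(result)
```

Only one side of each key is ever held in memory; the other side is never buffered, which is what the sort on the source id buys.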
As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited by only a few users. First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have so many values that we estimate we cannot hold all of them in memory. It is about holding the values in memory, not the key. Then, at partitioning time, those keys are handled specially. All other keys are treated as in the regular join. The records for these selected keys from input1 are split across multiple reducers. The records for the same keys from input2 are replicated to each of the reducers that received part of the split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
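A minimal sketch of that partitioning follows, under simplifying assumptions that are not from the talk: the "sample" is the whole first input, the memory estimate is a plain record count, and a skewed key is spread over all reducers rather than a chosen subset. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import cycle


def find_skewed_keys(sample, max_values_in_memory):
    """Sampling pass: keys with more values than we assume fit in memory."""
    counts = Counter(key for key, _ in sample)
    return {key for key, n in counts.items() if n > max_values_in_memory}


def partition(input1, input2, skewed, num_reducers):
    """Split skewed keys of input1 across reducers; replicate them from input2."""
    reducers = defaultdict(lambda: ([], []))
    round_robin = {key: cycle(range(num_reducers)) for key in skewed}
    for key, value in input1:
        r = next(round_robin[key]) if key in skewed else hash(key) % num_reducers
        reducers[r][0].append((key, value))
    for key, value in input2:
        targets = range(num_reducers) if key in skewed else [hash(key) % num_reducers]
        for r in targets:
            reducers[r][1].append((key, value))
    return reducers


def skew_join(input1, input2, max_values_in_memory=2, num_reducers=2):
    skewed = find_skewed_keys(input1, max_values_in_memory)
    out = []
    for left, right in partition(input1, input2, skewed, num_reducers).values():
        for k1, v1 in left:
            for k2, v2 in right:
                if k1 == k2:
                    out.append((k1, v1, v2))
    return out


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("amy", "p5")]
users = [("fred", 22), ("amy", 19)]
joined = sorted(skew_join(pages, users))
print(joined)
```

Because each record of a skewed key from the first input lands on exactly one reducer and the matching records from the second input are present on every one of them, each pair is produced exactly once.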
Now let's say that for some reason you start keeping both your page view data and your user data sorted by user. One way to join such data is to make sure that pages and users are partitioned the same way. But this leads to a big problem: in order to be able to join all your data sets, you end up using the same hash function to bucket them all, and rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig's implementation doesn't depend on how the data is split. Pig does this by sampling one of the inputs and building an index from that sample, which records the key of the first record in every split. The other input is used as the standard input file for Hadoop and is split across the maps as usual. When a map begins processing its split and encounters the first key, it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins joining the two data sources.
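The index lookup and forward seek can be sketched with sorted in-memory lists standing in for the two files. The helper names, the list offsets used in place of file positions, and the fixed split size are all assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(sorted_records, split_size):
    """Record (first key, offset) for every split of the indexed input."""
    return [(sorted_records[i][0], i) for i in range(0, len(sorted_records), split_size)]


def start_offset(index, key):
    """Offset of the last split whose first key is strictly smaller than `key`."""
    first_keys = [k for k, _ in index]
    pos = bisect_left(first_keys, key) - 1
    return index[max(pos, 0)][1]


def merge_join_map(left_split, right_sorted, index):
    """One map task: join its sorted split against the indexed, sorted input."""
    if not left_split:
        return []
    j = start_offset(index, left_split[0][0])
    out = []
    for key, left_value in left_split:
        # Seek forward past keys smaller than the current one.
        while j < len(right_sorted) and right_sorted[j][0] < key:
            j += 1
        k = j
        while k < len(right_sorted) and right_sorted[k][0] == key:
            out.append((key, left_value, right_sorted[k][1]))
            k += 1
    return out


users = [("aaron", 30), ("amr", 25), ("amy", 41), ("barb", 33), ("zach", 19)]
index = build_index(users, split_size=2)
pages_split = [("amy", "p1"), ("amy", "p2"), ("barb", "p3")]
merged = merge_join_map(pages_split, users, index)
print(merged)
```

Starting from the split before the first possible match keeps the lookup correct even when records with the same key straddle a split boundary.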
Now let's say that one of the inputs, users in this case, is small enough to fit into the memory available to your map tasks. In that case, a replicated join can be used to do the join in the map itself. The large input is used as the Hadoop input to the map-reduce job, and the smaller input is loaded into memory to do the join.
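A map-side join of this kind amounts to a hash-table lookup per input record. The sketch below uses hypothetical function names and plain Python lists in place of the Hadoop input splits.

```python
from collections import defaultdict


def load_small_input(users):
    """Load the small input into an in-memory table keyed by the join key."""
    table = defaultdict(list)
    for name, age in users:
        table[name].append(age)
    return table


def replicated_join_map(pages_split, users_table):
    """One map task: stream its split of the large input and probe the table."""
    for user, url in pages_split:
        for age in users_table.get(user, []):
            yield (user, url, age)


users_table = load_small_input([("aaron", 30), ("zach", 19)])
map1 = list(replicated_join_map([("aaron", "p1"), ("amr", "p2")], users_table))
map2 = list(replicated_join_map([("zach", "p3"), ("aaron", "p4")], users_table))
print(map1, map2)
```

Every map task holds its own full copy of the small table, so no reduce phase is needed for the join.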
Very often, queries perform the same set of initial operations. In such cases, the initial steps can be shared. Scan and deserialization time can dominate the runtime of group-by queries, so sharing the initial operations can result in a nearly linear speedup.
In this case, multiple pipelines are needed in the map and reduce phases. Because of the pull-based model of our execution engine, the split and multiplex operators embed the pipelines within themselves. Records are tagged with the pipeline number in the map stage. Grouping is done by Hadoop using a union of the keys. The multiplex operator on the reducer places incoming records in the correct pipeline.
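The tagging and multiplexing can be illustrated with two group-and-count pipelines that share one scan and one filter. This is a single-process sketch with hypothetical function names; the dictionary in the reduce step stands in for Hadoop's grouping on the union of the keys.

```python
from collections import defaultdict


def map_phase(records):
    """Shared scan and filter; each record is emitted once per pipeline, tagged."""
    for name, age, gender, city, state in records:
        if name is None:                 # shared filter
            continue
        yield (0, (age, gender)), 1      # pipeline 0: group by age, gender
        yield (1, state), 1              # pipeline 1: group by state


def reduce_phase(map_output):
    """Multiplex: route each record to its pipeline by tag, then count per key."""
    pipelines = {0: defaultdict(int), 1: defaultdict(int)}
    for (pipeline, key), one in map_output:
        pipelines[pipeline][key] += one
    return dict(pipelines[0]), dict(pipelines[1])


records = [
    ("fred", 22, "m", "sf", "ca"),
    ("amy", 22, "f", "la", "ca"),
    (None, 30, "m", "ny", "ny"),
    ("barb", 22, "f", "ny", "ny"),
]
by_demo, by_state = reduce_phase(map_phase(records))
print(by_demo, by_state)
```

The input is read and deserialized once, even though two independent results are produced.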
Pig supports bags of objects. Group and cogroup produce bags, and in some cases, such as distinct or UDFs that need to access a bag as a whole (because they don't use the accumulator interface), the entire bag has to be available at once. Managing memory in Java is hard. At first, we created a MemoryManager that each large bag would register with, and the memory manager would register with the JVM for low-memory notifications. When memory was low, the memory manager would spill the large bags to disk. But sometimes the notification came too late. Now we use bags that spill to disk every time their estimated size hits a configurable limit. The spill mechanism is different for distinct bags: it involves sorting before writing to disk.
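The self-limiting bag idea can be sketched as follows. This is not Pig's bag implementation: `SpillableBag` is a hypothetical class, and it uses an item count as a stand-in for the estimated size in bytes.

```python
import pickle
import tempfile


class SpillableBag:
    """Bag that writes its in-memory items to a temp file when a preset limit is hit."""

    def __init__(self, max_items_in_memory):
        self.max_items = max_items_in_memory
        self.in_memory = []
        self.spill_files = []

    def add(self, item):
        self.in_memory.append(item)
        if len(self.in_memory) >= self.max_items:
            self._spill()

    def _spill(self):
        spill = tempfile.TemporaryFile()
        for item in self.in_memory:
            pickle.dump(item, spill)
        self.spill_files.append(spill)
        self.in_memory = []

    def __iter__(self):
        # Read spilled items back first, then whatever is still in memory.
        for spill in self.spill_files:
            spill.seek(0)
            while True:
                try:
                    yield pickle.load(spill)
                except EOFError:
                    break
        yield from self.in_memory


bag = SpillableBag(max_items_in_memory=3)
for i in range(8):
    bag.add(i)
print(len(bag.spill_files), list(bag))
```

The bag decides for itself when to spill, so it does not depend on a notification arriving in time.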
This is a list of some of the optimizations currently being worked on, and some ideas for the future. With the self-limiting bags, we are seeing fewer memory problems. But multiple bags in a query don't have a shared limit.
Apache Pig performance optimizations talk at ApacheCon 2010
How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations http://pig.apache.org Thejas Nair pig team @ Yahoo! Apache pig PMC member
What is Pig? Pig Latin, a high level data processing language. An engine that executes Pig Latin locally or on a Hadoop cluster. Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Comparison with MR in Java
- 1/20 the lines of code
- 1/16 the development time
- What about performance?
Pig Compared to Map Reduce
- Faster development time
- Data flow versus programming logic
- Many standard data operations (e.g. join) included
- Manages all the details of connecting jobs and data flow
- Copes with Hadoop version change issues
And, You Don’t Lose Power
- UDFs can be used to load, evaluate, aggregate, and store data
- External binaries can be invoked
- Metadata is optional
- Flexible data model
- Nested data types
- Explicit data flow programming
Pig performance
- PigMix: Pig vs. MapReduce
Pig optimization principles
- vs. RDBMS: accurate models for data, operators and execution environment are absent
- Use available reliable info. Trust user choice.
- Use rules that help in most cases
- Rules based on runtime information
Logical Optimizations
- Restructure the given logical dataflow graph
- Apply filter, project, limit early
- Merge foreach, filter statements
- Operator rewrites
[Figure: Script (A = load, B = foreach, C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized L. Plan (A -> C -> B)]
Physical Optimizations
- Physical plan: sequence of MR jobs having physical operators
- Built-in rules, e.g. use of combiner
- Specified in query, e.g. join type
[Figure: Optimized L. Plan (X -> Y -> Z) -> Translator -> Phy/MR plan (M(PX-PYm) R(PYr) -> M(Z)) -> Optimizer -> Optimized Phy/MR Plan (M(PX-PYm) C(PYc) R(PYr) -> M(Z))]
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Figure: Map 1 reads Pages block n and emits (1, user) records; Map 2 reads Users block m and emits (2, name) records; each map output passes through SP. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Figure: Pages and Users are both sorted, aaron … zach. Map 1 reads Pages aaron … amr and Users starting at aaron; Map 2 reads Pages amy … barb and Users starting at amy.]
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Figure: Map 1 reads Pages aaron … amr and Map 2 reads Pages amy … barb; each map holds a full copy of Users, aaron … zach.]
Group/cogroup optimizations
- On sorted and 'collected' data
- grp = group Users by name using 'collected';
[Figure: Pages sorted by key (aaron, aaron, barney, carol, …, zach). Map 1 gets aaron, aaron, barney; Map 2 gets carol, …]
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Figure: A: load -> B: filter, which feeds two branches: C1: group -> eval udf -> store into 'bydemo', and C2: group -> eval udf -> store into 'bystate']
Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange, local rearrange
reduce: multiplex -> package, package -> foreach, foreach
Memory Management
- Use disk if large objects don’t fit into memory
- JVM limit > phy mem: very poor performance
- Spill on memory threshold notification from JVM: unreliable
- Pre-set limit for large bags. Custom spill logic for different bags, e.g. distinct bag.
Other optimizations
- Aggressive use of combiner, secondary sort
- Lazy deserialization in loaders
- Better serialization format
- Faster regex lib, compiled pattern
Future optimization work
- Improve memory management
- Join + group in single MR, if same keys used
- Even better skew handling
- Adaptive optimizations
- Automated Hadoop tuning
- …
Pig - fast and flexible
- More flexibility in 0.8, 0.9
- UDFs in scripting languages (Python)
- MR job as relation
- Relation as scalar
- Turing complete Pig (0.9)
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Further reading
- Docs - http://pig.apache.org/docs/r0.7.0/
- Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
- Training videos on vimeo.com (search 'hadoop pig')