Pig on Tez - Low Latency ETL with Big Data
Presentation Transcript

  • Pig on Tez. Daniel Dai (@daijy) and Rohini Palaniswamy (@rohini_pswamy). Hadoop Summit 2014, San Jose
  • Agenda
    - Team introduction
    - Apache Pig
    - Why Pig on Tez?
    - Pig on Tez: design, Tez features in Pig, performance, current status, future plan
  • Apache Pig on Tez Team
    - Daniel Dai, Pig PMC, Hortonworks
    - Rohini Palaniswamy, Pig PMC, Yahoo!
    - Olga Natkovich, Pig PMC, Yahoo!
    - Cheolsoo Park, VP Pig, Pig PMC, Netflix
    - Mark Wagner, Pig Committer, LinkedIn
    - Alex Bain, Pig Contributor, LinkedIn
  • Pig Latin
    - Procedural scripting language, closer to relational algebra
    - Heavily used for ETL
    - Schema or no schema: Pig eats everything
    - More than SQL and feature rich: multiquery, nested foreach, illustrate, algebraic and accumulator Java UDFs, script embedding, scalars, macros, non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby), distributed order-by, skewed join
  • Pig users
    - Heavily used for ETL at web scale by major internet companies
    - At Yahoo!: 60% of total Hadoop jobs run daily; 12 million Pig jobs monthly
    - Other heavy users: Twitter, Netflix, LinkedIn, eBay, Salesforce
    - A standard data science tool, covered in university textbooks
  • Why Pig on Tez?
    - DAG execution framework
    - Low-level DAG framework: build a DAG by defining vertices and edges; customize scheduling of the DAG and routing of data
    - Highly customizable with pluggable implementations
    - Resource efficient: better performance without having to increase memory
    - Natively built on top of YARN: multi-tenancy and resource allocation come for free
    - Scale and security
    - Excellent support from the Tez community: Bikas Saha, Siddharth Seth, Hitesh Shah
  • Pig on Tez
  • Design
    - Logical Plan → Physical Plan (LogToPhyTranslationVisitor)
    - Physical Plan → Tez Plan (TezCompiler) → Tez Execution Engine
    - Physical Plan → MR Plan (MRCompiler) → MR Execution Engine
  • DAG Plan – Split Group By + Join
    - Script:
      f = LOAD 'foo' AS (x, y, z);
      g1 = GROUP f BY y;
      g2 = GROUP f BY z;
      j = JOIN g1 BY group, g2 BY group;
    - In MR: each group-by runs as its own job and writes to HDFS before the join reads the results back
    - In Tez: a single split vertex (Load foo) multiplexes multiple outputs to the two group-by vertices, which feed the join directly (reduce follows reduce); the intermediate HDFS writes and the de-multiplexing re-read are eliminated
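The dataflow on this slide can be sketched as a single-process, pure-Python stand-in (an illustration of the DAG shape, not of the Tez implementation): one scan of the input feeds both group-bys, and their results are joined on the group key.

```python
from collections import defaultdict

def split_groupby_join(rows):
    """One scan of 'foo' feeding two group-bys whose results are joined:
    the shape of this slide's DAG, where a single Load vertex multiplexes
    its output to two Group vertices that both feed the Join vertex."""
    g1 = defaultdict(list)  # g1 = GROUP f BY y
    g2 = defaultdict(list)  # g2 = GROUP f BY z
    for x, y, z in rows:    # the split: one pass writes to both outputs
        g1[y].append((x, y, z))
        g2[z].append((x, y, z))
    # j = JOIN g1 BY group, g2 BY group
    return {k: (g1[k], g2[k]) for k in g1 if k in g2}

rows = [(1, "a", "b"), (2, "a", "a"), (3, "b", "a")]
j = split_groupby_join(rows)
```

In MR the two group-bys would each re-read `foo`; here, as in the Tez plan, the input is scanned once.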
  • DAG Execution – Visualization
    - MRInput → Vertex 1 (Load) → Vertex 2 (Group) and Vertex 3 (Group) → Vertex 4 (Join) → MROutput
  • DAG Plan – Distributed Order By
    - Script:
      A = LOAD 'foo' AS (x, y);
      B = FILTER A BY $0 is not null;
      C = ORDER B BY x;
    - In MR: Load/Filter & Sample → Aggregate, then Partition → Sort as a second job; the sample map is staged on the distributed cache and intermediate data goes through HDFS
    - In Tez: the aggregated sample map is sent over a broadcast edge to the Partition vertex (and cached); a 1-1 unsorted edge connects Load/Filter to Partition, eliminating the intermediate HDFS writes
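The sampling step above exists to pick range-partition boundaries so that each sort task gets a contiguous key range. A minimal pure-Python sketch of that idea (illustrative only, with made-up helper names, not Pig's actual sampler):

```python
import bisect
import random

def build_cutpoints(sample, num_partitions):
    """Derive range-partition boundaries from a sample of the sort keys."""
    sample = sorted(sample)
    step = len(sample) / num_partitions
    # num_partitions - 1 evenly spaced quantiles become the cut points.
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, cutpoints):
    """Route a key to the partition whose range contains it."""
    return bisect.bisect_right(cutpoints, key)

random.seed(42)
data = [random.randint(0, 1000) for _ in range(10000)]
cuts = build_cutpoints(random.sample(data, 100), 4)
parts = [partition(k, cuts) for k in data]
# Each sort task then sorts its own partition; reading partitions 0..3
# in order yields a globally sorted result.
```

In the Tez plan this cut-point table is exactly what the broadcast edge ships from the sample-aggregate vertex to every partition task.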
  • Session Reuse
    - Feature: session reuse; submit more than one DAG to the same AM
    - Usage:
      - Each Pig script uses a single session
      - The Grunt shell uses one session for all commands until timeout
      - More than one DAG is submitted for merge join and 'exec'
    - Benefits:
      - A Pig script with 5 MR jobs launches 5 AM containers; a single AM per script in Tez saves capacity
      - Eliminates the queue and resource contention that every new MR job in the pipeline of a multi-stage Pig script faces
  • Container Reuse
    - Feature: container reuse; run new tasks on already launched containers (JVMs)
    - Usage: turned on by default for all Pig scripts and the Grunt shell
    - Benefits:
      - Reduced launch overhead: container request/release, resource localization, and JVM startup
      - Reduced network IO: 1-1 edge tasks are launched on the same node
      - Object caching
    - User impact: review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks, since JVM reuse carries state across tasks
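The static-variable hazard mentioned under "user impact" can be shown with a tiny pure-Python analogy (hypothetical UDF classes, not Pig's EvalFunc API): class-level state plays the role of a Java static field that survives when the JVM is reused for the next task.

```python
class BuggyCounterUDF:
    """Hypothetical UDF holding state in a class ("static") variable."""
    seen = 0  # survives across tasks when the JVM/container is reused

    def exec_(self, row):
        BuggyCounterUDF.seen += 1
        return BuggyCounterUDF.seen

class SafeCounterUDF:
    """Keeps state per instance, so each task attempt starts fresh."""
    def __init__(self):
        self.seen = 0

    def exec_(self, row):
        self.seen += 1
        return self.seen

# Simulate two tasks running back-to-back in one reused "JVM":
task1 = [BuggyCounterUDF().exec_(r) for r in range(3)]
task2 = [BuggyCounterUDF().exec_(r) for r in range(3)]  # inherits task1's count
```

Under MR a fresh JVM per task masked bugs like `BuggyCounterUDF`; under Tez container reuse they surface, which is why the slide asks users to audit their UDFs.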
  • Custom Vertex Input/Output/Processor/Manager
    - Features: custom vertex processor; custom input and output between vertices; custom vertex manager
    - Usage:
      - PigProcessor instead of MapProcessor and ReduceProcessor
      - Unsorted input/output: with a partitioner for union; without a partitioner for broadcast edges (replicated join, order-by, skewed join) and 1-1 edges (order-by, skewed join, multiquery off)
      - Custom vertex manager for automatic parallelism estimation
    - Benefits: no framework restrictions like MR; more efficient processing and algorithms
  • Broadcast Edge and Object Caching
    - Features:
      - Broadcast edge: broadcast the same data to all tasks in successor vertices
      - Object caching: cache objects in memory with vertex, DAG, or session scope
      - Input fetch on choice
    - Usage: the small table of a replicated join; order-by and skewed join partitioning samples
    - Benefits:
      - Replaces the distributed cache and avoids the NodeManager localization bottleneck
      - Avoids input fetching when the cached object survives container reuse
      - Performance gains of up to 3x in tests for replicated join on smaller clusters with high container reuse
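The replicated-join usage above boils down to a broadcast hash join. A minimal single-process sketch of the idea (illustrative names; the real work is distributed across join tasks):

```python
def replicated_join(big, small, big_key, small_key):
    """Hash-join a large relation against a small broadcast relation.

    `small` is built into an in-memory hash table once per task, the way a
    broadcast edge ships the small table to every join task; the big side
    is then streamed through without being shuffled."""
    table = {}
    for row in small:
        table.setdefault(row[small_key], []).append(row)
    out = []
    for row in big:  # stream the big side
        for match in table.get(row[big_key], []):
            out.append({**row, **match})
    return out

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
clicks = [{"uid": 1, "url": "/a"}, {"uid": 1, "url": "/b"}, {"uid": 3, "url": "/c"}]
joined = replicated_join(clicks, users, "uid", "uid")
```

With object caching, the hash table for `small` can also survive across tasks on a reused container, which is where the up-to-3x gains cited above come from.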
  • Vertex Groups
    - Feature: vertex grouping; group multiple vertices into one vertex group that produces a combined output
    - Usage: the UNION operator
      A = LOAD 'a';
      B = LOAD 'b';
      C = UNION A, B;
      D = GROUP C BY $0;
      (Load A and Load B form a vertex group whose combined output feeds GROUP directly)
    - Benefits: better performance by eliminating an additional vertex; gains of 1.2x to 2x over MR
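The effect of the vertex group can be sketched in a few lines of pure Python (an analogy, not the Tez mechanism): the union never materializes as its own step, both inputs feed the grouping directly.

```python
from itertools import chain
from collections import Counter

def union_group_count(a, b):
    """C = UNION A, B; D = GROUP C BY $0 (counting rows per key).

    chain() stands in for the vertex group: both "load" iterables feed the
    grouping directly, with no intermediate union relation materialized."""
    return Counter(row[0] for row in chain(a, b))

counts = union_group_count([("x", 1)], [("x", 2), ("y", 3)])
```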
  • Dynamic Parallelism
    - Determining parallelism beforehand is hard, so adjust it dynamically at runtime
    - Tez VertexManagerPlugin: a custom policy for determining parallelism at runtime, plus a library of common policies such as ShuffleVertexManager
  • Dynamic Parallelism – ShuffleVertexManager
    - Stock VertexManagerPlugin from Tez, used by group-by, hash join, etc.
    - Dynamically reduces the parallelism of a vertex based on estimated input size (in the slide's example, a join vertex configured with 4 tasks is shrunk to 2 at runtime)
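A simplified sketch of this kind of policy (hypothetical function and threshold, not the actual ShuffleVertexManager code): given the bytes the upstream tasks report, shrink the configured task count so each task receives a healthy share, and never raise it.

```python
def estimate_parallelism(estimated_input_bytes, configured_tasks,
                         desired_bytes_per_task=256 * 1024 * 1024):
    """Shrink a vertex's task count based on estimated shuffle input.

    Mirrors the slide's behavior: parallelism is only ever reduced from
    the configured value, never increased."""
    needed = -(-estimated_input_bytes // desired_bytes_per_task)  # ceil div
    return max(1, min(configured_tasks, needed))

# 100 MB of shuffle data does not need 4 reducers at ~256 MB each:
small = estimate_parallelism(100 * 1024 * 1024, configured_tasks=4)
# 2 GB keeps all 4 configured reducers:
large = estimate_parallelism(2 * 1024**3, configured_tasks=4)
```

The PartitionerDefinedVertexManager on the next slide differs precisely in that it may also increase parallelism.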
  • Dynamic Parallelism – PartitionerDefinedVertexManager
    - Custom VertexManagerPlugin used by order-by and skewed join
    - Dynamically increases or decreases parallelism based on input size: the sample-aggregate vertex calculates the parallelism, and the partition and sort vertices are reconfigured accordingly
  • Performance numbers (MR vs Tez, time in minutes)
    - Prod script 1: 1.5x speedup, 1 MR job, 3172 vs 3172 tasks, 28 vs 18 min
    - Prod script 2: 2.1x speedup, 12 MR jobs, 966 vs 941 tasks, 11 vs 5 min
    - Prod script 3: 1.5x speedup, 4 MR jobs on 8.4 TB input, 21397 vs 21382 tasks, 50 vs 35 min
    - Prod script 4: 2% speedup, 4 MR jobs on 25.2 TB input, 101864 vs 101856 tasks, 74 vs 72 min
  • Performance numbers (MR vs Tez, time in minutes)
    - Prod script 1: 2.52x speedup, 5 MR jobs, 25 vs 10 min
    - Prod script 2: 2.02x speedup, 5 MR jobs, 34 vs 16 min
    - Prod script 3: 2.22x speedup, 12 MR jobs, 1h 46m vs 48 min
    - Prod script 4: 1.75x speedup, 15 MR jobs, 2h 22m vs 1h 21m
  • Lipstick from Netflix (screenshot slide)
  • Performance Numbers – Interactive Query (TPC-H Q10, time in seconds)
    - Speedup of Tez over MR by input size: 10G: 2.49x, 5G: 3.41x, 1G: 4.89x, 500M: 6x
    - When the input data is small, latency dominates
    - Tez significantly reduces latency through session/container reuse
  • Performance Numbers – Iterative Algorithm (k-means)
    - Pig can implement iterative algorithms using embedding
    - Iterative algorithms are ideal for container reuse
    - Example: k-means; each iteration after the first takes 1.48s on average (vs 27s per iteration for MR)
    - Speedup of Tez over MR by iteration count: 10: 5.37x, 50: 13.12x, 100: 14.84x
    - Source code: http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
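The control flow of such an embedded script can be sketched in pure Python. This is an analogy only: in real Pig embedding, the driver is a Jython script and each loop iteration submits a Pig job (assignment via a nearest-centroid UDF, update via GROUP and AVG), while here both steps run in-process on a 1-D toy.

```python
def kmeans_1d(points, centroids, max_iter=100, tol=1e-6):
    """Driver loop of 1-D k-means, mirroring a Pig embedding script:
    run an iteration, check convergence, repeat."""
    for _ in range(max_iter):
        # Assignment step (in Pig: FOREACH with a nearest-centroid UDF).
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step (in Pig: GROUP by centroid, then AVG).
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in clusters.items()]
        if all(abs(a - b) < tol for a, b in zip(centroids, new)):
            break
        centroids = new
    return sorted(centroids)

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

Because every iteration launches the same tasks again, a reused container (with its warm JVM and cached objects) pays the startup cost only once, which is why the speedup grows with the iteration count.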
  • Performance is proportional to …
    - Number of stages in the DAG: the more stages, the better Tez does versus MR, because map read stages are eliminated
    - Size of intermediate output: the larger the intermediate output, the better Tez does, due to reduced HDFS usage
    - Cluster/queue capacity: the more congested the queue, the better Tez does, thanks to container reuse
    - Size of data in the job: for smaller data and more stages, Tez does better, because launch overhead is a larger fraction of total time for small jobs
  • Where are we?
    - 90% feature parity with Pig on MR; missing: local mode (TEZ-235) and rarely used operators (MAPREDUCE for native MapReduce jobs, collected cogroup)
    - 98% of ~1300 e2e tests pass
    - 35% of ~2850 unit tests pass; porting the rest is pending on Tez local mode
    - The Tez branch is merged into trunk and will be part of the Pig 0.14 release
    - Netflix has Lipstick working with Pig on Tez (credits: Jacob Perkins, Cheolsoo Park)
  • User Impact
    - Tez: zero-pain deployment; install the Tez library on local disk and copy it to HDFS
    - Pig: no-pain migration from Pig on MR to Pig on Tez
      - Existing scripts work as-is without any modification
      - Only two additional steps to execute in Tez mode:
        export TEZ_HOME=/tez-install-location
        pig -x tez myscript.pig
      - Users should review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to JVM reuse
  • What next?
    - Support for Tez local mode; all unit tests ported
    - Improve stability, usability, and debuggability
    - Apache release: Pig 0.14 with Tez, released by Sep 2014
    - Deployment: in research at Yahoo! by early Q3; in production at Yahoo! and Netflix by Q3/Q4
    - Performance: from 1.2x-3x today to 1.5x-5x by Q4
  • Tez Features – WIP
    - Tez UI: an Application Master UI and a job history UI are in the works, integrating via the Application Timeline Server; currently only AM logs are easily viewable, and task logs are available only by grepping the AM log for their URLs
    - Tez local mode
    - Tez AM recovery: checkpointing and resuming on AM failure is functional but needs more work; with single-DAG execution of the whole script, AM retries can be very costly
    - Input fetch optimizations: a custom ShuffleHandler on the NodeManager; local input fetch on container reuse
  • What next – Performance?
    - Shared edges: send the same output to multiple downstream vertices
    - Multiple vertex caching
    - Unsorted shuffle for skewed join and order-by
    - Custom edge manager and data routing for skewed join
    - Group-by and join using hashing, avoiding sorting
    - Better memory management
    - Dynamic reconfiguration of the DAG: automatically determine the type of join (replicated, skewed, or hash)
  • We are hiring!!!
    - Hortonworks: stop by kiosk D5
    - Yahoo!: stop by kiosk P9, or reach out to us at bigdata@yahoo-inc.com
    Thank you!