Lipstick on Pig


Netflix engineer Jeff Magnusson talks about Pig Lipstick at Big Data Gurus meetup (http://www.meetup.com/BigDataGurus/)


  1. Putting Lipstick on Apache Pig
     Big Data Gurus Meetup, August 14, 2013
  2. Motivation
     Data should be accessible, easy to discover, and easy to process for everyone.
  3. Big Data Users at Netflix
     • Users: Analysts, Engineers
     • Desires: Self Service, Easy, Rich Toolset, Rich APIs
     • A Single Platform / Data Architecture that Serves Both Groups
  4. Netflix Data Warehouse - Storage
     • S3 is the source of truth
       – Decouples storage from processing.
       – Persistent data; multiple/transient Hadoop clusters
     • Data sources
       – Event data from cloud services via Ursula/Honu
       – Dimension data from Cassandra via Aegisthus
     • ~100 billion events processed / day
     • Petabytes of data persisted and available to queries on S3.
  5. Netflix Data Platform - Processing
     • Long running clusters: SLA and ad-hoc
     • Supplemental nightly bonus clusters for high priority ETL jobs
     • 2,000+ instances in aggregate across the clusters
  6. Netflix Hadoop Platform as a Service
     S3
     https://github.com/Netflix/genie
  7. Netflix Data Platform – Primitive Service Layer
     • Primitive, decoupled services
     • Building blocks for more complicated tools/services/apps
     • Serves 1000s of MapReduce jobs / day
     • 100+ jobs concurrently
  8. Netflix Data Platform – Tools
     • Sting (Ad-hoc Visualization)
     • Looper (Backloading)
     • Forklift (Data Movement)
     • Ignite (A/B Test Analytics)
     • Lipstick (Workflow Visualization)
     • Spock (Data Auditing)
     The tools heavily utilize services in the primitive layer and follow the same design philosophy as primitive apps:
     • RESTful API
     • Decoupled JavaScript interfaces
  9. Pig and Hive at Netflix
     • Hive
       – Ad-hoc queries
       – Lightweight aggregation
     • Pig
       – Complex data flows / ETL
       – Data movement “glue” between complex operations
  10. What is Pig?
      • A data flow language
      • Simple to learn
        – Very few reserved words
        – Comparable to a SQL logical query plan
      • Easy to extend and optimize
      • Extendable via UDFs written in multiple languages
        – Java, Python, Ruby, Groovy, JavaScript
  11. Sample Pig Script* (Word Count)

        input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

        -- Extract words from each line and put them into a pig bag
        -- datatype, then flatten the bag to get one word on each row
        words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

        -- filter out any words that are just white spaces
        filtered_words = FILTER words BY word MATCHES '\\w+';

        -- create a group for each word
        word_groups = GROUP filtered_words BY word;

        -- count the entries in each group
        word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

        -- order the records by count
        ordered_word_count = ORDER word_count BY count DESC;
        STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

      * http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
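
For readers less familiar with Pig Latin, the same data flow can be traced relation by relation in plain Python. This is only an illustrative sketch over an in-memory list, not how Pig executes the script. It assumes the filter pattern is the regular expression \w+ (one or more word characters), and the function name `word_count` and the whitespace-only tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def word_count(input_lines):
    """Mirror the relations of the Pig word-count script on an in-memory list."""
    # words: FLATTEN(TOKENIZE(line)) yields one token per row.
    # Assumption: str.split() on whitespace stands in for Pig's TOKENIZE,
    # which also splits on a few punctuation characters.
    words = [token for line in input_lines for token in line.split()]

    # filtered_words: keep tokens that match \w+ in full
    # (Pig's MATCHES requires the whole string to match).
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

    # word_groups + word_count: GROUP BY word, then COUNT each group.
    counts = Counter(filtered_words)

    # ordered_word_count: ORDER BY count DESC.
    # Ties are broken alphabetically here only to make the output deterministic.
    return sorted(((n, w) for w, n in counts.items()), key=lambda t: (-t[0], t[1]))
```

Each intermediate list corresponds to one alias in the script, which is also the granularity at which a visualization of the data flow is easiest to read.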
  12. A Typical Pig Script
  13. Pig…
      • Data flows are easy & flexible to express in text
        – Facilitates code reuse via UDFs and macros
        – Allows logical grouping of operations vs grouping by order of execution.
        – But errors are easy to make and overlook.
      • Scripts can quickly get complicated
      • Visualization quickly draws attention to:
        – Common errors
        – Execution order / logical flow
        – Optimization opportunities
  14. Lipstick
      • Generates graphical representations of Pig data flows.
      • Compatible with Apache Pig 0.11+
      • Has been used to monitor more than 25,000 Pig jobs at Netflix
  15. Lipstick
  16. Overall Job Progress
  17. Logical Plan; Overall Job Progress
  18. Logical Operator (reduce side); Logical Operator (map side); Map/Reduce Job; Intermediate Row Count; Records Loaded
  19. Hadoop Counters
  20. Lipstick for Fast Development
      • During development:
        – Keep track of data flow
        – Spot common errors
          • Omitted (hanging) operators
          • Data type issues
        – Easily estimate and optimize complexity
          • Number of MR jobs generated
          • Map only vs full Map/Reduce jobs
          • Opportunities to rejigger logic to:
            – Combine multiple jobs into a single job
            – Manipulate execution order to achieve better parallelism (e.g. less blocking)
  21. Lipstick for Job Monitoring
      • During execution:
        – Graphically monitor execution status from a single console
        – Spot optimization opportunities
          • Map vs reduce side joins
          • Data skew
          • Better parallelism settings
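
To make "data skew" concrete: one simple way to quantify it from per-reducer record counts (the kind of number Hadoop counters expose) is the ratio of the busiest reducer to the median reducer. The sketch below is an assumption for illustration, not Lipstick's logic; the function names and the threshold of 5.0 are arbitrary.

```python
from statistics import median


def skew_ratio(records_per_reducer):
    """Return max/median of the per-reducer record counts (1.0 means perfectly even)."""
    if not records_per_reducer:
        raise ValueError("need at least one reducer")
    mid = median(records_per_reducer)
    if mid == 0:
        # Most reducers received nothing: either everything is empty (no skew)
        # or a few reducers received all of the data (extreme skew).
        return float("inf") if max(records_per_reducer) > 0 else 1.0
    return max(records_per_reducer) / mid


def is_skewed(records_per_reducer, threshold=5.0):
    """Flag a job whose busiest reducer handles `threshold` times the median load."""
    return skew_ratio(records_per_reducer) >= threshold
```

A ratio far above 1 suggests that a few keys dominate the group or join, which is the situation where a map side join or a different parallelism setting may be worth trying.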
  22. Lipstick for Support
      • Empowers users to support themselves
        – Better operational visibility
          • What is my script currently doing?
          • Why is my script slow?
        – Examine intermediate output of jobs
        – All execution information in one place
      • Facilitates communication between infrastructure / support teams and end users
        – Lipstick link contains all information needed to provide support.
  23. Lipstick Architecture
      • Pig 0.11+ with lipstick-console.jar
      • Lipstick Server (RESTful Grails app)
      • JavaScript Client (Frontend GUI)
      • RDS Persistence
  24. Lipstick Architecture - Console
      • Implements PigProgressNotificationListener interface
      • Listens for:
        1. New statements to be registered (unoptimized plan)
        2. Script launched event (optimized, physical, M/R plan)
        3. MR Job completion/failure event
        4. Heartbeat progress (during execution)
      • Pig Plans and Progress → Lipstick objects
      • Communicates with Lipstick Server
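
The four events on this slide can be pictured as a small listener that converts each notification into a message for the server. The real console implements Pig's Java PigProgressNotificationListener interface; the Python class below is a toy stand-in, and every method name, payload field, and the `send` callable are hypothetical and do not reproduce Pig's or Lipstick's actual signatures.

```python
class ConsoleListenerSketch:
    """Toy stand-in for the console: turns Pig-style events into server messages."""

    def __init__(self, send):
        # send(kind, payload) is a hypothetical transport, e.g. an HTTP call.
        self.send = send
        self.total_jobs = 0
        self.jobs_done = 0

    def on_statement_registered(self, unoptimized_plan):
        # Event 1: a new statement was registered (unoptimized plan).
        self.send("plan", {"unoptimized": unoptimized_plan})

    def on_script_launched(self, optimized, physical, mapreduce, total_jobs):
        # Event 2: the script launched, so the compiled plans are now known.
        self.total_jobs = total_jobs
        self.send("plan", {"optimized": optimized,
                           "physical": physical,
                           "mapreduce": mapreduce})

    def on_job_finished(self, job_id, succeeded):
        # Event 3: a MapReduce job completed or failed.
        self.jobs_done += 1
        self.send("progress", {"job": job_id,
                               "status": "finished" if succeeded else "failed",
                               "jobs_done": self.jobs_done,
                               "total_jobs": self.total_jobs})

    def on_heartbeat(self, job_id, percent):
        # Event 4: periodic progress while a job is running.
        self.send("progress", {"job": job_id, "status": "running", "percent": percent})
```

Passing the transport in as a callable keeps the listener independent of how messages reach the server, which mirrors the decoupling the deck emphasizes.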
  25. Pig Compilation Plans
      Pig Script
      → Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
      → Optimized Logical Plan
      → Physical Plan
      → MapReduce Plan (grouping of Physical Operators into map or reduce jobs)
      Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
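
Once a logical-to-physical relationship is known, associating logical operators with MapReduce jobs amounts to composing two mappings. The sketch below assumes both mappings are already available as dictionaries; how Lipstick actually infers the first one is not described on the slide, and the function name and operator names are made up for the example.

```python
from collections import defaultdict


def jobs_for_logical_operators(logical_to_physical, physical_to_job):
    """Compose logical->physical and physical->job into logical->set of jobs."""
    result = defaultdict(set)
    for logical_op, physical_ops in logical_to_physical.items():
        for physical_op in physical_ops:
            job = physical_to_job.get(physical_op)
            if job is not None:
                # One logical operator may span several jobs (e.g. a sort),
                # so collect a set rather than a single job id.
                result[logical_op].add(job)
    return dict(result)
```

Logical operators whose physical operators do not appear in any job are simply left out of the result, which is one way such an operator could be surfaced to the user as not executing.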
  26. Lipstick Architecture - Server
      • Simple REST interface
      • It’s a Grails app!
      • Pig client posts plans and puts progress
      • JavaScript client
        – Gets plans and progress
        – Searches jobs by job name and user name
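
The division of labor on this slide (the Pig client POSTs a plan once and PUTs progress repeatedly; the JavaScript client GETs both and searches) can be modeled with a small in-memory store. This is a sketch of the semantics only: the real server is a Grails app with RDS persistence, and the class name, method names, and record fields here are hypothetical rather than Lipstick's actual endpoints.

```python
class PlanStoreSketch:
    """In-memory model of the post/put/get/search roles described on the slide."""

    def __init__(self):
        self._jobs = {}
        self._next_id = 1

    def post_plan(self, user, job_name, plan):
        # POST: create a new job record and hand back its id.
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"user": user, "job_name": job_name,
                              "plan": plan, "progress": {}}
        return job_id

    def put_progress(self, job_id, progress):
        # PUT: overwrite the latest progress; raises KeyError for unknown ids.
        self._jobs[job_id]["progress"] = progress

    def get_job(self, job_id):
        # GET: what a frontend would fetch to draw and annotate the graph.
        return self._jobs[job_id]

    def search(self, job_name=None, user=None):
        # Search by job name (substring match) and/or exact user name.
        return sorted(
            job_id for job_id, job in self._jobs.items()
            if (job_name is None or job_name in job["job_name"])
            and (user is None or job["user"] == user)
        )
```

Because progress is replaced rather than appended, repeating the same PUT leaves the record unchanged, which suits a client that reports on every heartbeat.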
  27. Lipstick Architecture – JS Client
      • Displays and annotates graphs with status / progress
      • Completely decoupled from Server
      • Event based design
      • Periodically polls Server for job progress
      • Usability is a key focus
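
Periodic polling, as described for the client, reduces to a loop that fetches progress, hands it to a callback, and stops once the job reaches a terminal state. The actual client is written in JavaScript; the Python sketch below only illustrates the pattern, and the function name, the `status` field, and its values are assumptions.

```python
import time


def poll_until_done(fetch_progress, on_update, interval_seconds=5.0,
                    sleep=time.sleep, max_polls=None):
    """Poll fetch_progress() until the job reports a terminal status.

    Returns the final progress dict, or None if max_polls is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        progress = fetch_progress()
        on_update(progress)  # e.g. redraw the graph annotations
        polls += 1
        if progress.get("status") in ("finished", "failed"):
            return progress
        sleep(interval_seconds)
    return None
```

Injecting `fetch_progress` and `sleep` keeps the loop free of any network or timing dependency, in the spirit of a client that is completely decoupled from the server.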
  28. Solving Problems with Lipstick - Common Problem #1: My job has stalled.
  29. Unoptimized/Optimized Logical Plan Toggle; Dangling Operator
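
A dangling operator is an alias that is assigned but never consumed by a later statement or stored, so the optimizer drops it and no data flows through it. A rough way to spot such aliases in script text is sketched below. It is a deliberately naive, regex-based assumption for illustration (it ignores macros, nested blocks, and '--' inside quoted strings) and is not how Lipstick detects them, which is by comparing plans visually.

```python
import re


def dangling_aliases(script):
    """Return aliases that are assigned but never read by any statement."""
    # Drop '--' line comments, then blank out quoted strings so that paths
    # and patterns are not mistaken for alias references.
    # Assumption: no '--' appears inside a quoted string.
    text = re.sub(r"--[^\n]*", "", script)
    text = re.sub(r"'[^']*'", "''", text)

    defined, used = [], set()
    for statement in text.split(";"):
        match = re.match(r"\s*(\w+)\s*=(?!=)\s*(.*)", statement, re.DOTALL)
        if match:
            # "alias = <expression>": the alias is defined, the rest is read.
            defined.append(match.group(1))
            body = match.group(2)
        else:
            # STORE, DUMP, and similar statements only read aliases.
            body = statement
        used.update(re.findall(r"\w+", body))

    # Any identifier match counts as a use, so the check errs on the side
    # of reporting too few dangling aliases rather than too many.
    return [alias for alias in defined if alias not in used]
```

For a script that loads `a`, derives `b` and `c` from it, and stores only `c`, the function reports `b`.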
  30. Common Problem #2: I didn’t get the data I was expecting.
  31. Common Problem #3: I don’t understand why my job failed.
  32. Failed Job (light red background); Successful Job (light blue background)
  33. Future of Lipstick
      • Annotate common errors and inefficiencies on the graph
        – Skew / map side join opportunities / scalar issues
        – E.g. Warnings / error dashboard
      • Provide better details of runtime performance
        – Timings annotated on graph
        – Min / median / max mapper and reducer times
        – Map / reduce completion over time
      • Search through execution history
        – Examine trends in runtime and data volumes
        – History of failure / success
      • Search jobs for commonalities
        – Common datasets loaded / saved
        – Better grasp data lineage
        – Common uses of UDFs and macros
  34. Lipstick on Hive: Honey?
  35. A closer look…
  36. Wrapping up
      • Lipstick is part of Netflix OSS.
      • Clone it on GitHub at http://github.com/Netflix/Lipstick
      • Check out the quickstart guide
        – https://github.com/Netflix/Lipstick/wiki/Getting-Started#1-quick-start
        – Get started playing with Lipstick in under 5 minutes!
      • We happily welcome your feedback and contributions!
  37. Thank you!
      Jeff Magnusson: jmagnusson@netflix.com | http://www.linkedin.com/in/jmagnuss | @jeffmagnusson
      Jobs: http://jobs.netflix.com
      Netflix OSS: http://netflix.github.io
      Tech Blog: http://techblog.netflix.com/