SlideShare a Scribd company logo
Putting Lipstick on Apache Pig
Big Data Gurus Meetup
August 14, 2013
Data should be accessible, easy to discover, and
easy to process for everyone.
Motivation
Big Data Users at Netflix
Analysts Engineers
Desires
Self Service
Easy
Rich Toolset Rich APIs
A Single Platform / Data Architecture that Serves Both Groups
Netflix Data Warehouse - Storage
S3 is the source of truth
Decouples storage from
processing.
Persistent data; multiple/
transient Hadoop clusters
Data sources
Event data from cloud
services via Ursula/Honu
Dimension data from
Cassandra via Aegisthus
~100 billion events processed
/ day
Petabytes of data persisted
and available to queries on
S3.
Netflix Data Platform - Processing
Long running clusters
sla and ad-hoc
Supplemental nightly
bonus clusters
For high priority ETL jobs
2,000+ instances in
aggregate across the
clusters
Netflix Hadoop Platform as a Service
S3
https://github.com/Netflix/genie
Netflix Data Platform – Primitive
Service Layer
Primitive, decoupled services
Building blocks for more
complicated
tools/services/apps
Serves 1000s of MapReduce
Jobs / day
100+ jobs concurrently
Netflix Data Platform – Tools
Sting
(Adhoc
Visualization)
Looper
(Backloading)
Forklift
(Data Movement)
Ignite
(A/B Test Analytics)
Lipstick
(Workflow
Visualization)
Spock
(Data Auditing)
Heavily utilize services in the
primitive layer.
Follow the same design
philosophy as primitive apps:
RESTful API
Decoupled javascript interfaces
Pig and Hive at Netflix
• Hive
– AdHoc queries
– Lightweight aggregation
• Pig
– Complex Dataflows / ETL
– Data movement “glue” between complex
operations
What is Pig?
• A data flow language
• Simple to learn
– Very few reserved words
– Comparable to a SQL logical query plan
• Easy to extend and optimize
• Extendable via UDFs written in multiple
languages
– Java, Python, Ruby, Groovy, Javascript
Sample Pig Script* (Word Count)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES 'w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS
word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
* http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
A Typical Pig Script
Pig…
• Data flows are easy & flexible to express in text
– Facilitates code reuse via UDFs and macros
– Allows logical grouping of operations vs grouping by order
of execution.
– But errors are easy to make and overlook.
• Scripts can quickly get complicated
• Visualization quickly draws attention to:
– Common errors
– Execution order / logical flow
– Optimization opportunities
Lipstick
• Generates graphical
representations of Pig data flows.
• Compatible with Apache Pig v11+
• Has been used to monitor more
than 25,000 Pig jobs at Netflix
Lipstick
Overall Job
Progress
Logical
Plan
Overall Job
Progress
Logical Operator
(reduce side)
Logical Operator
(map side)
Map/Reduce Job
Intermediate Row Count
Records
Loaded
Hadoop
Counters
Lipstick for Fast Development
• During development:
– Keep track of data flow
– Spot common errors
• Omitted (hanging) operators
• Data type issues
– Easily estimate and optimize complexity
• Number of MR jobs generated
• Map only vs full Map/Reduce jobs
• Opportunities to rejigger logic to:
– Combine multiple jobs into a single job
– Manipulate execution order to achieve better parallelism (e.g.
less blocking)
Lipstick for Job Monitoring
• During execution:
– Graphically monitor execution status from a single
console
– Spot optimization opportunities
• Map vs reduce side joins
• Data skew
• Better parallelism settings
Lipstick for Support
• Empowers users to support themselves
– Better operational visibility
• What is my script currently doing?
• Why is my script slow?
– Examine intermediate output of jobs
– All execution information in one place
• Facilitates communication between
infrastructure / support teams and end users
– Lipstick link contains all information needed to
provide support.
Lipstick Architecture
Pig v11+
lipstick-console.jar
Lipstick Server
(RESTful
Grails app)
Javascript Client
(Frontend GUI)
RDS
Persistence
Lipstick Architecture - Console
• Implements PigProgressNotificationListener interface
• Listens for:
1. New statements to be registered (unoptimized plan)
2. Script launched event (optimized, physical, M/R plan)
3. MR Job completion/failure event
4. Heartbeat progress (during execution)
• Pig Plans and Progress  Lipstick objects
• Communicates with Lipstick Server
Pig Compilation Plans
Optimized Logical Plan
Physical Plan
MapReduce Plan
(grouping of Physical Operators into
map or reduce jobs)
Pig Script
Unoptimized Logical Plan
(~1:1 logical operator / line of Pig)
Lipstick associates Logical Operators
with MapReduce jobs by inferring
relationships between Logical and
Physical Operations.
Lipstick Architecture - Server
• Simple REST interface
• It’s a Grails app!
• Pig client posts plans and puts progress
• Javascript client
• gets plans and progress
• Searches jobs by job name and user name
Lipstick Architecture – JS Client
• Displays and annotates graphs with status / progress
• Completely decoupled from Server
• Event based design
• Periodically polls Server for job progress
• Usability is a key focus
My Job has stalled.
Solving Problems with Lipstick -
Common Problem #1
Unoptimized/Optimized
Logical Plan Toggle
Dangling
Operator
I didn’t get the data I was expecting
Common Problem #2
I don’t understand why my job failed.
Common Problem #3
Failed Job
(light red background)
Successful Job
(light blue background)
Future of Lipstick
• Annotate common errors and inefficiencies on the graph
– Skew / map side join opportunities / scalar issues
– E.g. Warnings / error dashboard
• Provide better details of runtime performance
– Timings annotated on graph
– Min / median / max mapper and reducer times
– Map / reduce completion over time
• Search through execution history
– Examine trends in runtime and data volumes
– History of failure / success
• Search jobs for commonalities
– Common datasets loaded / saved
– Better grasp data lineage
– Common uses of UDFs and macros
Lipstick on Hive
Honey?
A closer look…
Wrapping up
• Lipstick is part of Netflix OSS.
• Clone it on github at
http://github.com/Netflix/Lipstick
• Check out the quickstart guide
– https://github.com/Netflix/Lipstick/wiki/Getting-
Started#1-quick-start
– Get started playing with Lipstick in under 5 minutes!
• We happily welcome your feedback and
contributions!
 Jeff Magnusson:
jmagnusson@netflix.com | http://www.linkedin.com/in/jmagnuss |@jeffmagnusson
Thank you!
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/

More Related Content

What's hot

Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopDataWorks Summit
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
Sandish Kumar H N
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposalQubole
 
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Databricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Databricks
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Databricks
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
ABC Talks
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Spark Summit
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
Spark Summit
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be Streamed
Databricks
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Databricks
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
Anirudh Gangwar
 

What's hot (20)

Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposal
 
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be Streamed
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 

Viewers also liked

Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
DataWorks Summit
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
knowbigdata
 
Apache spark session
Apache spark sessionApache spark session
Apache spark session
knowbigdata
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
odsc
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
karthika karthi
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
karthika karthi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Fetal Pig Dissection
Fetal Pig DissectionFetal Pig Dissection
Fetal Pig Dissection
Lumen Learning
 
Lipstick project
Lipstick projectLipstick project
Lipstick project
Trung Milanô
 
Apache hadoop pig overview and introduction
Apache hadoop pig overview and introductionApache hadoop pig overview and introduction
Apache hadoop pig overview and introduction
BigClasses Com
 

Viewers also liked (12)

Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Apache spark session
Apache spark sessionApache spark session
Apache spark session
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Fetal Pig Dissection
Fetal Pig DissectionFetal Pig Dissection
Fetal Pig Dissection
 
Lipstick project
Lipstick projectLipstick project
Lipstick project
 
Apache hadoop pig overview and introduction
Apache hadoop pig overview and introductionApache hadoop pig overview and introduction
Apache hadoop pig overview and introduction
 

Similar to Lipstick On Pig

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
Sathish24111
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
Aamir Ameen
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
ajay_ei
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
George Long
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
Greg Hanchin
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Lyft talks #4 Orchestrating big data and ML pipelines at Lyft
Lyft talks #4 Orchestrating big data and ML pipelines at LyftLyft talks #4 Orchestrating big data and ML pipelines at Lyft
Lyft talks #4 Orchestrating big data and ML pipelines at Lyft
Constantine Slisenka
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
Jon Meredith
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
Amazon Web Services
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
Jim Dowling
 
Hadoop
HadoopHadoop
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
Kamran Munshi
 
How to pinpoint and fix sources of performance problems in your SAP BusinessO...
How to pinpoint and fix sources of performance problems in your SAP BusinessO...How to pinpoint and fix sources of performance problems in your SAP BusinessO...
How to pinpoint and fix sources of performance problems in your SAP BusinessO...
Xoomworks Business Intelligence
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 

Similar to Lipstick On Pig (20)

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Lyft talks #4 Orchestrating big data and ML pipelines at Lyft
Lyft talks #4 Orchestrating big data and ML pipelines at LyftLyft talks #4 Orchestrating big data and ML pipelines at Lyft
Lyft talks #4 Orchestrating big data and ML pipelines at Lyft
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
Hadoop
HadoopHadoop
Hadoop
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
How to pinpoint and fix sources of performance problems in your SAP BusinessO...
How to pinpoint and fix sources of performance problems in your SAP BusinessO...How to pinpoint and fix sources of performance problems in your SAP BusinessO...
How to pinpoint and fix sources of performance problems in your SAP BusinessO...
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 

More from bigdatagurus_meetup

Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
bigdatagurus_meetup
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
bigdatagurus_meetup
 
Big data beyond the hype may 2014
Big data beyond the hype may 2014Big data beyond the hype may 2014
Big data beyond the hype may 2014
bigdatagurus_meetup
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)
bigdatagurus_meetup
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
bigdatagurus_meetup
 
Scaling HBase at Pinterest
Scaling HBase at PinterestScaling HBase at Pinterest
Scaling HBase at Pinterest
bigdatagurus_meetup
 
Continuuity Weave
Continuuity WeaveContinuuity Weave
Continuuity Weave
bigdatagurus_meetup
 
Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)
bigdatagurus_meetup
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
bigdatagurus_meetup
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
bigdatagurus_meetup
 
Cloudera Developer Kit (CDK)
Cloudera Developer Kit (CDK)Cloudera Developer Kit (CDK)
Cloudera Developer Kit (CDK)
bigdatagurus_meetup
 

More from bigdatagurus_meetup (11)

Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
Big data beyond the hype may 2014
Big data beyond the hype may 2014Big data beyond the hype may 2014
Big data beyond the hype may 2014
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
 
Scaling HBase at Pinterest
Scaling HBase at PinterestScaling HBase at Pinterest
Scaling HBase at Pinterest
 
Continuuity Weave
Continuuity WeaveContinuuity Weave
Continuuity Weave
 
Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Cloudera Developer Kit (CDK)
Cloudera Developer Kit (CDK)Cloudera Developer Kit (CDK)
Cloudera Developer Kit (CDK)
 

Recently uploaded

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 

Recently uploaded (20)

SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 

Lipstick On Pig

  • 1. Putting Lipstick on Apache Pig Big Data Gurus Meetup August 14, 2013
  • 2. Data should be accessible, easy to discover, and easy to process for everyone. Motivation
  • 3. Big Data Users at Netflix Analysts Engineers Desires Self Service Easy Rich Toolset Rich APIs A Single Platform / Data Architecture that Serves Both Groups
  • 4. Netflix Data Warehouse - Storage S3 is the source of truth Decouples storage from processing. Persistent data; multiple/ transient Hadoop clusters Data sources Event data from cloud services via Ursula/Honu Dimension data from Cassandra via Aegisthus ~100 billion events processed / day Petabytes of data persisted and available to queries on S3.
  • 5. Netflix Data Platform - Processing Long running clusters sla and ad-hoc Supplemental nightly bonus clusters For high priority ETL jobs 2,000+ instances in aggregate across the clusters
  • 6. Netflix Hadoop Platform as a Service S3 https://github.com/Netflix/genie
  • 7. Netflix Data Platform – Primitive Service Layer Primitive, decoupled services Building blocks for more complicated tools/services/apps Serves 1000s of MapReduce Jobs / day 100+ jobs concurrently
  • 8. Netflix Data Platform – Tools Sting (Adhoc Visualization) Looper (Backloading) Forklift (Data Movement) Ignite (A/B Test Analytics) Lipstick (Workflow Visualization) Spock (Data Auditing) Heavily utilize services in the primitive layer. Follow the same design philosophy as primitive apps: RESTful API Decoupled javascript interfaces
  • 9. Pig and Hive at Netflix • Hive – AdHoc queries – Lightweight aggregation • Pig – Complex Dataflows / ETL – Data movement “glue” between complex operations
  • 10. What is Pig? • A data flow language • Simple to learn – Very few reserved words – Comparable to a SQL logical query plan • Easy to extend and optimize • Extendable via UDFs written in multiple languages – Java, Python, Ruby, Groovy, Javascript
  • 11. Sample Pig Script* (Word Count) input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray); -- Extract words from each line and put them into a pig bag -- datatype, then flatten the bag to get one word on each row words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- filter out any words that are just white spaces filtered_words = FILTER words BY word MATCHES 'w+'; -- create a group for each word word_groups = GROUP filtered_words BY word; -- count the entries in each group word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word; -- order the records by count ordered_word_count = ORDER word_count BY count DESC; STORE ordered_word_count INTO '/tmp/number-of-words-on-internet'; * http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
  • 12. A Typical Pig Script
  • 13. Pig… • Data flows are easy & flexible to express in text – Facilitates code reuse via UDFs and macros – Allows logical grouping of operations vs grouping by order of execution. – But errors are easy to make and overlook. • Scripts can quickly get complicated • Visualization quickly draws attention to: – Common errors – Execution order / logical flow – Optimization opportunities
  • 14. Lipstick • Generates graphical representations of Pig data flows. • Compatible with Apache Pig v11+ • Has been used to monitor more than 25,000 Pig jobs at Netflix
  • 18. Logical Operator (reduce side) Logical Operator (map side) Map/Reduce Job Intermediate Row Count Records Loaded
  • 20. Lipstick for Fast Development • During development: – Keep track of data flow – Spot common errors • Omitted (hanging) operators • Data type issues – Easily estimate and optimize complexity • Number of MR jobs generated • Map only vs full Map/Reduce jobs • Opportunities to rejigger logic to: – Combine multiple jobs into a single job – Manipulate execution order to achieve better parallelism (e.g. less blocking)
  • 21. Lipstick for Job Monitoring • During execution: – Graphically monitor execution status from a single console – Spot optimization opportunities • Map vs reduce side joins • Data skew • Better parallelism settings
  • 22. Lipstick for Support • Empowers users to support themselves – Better operational visibility • What is my script currently doing? • Why is my script slow? – Examine intermediate output of jobs – All execution information in one place • Facilitates communication between infrastructure / support teams and end users – Lipstick link contains all information needed to provide support.
  • 23. Lipstick Architecture Pig v11+ lipstick-console.jar Lipstick Server (RESTful Grails app) Javascript Client (Frontend GUI) RDS Persistence
  • 24. Lipstick Architecture - Console • Implements PigProgressNotificationListener interface • Listens for: 1. New statements to be registered (unoptimized plan) 2. Script launched event (optimized, physical, M/R plan) 3. MR Job completion/failure event 4. Heartbeat progress (during execution) • Pig Plans and Progress  Lipstick objects • Communicates with Lipstick Server
  • 25. Pig Compilation Plans Optimized Logical Plan Physical Plan MapReduce Plan (grouping of Physical Operators into map or reduce jobs) Pig Script Unoptimized Logical Plan (~1:1 logical operator / line of Pig) Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
  • 26. Lipstick Architecture - Server • Simple REST interface • It’s a Grails app! • Pig client posts plans and puts progress • Javascript client • gets plans and progress • Searches jobs by job name and user name
  • 27. Lipstick Architecture – JS Client • Displays and annotates graphs with status / progress • Completely decoupled from Server • Event based design • Periodically polls Server for job progress • Usability is a key focus
  • 28. My Job has stalled. Solving Problems with Lipstick - Common Problem #1
  • 29.
  • 31. I didn’t get the data I was expecting Common Problem #2
  • 32.
  • 33.
  • 34. I don’t understand why my job failed. Common Problem #3
  • 35. Failed Job (light red background) Successful Job (light blue background)
  • 36. Future of Lipstick • Annotate common errors and inefficiencies on the graph – Skew / map side join opportunities / scalar issues – E.g. Warnings / error dashboard • Provide better details of runtime performance – Timings annotated on graph – Min / median / max mapper and reducer times – Map / reduce completion over time • Search through execution history – Examine trends in runtime and data volumes – History of failure / success • Search jobs for commonalities – Common datasets loaded / saved – Better grasp data lineage – Common uses of UDFs and macros
  • 39. Wrapping up • Lipstick is part of Netflix OSS. • Clone it on github at http://github.com/Netflix/Lipstick • Check out the quickstart guide – https://github.com/Netflix/Lipstick/wiki/Getting- Started#1-quick-start – Get started playing with Lipstick in under 5 minutes! • We happily welcome your feedback and contributions!
  • 40.  Jeff Magnusson: jmagnusson@netflix.com | http://www.linkedin.com/in/jmagnuss |@jeffmagnusson Thank you! Jobs: http://jobs.netflix.com Netflix OSS: http://netflix.github.io Tech Blog: http://techblog.netflix.com/