Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

Overview of the data-platform-as-a-service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users. From the perspective of a platform user, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in-memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS: clone it on GitHub, or learn more from our tech blog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.

Watching Pigs Fly with the
Netflix Hadoop Toolkit
Hadoop Summit 2013
San Jose, CA

Data should be accessible, easy to discover, and
easy to process for everyone.
Our Motivation

Hadoop Platform as a Service
Data Platform

Data Platform as a Service
Franklin
(Metadata API)
Sting
(Adhoc Visualization)
Forklift
(Data Movement)
Looper
(Backloading)
Ignite
(A/B Test Analytics)
Spock
(Data Auditing)
Genie
(Hadoop PaaS)
Lipstick
(Pig Workflow
Visualization)
Event Service
(Orchestration)
Hadoop
S3
Other Processing

But, what makes good recommendations?
Similarity
Personalization

We’re Sorry
COLORS!
Box art is colorful…

Hadoop Platform as a Service
S3, Cassandra, Teradata, Redshift, RDS

Data Platform as a Service
Franklin
(Metadata API)
S3, Cassandra, Teradata, Redshift, RDS

Data Platform as a Service
Franklin
(Metadata API)

Whether your dataset is large or small, being
able to visualize it makes it easier to explain.

Data Platform as a Service
Franklin
(Metadata API)
Sting
(Adhoc Visualization)

Sting
• Allows users to cache the results of a Genie job
in memory
• Sub-second response to OLAP-style operations
(slicing, dicing, aggregations)
• Adhoc / recurring schedule
• Easy to use!
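
The OLAP-style operations named above can be illustrated with a minimal Python sketch. This is not Sting's API, which the slides do not show: the function names, the row layout, and the hour counts are hypothetical, and the rows simply stand in for a Genie job result held in memory.

```python
from collections import defaultdict

# Hypothetical rows standing in for the cached output of a Genie job.
# The hour counts are made up for illustration only.
CACHED_ROWS = [
    {"country": "US", "device": "tv", "title": "House of Cards", "hours": 120},
    {"country": "US", "device": "phone", "title": "House of Cards", "hours": 30},
    {"country": "CA", "device": "tv", "title": "Hemlock Grove", "hours": 45},
    {"country": "US", "device": "tv", "title": "Arrested Development", "hours": 80},
]


def slice_rows(rows, dimension, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dimension] == value]


def dice_rows(rows, **filters):
    """Dice: restrict several dimensions at once (each filter is a set of allowed values)."""
    return [r for r in rows if all(r[d] in allowed for d, allowed in filters.items())]


def aggregate(rows, group_by, measure):
    """Aggregate: sum a numeric measure per value of one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group_by]] += r[measure]
    return dict(totals)


us_only = slice_rows(CACHED_ROWS, "country", "US")
tv_us_ca = dice_rows(CACHED_ROWS, device={"tv"}, country={"US", "CA"})
hours_by_title = aggregate(us_only, "title", "hours")
print(hours_by_title)
```

Because the rows already sit in memory, each operation is a single pass over the cached result rather than a new Hadoop job, which is the property the sub-second claim relies on.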

Hemlock
Grove
House of
Cards
Arrested
Development

# of subscribers X # of titles
= ???,000,…,000 (big data)
Big Data

Lipstick
• Allows users to visualize their data flow
• Allows users to see common errors
• Allows users to easily monitor their jobs
• Empowers users to support themselves
• Facilitates communication between
infrastructure team and users

Logical Operator
(reduce side)
Logical Operator
(map side)
Map/Reduce Job
Intermediate Row Count
Records
Loaded

Unoptimized/Optimized
Logical Plan Toggle
Dangling
Operator
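
One way to picture the dangling-operator annotation is as a check over the logical plan graph. The sketch below is not Lipstick's implementation: it assumes a dangling operator is one whose output is never consumed by another operator and never stored, and the plan dictionary and its aliases are hypothetical.

```python
# Hypothetical logical plan: alias -> (operator type, list of input aliases).
PLAN = {
    "raw": ("LOAD", []),
    "filtered": ("FILTER", ["raw"]),
    "grouped": ("GROUP", ["filtered"]),
    "counts": ("FOREACH", ["grouped"]),
    "debug_sample": ("LIMIT", ["filtered"]),  # never consumed or stored
    "out": ("STORE", ["counts"]),
}


def dangling_operators(plan):
    """Return aliases whose output feeds no other operator and is not a STORE."""
    consumed = {src for _, inputs in plan.values() for src in inputs}
    return sorted(
        alias
        for alias, (op_type, _) in plan.items()
        if alias not in consumed and op_type != "STORE"
    )


print(dangling_operators(PLAN))  # ['debug_sample']
```

This sketch treats only STORE as a legitimate sink; a fuller version would also need to account for other output statements such as DUMP.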

I didn’t get the data I was expecting
Common Problem #2

I don’t understand why my job failed.
Common Problem #3

Failed Job
(light red background)
Successful Job
(light blue background)

Wrapping up
• Demos at the Netflix booth in the exhibit hall
(see more Lipstick, Sting, and Genie).
• Lipstick is part of Netflix OSS.
• Clone it on GitHub at
http://github.com/Netflix/Lipstick
• We welcome feedback and contributions!

• Charles Smith: charsmith@netflix.com
• Jeff Magnusson: jmagnusson@netflix.com
Thank you!
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/

Building a data platform doesn’t have to be like entering a portal to Stranger Things. Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale. Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.

Putting Lipstick on Apache Pig at Netflix

Jeff Magnusson

Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013 This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts. Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.

Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...

Data Con LA

This talk explores how Netflix equips its engineers with the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house. Examples include how Netflix has enabled analysts to fluidly switch between MPP RDBMS and an auto-scaling Presto cluster, how Spark + NoSQL stores are used when deploying data sets to internal web apps, and how data scientists are enabled to work in the ML framework of their choosing and deploy models as a service.

Netflix Data Engineering @ Uber Engineering Meetup

Blake Irvine

Rapid Data Analytics @ Netflix

Data Con LA

At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility". How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company? How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos? We'll talk about how Netflix equips its business intelligence and data engineers with: the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems the freedom to find and introduce right software for the job - even if it isn't used anywhere else in-house the freedom to create and drop new tables in production without approval the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool the freedom to retire analytics and data processes whose value doesn't justify their support costs Speaker Bios Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace. Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco. Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.

Data Warehousing Patterns for Hadoop

Michelle Ufford

Slides from Michelle Ufford's Data Warehousing talk at Hadoop Summit 2015. How can we take advantage of the veritable treasure trove of data stored in Hadoop to augment our traditional data warehouses? In this session, Michelle will share her experience with migrating GoDaddy’s data warehouse to Hadoop. She’ll explore how GoDaddy has adapted traditional data warehousing methodologies to work with Hadoop and will share example ETL patterns used by her team. Topics will also include how the integration of structured and unstructured data has exposed new insights, the resulting business impact, and tips for making your own Hadoop migration project more successful. Recording available here: https://www.youtube.com/watch?v=0AxoB-wJcZc

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

Spark Summit

In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.

Spark at Airbnb

Hao Wang

I presented this at a 2014 Tableau Conference session with Albert Wong. Netflix relies on data to make decisions ranging from buying and recommending content, to improving the streaming experience on devices. This presentation shares our Big Data analytics architecture and the tools used to make data accessible throughout our business, focusing on how Tableau fits into our organization and why it aligns well with our culture.

Big Data Meets Learning Science: Keynote by Al Essa

Spark Summit

How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable by all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

Spark Summit

Realtime streaming architecture in INFINARIO

Jozo Kovac

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Spark Summit

HP ships millions of PCs, Printers, and other devices every year to customers in all market segments. More customers are seeking services provided with our products enabling new opportunities for HP to create services from the data we can collect from our devices. Every device we ship is an IoT endpoint with powerful CPU to capture rich data. Insights from this data are used internally to improve our products and focus on customer needs. In this presentation, John will focus on HP’s journey to enabling Big Data analytics from within a large enterprise environment. He will review the challenges and how HP decided on AWS, Apache Spark and Databricks as the foundation for their entry into Big Data Analytics. John will also review how HP uses Spark to build analytic services from the data they generate from their devices.

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

Databricks

The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!

No sql and sql - open analytics summitOpen Analytics

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Databricks

PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more. Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook! We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.

Using Hadoop to build a Data Quality Service for both real-time and batch data

DataWorks Summit/Hadoop Summit

Data Warehousing with Spark Streaming at Zalando

Databricks

Zalandos AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs taking all night to calculate already outdated data. Modern data integration pipelines need to deliver fast and easy to consume data sets in high quality. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely-used master data as S3 or Kafka streams and snapshots at the same time. The talk will cover challenges in our fashion data platform and a detailed architectural deep dive about separation of integration from enrichment, providing streams as well as snapshots and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta’s MERGE command, Scala API vs Spark SQL and schema evolution give more insights and guidance for similar use cases.

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

Spark Summit

Apache Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg’s Spark Server as a server runtime for online analytics. Our framework implements certain useful patterns applicable to online query processing and is centered on the idea of “Managed” DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting where Spark is required to respond to queries with high SLAs. In this talk, we try to identify specific areas where slow-down or failures can result in the largest hits on online-query performance and potential solutions to address these.

Disrupting Big Data with Apache Spark in the Cloud

Jen Aman

Shifting Data Science into High Gear

Spark Summit

Bridging the Gap Between Datasets and DataFrames

Databricks

Apple leverages Apache Spark for processing large datasets to power key components of Apple's production services. The majority of users rely on Spark SQL to benefit from state-of-the-art optimizations in Catalyst and Tungsten. As there are multiple APIs to interact with Spark SQL, users have to make a wise decision which one to pick. While DataFrames and SQL are widely used, they lack type safety so that the analysis errors will not be detected during the compile time such as invalid column names or types. Also, the ability to apply the same functional constructions as on RDDs is missing in DataFrames. Datasets expose a type-safe API and support for user-defined closures at the cost of performance. This talk will explain cases when Spark SQL cannot optimize typed Datasets as much as it can optimize DataFrames. We will also present an effort to use bytecode analysis to convert user-defined closures into native Catalyst expressions. This helps Spark to avoid the expensive conversion between the internal format and JVM objects as well as to leverage more Catalyst optimizations. A consequence, we can bridge the gap in performance between Datasets and DataFrames, so that users do not have to sacrifice the benefits of Datasets for performance reasons.

How Spark Enables the Internet of Things- Paula Ta-Shma

Spark Summit

A Production Quality Sketching Library for the Analysis of Big Data

Databricks

Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah

Databricks

Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations. Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.

Extracting Insights from Data at Twitter

Prasad Wagle

The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...

Spark Summit

Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar

Databricks

This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies. Sony PlayStation’s monitoring pipeline processes about 40 billion events every day, and generates metrics in near real-time (within 30 seconds). All the components, used along with Apache Spark, are horizontally scalable using any auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear about some important lessons they have learned. For example, they still use Spark Streaming’s receiver-based method in certain use cases instead of Direct Streaming, and will share the application of both the methods, giving the knowledge back to the community.

Netflix: Wachstumsstrategie zeigt Wirkung

Stefan Böhm

OSCON 2015

Charles Smith

What's hot

DATA @ NFLX (Tableau Conference 2014 Presentation)

Blake Irvine

Big Data Meets Learning Science: Keynote by Al Essa

Spark Summit

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

Spark Summit

Realtime streaming architecture in INFINARIO

Jozo Kovac

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Spark Summit

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

Databricks

No sql and sql - open analytics summitOpen Analytics

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Databricks

Using Hadoop to build a Data Quality Service for both real-time and batch data

DataWorks Summit/Hadoop Summit

Data Warehousing with Spark Streaming at Zalando

Databricks

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

Spark Summit

Disrupting Big Data with Apache Spark in the Cloud

Jen Aman

Shifting Data Science into High Gear

Spark Summit

Bridging the Gap Between Datasets and DataFrames

Databricks

How Spark Enables the Internet of Things- Paula Ta-Shma

Spark Summit

A Production Quality Sketching Library for the Analysis of Big Data

Databricks

Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah

Databricks

Extracting Insights from Data at Twitter

Prasad Wagle

The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...

Spark Summit

Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar

Databricks

What's hot (20)

DATA @ NFLX (Tableau Conference 2014 Presentation)

Big Data Meets Learning Science: Keynote by Al Essa

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

Realtime streaming architecture in INFINARIO

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

No sql and sql - open analytics summit

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Using Hadoop to build a Data Quality Service for both real-time and batch data

Data Warehousing with Spark Streaming at Zalando

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

Disrupting Big Data with Apache Spark in the Cloud

Shifting Data Science into High Gear

Bridging the Gap Between Datasets and DataFrames

How Spark Enables the Internet of Things- Paula Ta-Shma

A Production Quality Sketching Library for the Analysis of Big Data

Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah

Extracting Insights from Data at Twitter

The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...

Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar

Viewers also liked

Netflix: Wachstumsstrategie zeigt Wirkung

Stefan Böhm

OSCON 2015

Charles Smith

Orgenealegna301

Data Governance - Atlas 7.12.2015

Hortonworks

Effective data governance is imperative to the success of Data Lake initiatives. Without governance policies and processes, information discovery and analysis is severely impaired. In this session we will provide an in-depth look into the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of Data Governance Initiatives and demonstrate key governance capabilities of the Hortonworks Data Platform.

(BDT303) Running Spark and Presto on the Netflix Big Data Platform

Amazon Web Services

In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.

Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...

DataWorks Summit/Hadoop Summit

Viewers also liked (6)

Netflix: Wachstumsstrategie zeigt Wirkung

OSCON 2015

Orgene

Data Governance - Atlas 7.12.2015

(BDT303) Running Spark and Presto on the Netflix Big Data Platform

Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...

Similar to Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy

Rohit Kulkarni

Finding the needles in the haystack. An Overview of Analyzing Big Data with H...

Chris Baglieri

Ncku csie talk about Spark

Giivee The

State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...

Big Data Spain

http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks. Session presented at Big Data Spain 2014 Conference 18th Nov 2014 Kinépolis Madrid http://www.bigdataspain.org Event promoted by: http://www.paradigmatecnologico.com Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014

Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014

Dataiku

How Concur uses Big Data to get you to Tableau Conference On Time

Denny Lee

Big Data - HDInsight and Power BI

Prasad Prabhu (PP)

Big Data is one of the hot topics and has got the attention of the IT industry globally. It is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk. This presentation focuses on why, what, how of big data as we explore some of Microsoft's big data solutions - HDInsight azure service and PowerBI, providing insights into the world of Big data.

Why apache Flink is the 4G of Big Data Analytics Frameworks

Slim Baltagi

Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases. Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table). At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly). In this talk, you will learn in more details about: What is Apache Flink, how it fits into the Big Data ecosystem and why it is the 4G (4th Generation) of Big Data Analytics frameworks? How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment? Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? What are the benchmarking results between Apache Flink and those other Big Data analytics frameworks?

HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...

Chetan Khatri

Data Infrastructure for a World of Music

Lars Albertsson

The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience, and to strengthen their busineess. Most of the talk will be spent on our data processing architecture, and how we leverage state of the art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Last, I'll present observations and thoughts on innovation in the data processing aka Big Data field.

Lipstick On Pig

bigdatagurus_meetup

Netflix - Pig with Lipstick by Jeff Magnusson

Hakka Labs

In this talk Manager of Data Platform Architecture Jeff Magnusson from Netflix discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D. While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases around using Lipstick at Netflix (e.g. examples of using Lipstick to improve speed of development and efficiency of new and existing scripts). Jeff manages the Data Platform Architecture group at Netflix where he is helping to build a service oriented architecture that enables easy access to large scale cloud based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida focusing on database system implementation.

A general introduction to Spring Data / Neo4J

Florent Biville

Architecting the Future of Big Data and Search

Hortonworks

Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production

Codemotion

What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Thread Detection, Datawarehouse optimization, Marketing Efficiency, Biometric Database are some examples exposed during this presentation.

Hadoop basics

Antonio Silveira

Hadoop with Python

Donald Miner


Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

1. Watching Pigs Fly with the Netflix Hadoop Toolkit Hadoop Summit 2013 San Jose, CA

2. Data should be accessible, easy to discover, and easy to process for everyone. Our Motivation

3. Our Users Analysts Engineers

4. Hadoop Platform as a Service

5. Hadoop Platform as a Service S3

6. Hadoop Platform as a Service Data Platform

7. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization) Forklift (Data Movement) Looper (Backloading) Ignite (A/B Test Analytics) Spock (Data Auditing) Genie (Hadoop PaaS) Lipstick (Pig Workflow Visualization) Event Service (Orchestration) Hadoop S3 Other Processing

8. Let’s solve a problem using the data!

9. Build a recommender.

10. But, what makes good recommendations? Similarity Personalization

11. COLORS!

12. COLORS! Box art is colorful…

13. We’re Sorry COLORS! Box art is colorful…

14. Where can I find the data?

15. Hadoop Platform as a Service S3

16. Hadoop Platform as a Service S3 Cassandra Teradata Redshift RDS

17. Data Platform as a Service Franklin (Metadata API) S3 Cassandra Teradata Redshift RDS

18. Data Platform as a Service Franklin (Metadata API)

19. Create a dataset for box art and color.

20. Whether your dataset is large or small, being able to visualize it makes it easier to explain.

21. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)

22. Sting • Allows users to cache the results of a Genie job in memory • Sub-second response to OLAP-style operations (slicing, dicing, aggregations). • Adhoc / recurring schedule • Easy to use!

23. Hive Query Schema

24. % Content Consumed / Hour

25. Hemlock Grove House of Cards Arrested Development

26. Similarity

27.

28.

29. House of Cards Macbeth

30.

31.

32. Toddlers & Tiaras Star Trek: Voyager

33. Personalization

34. # of subscribers X # of titles = ???,000,…,000 (big data) Big Data

35. Netflix Apache Pig

36.

37. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)

38. Lipstick • Allows users to visualize their data flow • Allows users to see common errors • Allows users to easily monitor their jobs • Empowers users to support themselves • Facilitates communication between infrastructure team and users

39. Lipstick

40. Overall Job Progress

41. Logical Plan Overall Job Progress

42. Logical Operator (reduce side) Logical Operator (map side) Map/Reduce Job Intermediate Row Count Records Loaded

43. Hadoop Counters

44. My Job has stalled. Common Problem #1

45.

46. Unoptimized/Optimized Logical Plan Toggle Dangling Operator

47. I didn’t get the data I was expecting Common Problem #2

48.

49.

50. I don’t understand why my job failed. Common Problem #3

51. Failed Job (light red background) Successful Job (light blue background)

52.

53. Wrapping up • Demos at the Netflix booth in the exhibit hall (see more Lipstick, Sting, and Genie). • Lipstick is part of Netflix OSS. • Clone it on GitHub at http://github.com/Netflix/Lipstick • We welcome feedback and contributions!

54. • Charles Smith: charsmith@netflix.com • Jeff Magnusson: jmagnusson@netflix.com Thank you! Jobs: http://jobs.netflix.com Netflix OSS: http://netflix.github.io Tech Blog: http://techblog.netflix.com/

Editor's Notes

We want to talk today about parts of our big data architecture. … We would like to talk about what we are doing to make the data more accessible to the users of the platform.
Like a lot of other companies, we are experiencing an explosion of data, which is good since we are a data-driven company. But if the volume of data makes it harder to find what is useful or makes it harder to process, the value of our data decreases. Alternatively, if we decide to only consume data that was useful in the past, we won’t continue to find new ways to provide value to our customers. Our goal as a team is to make data available so that anyone at Netflix can use it for interesting new work. We all know data is being created faster than ever before. For Netflix, besides the obvious things that grow over time, like what people are watching, what they are rating, and what they comment on, we have a whole range of additional data: interactions with our websites, interactions with devices, and things like social media, and we have done a lot of interesting work with that data. Even so, the fact of the matter is that we aren’t quite sure what data is going to be useful in the future. So since storage is cheap, we can err on the side of collecting more data than we may ever be able to utilize. And a lot of work has been done on processing that data, but these tools are all relatively new and often require a lot of engineering knowledge to realize the full value of the platform. So the problem is that we have a large volume of data and a large group of smart people that could use that data to help the company. But if they don’t know about or can’t find the data that is available, or if it is hard to process the data, then it will be a long time before we realize the value. ----- Meeting Notes (6/12/13 18:11) ----- But this isn't a problem that is specific to Pig. While we've spent a lot of time building systems that can process vast quantities of data, as with all new technologies they tend to only be initially accessible to a group of people in the know, most likely the engineers that built the system. We don't want to be gatekeepers of the data.
The way that we are going to get the most value out of our data is to have a broader audience. We've found that it's ubiquitous across all facets of the Hadoop user experience. While Hadoop has made it possible to process enormous quantities of data, tooling hasn't progressed to the point of making possible easy….
S3 is a big place
So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
Jeff and I fall solidly on the engineering side of the spectrum, and as such the technology that goes into our platform is always interesting. But at the end of the day our tools are only truly useful if they allow more effective use of our data. So we thought that to talk about our architecture it makes a lot more sense if you approach the problem as a user that just wants to use the data.
Look, Netflix does a lot of things with our data to support the business. But at the end of the day we want to connect our customers with the movies and shows they love. So we thought, what better way to talk about Netflix’s data than to talk a little about building a recommendation system using pieces of our platform. So we are going to have something of a mini-Hack Day if you will. ----- Meeting Notes (6/17/13 20:59) ----- Connecting users with movies they love.
So very quickly let’s talk about how we will build the recommender. There are two types of recommendations that Netflix usually gives you. One is similarity. Similarity can be thought of as a measure of distance between two movies, where the closer two movies are, the more similar they are. The other is personalization. Personalization takes a lot of different forms and is often very complicated, but one way to think of personalization is as a distance between a person and movies, where the closer a movie is to a person, the more likely it is that he or she will like the movie. So what we want to do is come up with a vector space in which we can calculate distance between movies. And once we have done that we will try to project our customers into that space so we can measure distance between customers and movies.
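The two distances described in this note can be sketched as follows. This is only an illustration: the title names, the three-dimensional vectors, and the choice to place a customer at the mean of the titles they watched are all assumptions, not the method actually used at Netflix.

```python
import math

# Hypothetical three-dimensional feature vectors for a few titles.
TITLE_VECTORS = {
    "title_a": (0.9, 0.1, 0.0),
    "title_b": (0.8, 0.2, 0.0),
    "title_c": (0.0, 0.1, 0.9),
}


def similarity_distance(title_1, title_2):
    """Distance between two titles: smaller means more similar."""
    return math.dist(TITLE_VECTORS[title_1], TITLE_VECTORS[title_2])


def project_customer(watched):
    """Place a customer in title space at the mean of the titles watched.

    Averaging is an assumption made for illustration only.
    """
    vectors = [TITLE_VECTORS[t] for t in watched]
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def recommend(watched):
    """Rank unwatched titles by distance from the customer's position."""
    customer = project_customer(watched)
    candidates = [t for t in TITLE_VECTORS if t not in watched]
    return sorted(candidates, key=lambda t: math.dist(customer, TITLE_VECTORS[t]))
```

With these made-up vectors, a customer who watched only `title_a` would be ranked closer to `title_b` than to `title_c`.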
S3 is a big place
Abstraction between name of data and location. Location of datasets can change over time…
It turns out that we didn’t yet have a dataset in Franklin with the box art, but we did have lists of titles that I could use to make sense of the box art images. So I needed to create one. What I decided to do was convert that into a new dataset that I could use. To do that I downloaded box art for each title and converted it to websafe colors. I did this so that rather than having a hundred different pixels of slightly different colors of orange, I would have three. The 216 websafe colors are a much easier space to work in.
After I created the dataset what I really wanted to do was look at how different titles compare to each other. Now I can do this myself and create some sample graphs, but what would be a lot more useful is if I could share the data with the other people working with me and they could easily explore it so they can have an idea of what I am doing.
We found that it was a common need for our users to visualize our large datasets. So we created a lightweight visualization tool named Sting that makes it easy to explore and socialize the results of Hive queries around the organization. ----- Meeting Notes (6/17/13 19:58) ----- lightweight data viz framework
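A minimal sketch of the websafe conversion described here, assuming the image has already been decoded into (r, g, b) tuples with channels from 0 to 255. The notes do not say how the box art was read or processed, so the function names and sample pixels are hypothetical. The only fact relied on is that the 216 websafe colors use six values per channel, spaced 51 (0x33) apart.

```python
from collections import Counter

WEBSAFE_STEP = 51  # 0x33: web-safe channels are 0, 51, 102, 153, 204, 255


def to_websafe(pixel):
    """Snap an (r, g, b) pixel with 0-255 channels to the nearest web-safe color."""
    return tuple(round(channel / WEBSAFE_STEP) * WEBSAFE_STEP for channel in pixel)


def color_histogram(pixels):
    """Count how many pixels fall on each web-safe color."""
    return Counter(to_websafe(p) for p in pixels)


# Hypothetical pixels: three slightly different oranges and one near-black.
sample_pixels = [(250, 160, 10), (255, 150, 0), (245, 155, 5), (10, 5, 0)]
histogram = color_histogram(sample_pixels)
```

In this sample the three oranges collapse onto a single websafe color, which is the reduction in variety the note is after.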
Insert more real screen shot here…
What we are looking at here is Sting filtered on three titles. Each bar is the stacked histogram of the title. So you can see that Hemlock Grove is about 40% black and then it has mostly gray and some shades of red. House of Cards is mostly black and gray with some blues and reds, and Arrested Development looks mostly orange. And after a bit of playing around and comparing colors, it seemed, though not perfect, that I could do a straight distance calculation in this vector space and get decent results.
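The "straight distance calculation" over color histograms might look like the sketch below. It assumes Euclidean distance over per-color fractions of the image, which the notes do not confirm. The palette, title names, and pixel counts are made up and are not measurements of any real box art.

```python
import math


def normalize(counts, palette):
    """Turn raw pixel counts into fractions of the image, ordered by palette."""
    total = sum(counts.get(color, 0) for color in palette)
    return [counts.get(color, 0) / total for color in palette]


def closest_title(target, histograms, palette):
    """Return the other title whose color mix is nearest to the target's."""
    target_vector = normalize(histograms[target], palette)
    others = (t for t in histograms if t != target)
    return min(
        others,
        key=lambda t: math.dist(target_vector, normalize(histograms[t], palette)),
    )


# Made-up pixel counts per color; not measurements of any real box art.
PALETTE = ["black", "gray", "red", "orange"]
HISTOGRAMS = {
    "dark_drama": {"black": 40, "gray": 35, "red": 25},
    "dark_thriller": {"black": 90, "gray": 80, "red": 30},
    "bright_comedy": {"orange": 70, "gray": 30},
}
```

Normalizing first means two images of different sizes can still be compared, since only the proportions of each color matter.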
So let’s look at how it worked out.
Here you can see House of Cards is a mix of blacks and greys, as I pointed out, and there is some red in there (blood on the hands, although you probably can’t see it).
And its closest title is already a winner. Visually we can see similar colors. And for those of you with knowledge of both titles, you probably think this is so good that I am cheating.
But looking at the titles in Sting we can see visually that what our system is telling us looks right. We would expect these titles to be close.
One of the more polarizing Star Treks, so it has a bunch of purple and various reds and blues and black.
At Netflix, we make heavy use of both Pig and Hive. Hive is typically used for adhoc analysis, while Pig is used in scheduled workflows.
The scripts can be very complicated, compiling to many map/reduce steps and performing complex data transformations along the way. We’ve been happy with our choice of Pig in that it provides an abstraction to easily express complicated map/reduce logic along with some facilities for code reuse (UDFs, macros). When workflows get sufficiently complicated, however, Pig is almost so abstract that it becomes hard to follow the data flow logic and imagine how it will translate to map/reduce.
So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
Some key features….
