Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)


Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.

From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in-memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS; clone it on GitHub, or learn more from our tech blog post:

Published in: Technology, Business

  • We want to talk today about parts of our big data architecture. …………. We would like to talk about what we are doing to make the data more accessible to the users of the platform.
  • Like a lot of other companies, we are experiencing an explosion of data. That is good, since we are a data-driven company, but if the volume of data makes it harder to find what is useful or harder to process, the value of our data decreases. Alternatively, if we decide to consume only data that was useful in the past, we won’t continue to find new ways to provide value to our customers. Our goal as a team is to make data available so that anyone at Netflix can use it for interesting new work.
    We all know data is being created faster than ever before. For Netflix, besides the obvious things that grow over time, like what people are watching, what they are rating, and what they comment on, we have a whole range of additional data: interactions with our websites, interactions with devices, and things like social media. We have done a lot of interesting work with that data. Even so, the fact of the matter is that we aren’t quite sure what data is going to be useful in the future. So, since storage is cheap, we can err on the side of collecting more data than we may ever be able to utilize. A lot of work has been done on processing that data, but these tools are all relatively new and often require a lot of engineering knowledge to realize the full value of the platform.
    So the problem is that we have a large volume of data and a large group of smart people who could use that data to help the company. But if they don’t know about or can’t find the data that is available, or if it is hard to process the data, then it will be a long time before we realize the value.
    ----- Meeting Notes (6/12/13 18:11) -----
    But this isn't a problem that is specific to Pig. While we've spent a lot of time building systems that can process vast quantities of data, as with all new technologies they tend to be accessible initially only to a group of people in the know, most likely the engineers who built the system. We don't want to be gatekeepers of the data.
    The way that we are going to get the most value out of our data is to have a broader audience. We've found that this problem is ubiquitous across all facets of the Hadoop user experience. While Hadoop has made it possible to process enormous quantities of data, tooling hasn't progressed to the point of making possible easy….
  • S3 is a big place
  • So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
  • Jeff and I fall solidly on the engineering side of the spectrum, and as such the technology that goes into our platform is always interesting to us. But at the end of the day our tools are only truly useful if they allow more effective use of our data. So we thought that, to talk about our architecture, it makes a lot more sense to approach the problem as a user who just wants to use the data.
  • Look, Netflix does a lot of things with our data to support the business. But at the end of the day we want to connect our customers with the movies and shows they love. So we thought, what better way to talk about Netflix’s data than to talk a little about building a recommendation system using pieces of our platform. So we are going to have something of a mini-Hack Day, if you will.
    ----- Meeting Notes (6/17/13 20:59) -----
    Connecting users with movies they love.
  • So very quickly, let’s talk about how we will build the recommender. There are two types of recommendations that Netflix usually gives you. One is similarity. Similarity can be thought of as a measure of distance between two movies, where the closer two movies are, the more similar they are. The other is personalization. Personalization takes a lot of different forms and is often very complicated, but one way to think of personalization is as a distance between a person and movies, where the closer a movie is to a person, the more likely it is that he or she will like the movie. So what we want to do is come up with a vector space in which we can calculate distance between movies. Once we have done that, we will try to project our customers into that space so we can measure distance between customers and movies.
  • S3 is a big place
  • Abstraction between name of data and location. Location of datasets can change over time…
  • It turns out that we didn’t yet have a dataset in Franklin with the box art, but we did have lists of titles that I could use to make sense of the box art images. So I needed to create one. What I decided to do was convert the box art into a new dataset that I could use. To do that, I downloaded box art for each title and converted it to websafe colors. I did this so that rather than having a hundred different pixels of slightly different shades of orange, I would have three. The set of 216 websafe colors is a much easier space to work in.
  • After I created the dataset, what I really wanted to do was look at how different titles compare to each other. Now, I can do this myself and create some sample graphs, but it would be a lot more useful if I could share the data with the other people working with me and they could easily explore it, so they have an idea of what I am doing.
  • We found that it was a common need for our users to visualize our large datasets. So we created a lightweight visualization tool named Sting that makes it easy to explore and socialize the results of Hive queries around the organization.
    ----- Meeting Notes (6/17/13 19:58) -----
    lightweight data viz framework
  • Insert more real screen shot here…
  • What we are looking at here is Sting filtered on three titles. Each bar is the stacked color histogram of one title. So you can see that Hemlock Grove is about 40% black, and then it has mostly gray and some shades of red. House of Cards is mostly black and gray with some blues and reds, and Arrested Development looks mostly orange. After a bit of playing around and comparing colors, it seemed that, though not perfect, I could do a straight distance calculation in this vector space and get decent results.
  • So let’s look at how it worked out.
  • Here you can see House of Cards is a mix of blacks and greys, like I pointed out, and there is some red in there (blood on the hands, although you probably can’t see it).
  • And its closest title is already a winner. Visually we can see similar colors. And for those of you with knowledge of both titles, you probably think this is so good that I am cheating.
  • But looking at the titles in Sting we can see visually that what our system is telling us looks right. We would expect these titles to be close.
  • One of the more polarizing Star Treks, so it has a bunch of purple and various reds and blues and black.
  • At Netflix, we make heavy use of both Pig and Hive. Hive is typically used for ad hoc analysis, while Pig is used in scheduled workflows.
  • The scripts can be very complicated, compiling to many map/reduce steps and performing complex data transformations along the way. We’ve been happy with our choice of Pig in that it provides an abstraction to easily express complicated map/reduce logic, along with some facilities for code reuse (UDFs, macros). When workflows get sufficiently complicated, however, Pig is so abstract that it becomes hard to follow the data flow logic and imagine how it will translate to map/reduce.
  • So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
  • Some key features….
  • Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
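
The box-art recommender described in the notes above (snap pixels to the 216 websafe colors, build a color histogram per title, then take a straight distance between histograms) can be sketched roughly as follows. This is a minimal illustration, not the code used in the talk: the function names, the title names, and the pixel data are all made up, Euclidean distance is assumed for the "straight distance calculation", and projecting a customer as the mean of the histograms of the titles they watched is only one possible assumption, since the notes do not say how the projection was done.

```python
import math
from collections import Counter

# Six levels per channel give 6 ** 3 = 216 websafe colors.
WEBSAFE_LEVELS = (0, 51, 102, 153, 204, 255)


def to_websafe(pixel):
    """Snap an (r, g, b) pixel to the nearest websafe color."""
    return tuple(min(WEBSAFE_LEVELS, key=lambda level: abs(level - c)) for c in pixel)


def color_histogram(pixels):
    """Map each websafe color to the fraction of pixels that snapped to it."""
    counts = Counter(to_websafe(p) for p in pixels)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def distance(hist_a, hist_b):
    """Euclidean distance between two sparse histograms (assumed metric)."""
    colors = set(hist_a) | set(hist_b)
    return math.sqrt(sum((hist_a.get(c, 0.0) - hist_b.get(c, 0.0)) ** 2 for c in colors))


def closest_title(title, histograms):
    """Similarity: the other title whose histogram is nearest to this one."""
    others = (t for t in histograms if t != title)
    return min(others, key=lambda t: distance(histograms[title], histograms[t]))


def project_customer(watched, histograms):
    """Personalization (assumption): a customer is the mean of watched titles.

    `watched` must be a non-empty list of titles present in `histograms`.
    """
    colors = set().union(*(histograms[t] for t in watched))
    return {c: sum(histograms[t].get(c, 0.0) for t in watched) / len(watched) for c in colors}


def recommend(watched, histograms):
    """Return the unwatched title nearest to the customer's projected position."""
    profile = project_customer(watched, histograms)
    candidates = (t for t in histograms if t not in watched)
    return min(candidates, key=lambda t: distance(profile, histograms[t]))


# Hypothetical box art: ten made-up pixels per made-up title.
box_art = {
    "dark_drama": [(10, 10, 10)] * 6 + [(120, 120, 120)] * 3 + [(200, 20, 20)],
    "dark_tragedy": [(5, 5, 5)] * 5 + [(110, 110, 110)] * 4 + [(210, 10, 10)],
    "orange_comedy": [(250, 150, 10)] * 8 + [(255, 255, 255)] * 2,
}
histograms = {title: color_histogram(pixels) for title, pixels in box_art.items()}

print(closest_title("dark_drama", histograms))
print(recommend(["dark_drama"], histograms))
```

With this toy data both calls pick `dark_tragedy`, because its histogram shares the black, gray, and red bins with `dark_drama`. Sparse dictionaries are used so that only the colors actually present in a piece of box art take up space, rather than all 216 bins.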

    1. Watching Pigs Fly with the Netflix Hadoop Toolkit Hadoop Summit 2013 San Jose, CA
    2. Data should be accessible, easy to discover, and easy to process for everyone. Our Motivation
    3. Our Users Analysts Engineers
    4. Hadoop Platform as a Service
    5. Hadoop Platform as a Service S3
    6. Hadoop Platform as a Service Data Platform
    7. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization) Forklift (Data Movement) Looper (Backloading) Ignite (A/B Test Analytics) Spock (Data Auditing) Genie (Hadoop PaaS) Lipstick (Pig Workflow Visualization) Event Service (Orchestration) Hadoop S3 Other Processing
    8. Let’s solve a problem using the data!
    9. Build a recommender.
    10. But, what makes good recommendations? Similarity Personalization
    11. COLORS!
    12. COLORS! Box art is colorful…
    13. We’re Sorry COLORS! Box art is colorful…
    14. Where can I find the data?
    15. Hadoop Platform as a Service S3
    16. Hadoop Platform as a Service S3 Cassandra Teradata Redshift RDS
    17. Data Platform as a Service Franklin (Metadata API) S3 Cassandra Teradata Redshift RDS
    18. Data Platform as a Service Franklin (Metadata API)
    19. Create a dataset for box art and color.
    20. Whether your dataset is large or small, being able to visualize it makes it easier to explain.
    21. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)
    22. Sting • Allows users to cache the results of a Genie job in memory • Sub-second response to OLAP-style operations (slicing, dicing, aggregations). • Adhoc / recurring schedule • Easy to use!
    23. Hive Query Schema
    24. % Content Consumed / Hour
    25. Hemlock Grove House of Cards Arrested Development
    26. Similarity
    27. House of Cards Macbeth
    28. Toddlers & Tiaras Star Trek: Voyager
    29. Personalization
    30. # of subscribers X # of titles = ???,000,…,000 (big data) Big Data
    31. Netflix Apache Pig
    32. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)
    33. Lipstick • Allows users to visualize their data flow • Allows users to see common errors • Allows users to easily monitor their jobs • Empowers users to support themselves • Facilitates communication between infrastructure team and users
    34. Lipstick
    35. Overall Job Progress
    36. Logical Plan Overall Job Progress
    37. Logical Operator (reduce side) Logical Operator (map side) Map/Reduce Job Intermediate Row Count Records Loaded
    38. Hadoop Counters
    39. My Job has stalled. Common Problem #1
    40. Unoptimized/Optimized Logical Plan Toggle Dangling Operator
    41. I didn’t get the data I was expecting Common Problem #2
    42. I don’t understand why my job failed. Common Problem #3
    43. Failed Job (light red background) Successful Job (light blue background)
    44. Wrapping up • Demos at the Netflix booth in the exhibit hall (see more Lipstick, Sting, and Genie). • Lipstick is part of Netflix OSS. • Clone it on github at • We welcome feedback and contributions!
    45. Charles Smith: Jeff Magnusson: Thank you! Jobs: Netflix OSS: Tech Blog: