Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

Director of Data Platform
Jun. 27, 2013


Editor's Notes

  1. We want to talk today about parts of our big data architecture. … We would like to talk about what we are doing to make the data more accessible to the users of the platform.
  2. Like a lot of other companies we are experiencing an explosion of data. That is good, since we are a data-driven company, but if the volume of data makes it harder to find what is useful or makes it harder to process, the value of our data decreases. Alternatively, if we decide to only consume data that was useful in the past, we won’t continue to find new ways to provide value to our customers. Our goal as a team is to make data available so that anyone at Netflix can use it for interesting new work. We all know data is being created faster than ever before. For Netflix, besides the obvious things that grow over time, like what people are watching, what they are rating, and what they comment on, we have a whole range of additional data: interactions with our websites, interactions with devices, and things like social media. We have done a lot of interesting work with that data. Even so, the fact of the matter is that we aren’t quite sure what data is going to be useful in the future. So since storage is cheap, we can err on the side of collecting more data than we may ever be able to utilize. A lot of work has been done on processing that data, but these tools are all relatively new and often require a lot of engineering knowledge to realize the full value of the platform. So the problem is that we have a large volume of data and a large group of smart people that could use that data to help the company. But if they don’t know about or can’t find the data that is available, or if it is hard to process the data, then it will be a long time before we realize the value. ----- Meeting Notes (6/12/13 18:11) ----- But this isn't a problem that is specific to Pig. While we've spent a lot of time building systems that can process vast quantities of data, as with all new technologies they tend to be accessible initially only to a group of people in the know, most likely the engineers that built the system. We don't want to be gatekeepers of the data.
The way that we are going to get the most value out of our data is to have a broader audience. We've found that this problem is ubiquitous across all facets of the Hadoop user experience. While Hadoop has made it possible to process enormous quantities of data, tooling hasn't progressed to the point of making possible easy….
  3. S3 is a big place
  4. So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
  5. Jeff and I fall solidly on the engineering side of the spectrum, and as such the technology that goes into our platform is always interesting. But at the end of the day our tools are only truly useful if they allow more effective use of our data. So we thought that, to talk about our architecture, it makes a lot more sense to approach the problem as a user who just wants to use the data.
  6. Look, Netflix does a lot of things with our data to support the business. But at the end of the day we want to connect our customers with the movies and shows they love. So we thought, what better way to talk about Netflix’s data than to talk a little about building a recommendation system using pieces of our platform. So we are going to have something of a mini-Hack Day if you will. ----- Meeting Notes (6/17/13 20:59) ----- Connecting users with movies they love.
  7. So very quickly let’s talk about how we will build the recommender. There are two types of recommendations that Netflix usually gives you. One is similarity. Similarity can be thought of as a measure of distance between two movies, where the closer two movies are, the more similar they are. The other is personalization. Personalization takes a lot of different forms and is often very complicated, but one way to think of personalization is as a distance between a person and movies, where the closer a movie is to a person, the more likely it is that he or she will like the movie. So what we want to do is come up with a vector space in which we can calculate distance between movies. And once we have done that we will try to project our customers into that space so we can measure distance between customers and movies.
  8. S3 is a big place
  9. Abstraction between name of data and location. Location of datasets can change over time…
  10. Abstraction between name of data and location. Location of datasets can change over time…
  11. It turns out that we didn’t yet have a dataset in Franklin with the box art, but we did have lists of titles that I could use to make sense of the box art images. So I needed to create one. What I decided to do was convert that into a new dataset that I could use. To do that I downloaded box art for each title and converted it to websafe colors. I did this so that rather than having a hundred different pixels of slightly different shades of orange, I would have three. The 216 websafe colors are a much easier space to work in.
  12. After I created the dataset, what I really wanted to do was look at how different titles compare to each other. Now I can do this myself and create some sample graphs, but what would be a lot more useful is if I could share the data with the other people working with me and they could easily explore it, so they can get an idea of what I am doing.
  13. We found that it was a common need for our users to visualize our large datasets. So we created a lightweight visualization tool named Sting that makes it easy to explore and socialize the results of Hive queries around the organization. ----- Meeting Notes (6/17/13 19:58) ----- lightweight data viz framework
  14. Insert more real screen shot here…
  15. What we are looking at here is Sting filtered on three titles. Each bar is the stacked histogram of the title. So you can see that Hemlock Grove is about 40% black and then it has mostly gray and some shades of red. House of Cards is mostly black and gray with some blues and reds, and Arrested Development looks mostly orange. And after a bit of playing around and comparing colors, it seemed that, though not perfect, I could do a straight distance calculation in this vector space and get decent results.
  16. So let’s look at how it worked out.
  17. Here you can see House of Cards is a mix of blacks and grays, like I pointed out, and there is some red in there (blood on the hands, although you probably can’t see it).
  18. And its closest title is already a winner. Visually we can see similar colors. And for those of you with knowledge of both titles, you probably think this is so good that I am cheating.
  19. But looking at the titles in Sting we can see visually that what our system is telling us looks right. We would expect these titles to be close.
  20. One of the more polarizing Star Treks, so it has a bunch of purple and various reds and blues and black.
  21. At Netflix, we make heavy use of both Pig and Hive. Hive is typically used for ad hoc analysis, while Pig is used in scheduled workflows.
  22. The scripts can be very complicated, compiling to many map/reduce steps and performing complex data transformations along the way. We’ve been happy with our choice of Pig in that it provides an abstraction to easily express complicated map/reduce logic along with some facilities for code reuse (UDFs, macros). When workflows get sufficiently complicated, however, Pig is so abstract that it becomes hard to follow the data flow logic and imagine how it will translate to map/reduce.
  23. So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
  24. Some key features….
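
The box-art recommender described in notes 7, 11, and 15 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the talk: every function name here (`to_websafe`, `color_histogram`, `distance`, `most_similar`, `project_customer`, `recommend`) is hypothetical, box art is assumed to be already decoded into lists of (r, g, b) pixels, Euclidean distance stands in for the "straight distance calculation", and averaging the histograms of watched titles is only one possible way to project a customer into the same space.

```python
from collections import Counter, defaultdict
from math import sqrt

WEBSAFE_STEP = 51  # web-safe channels take the values 0, 51, 102, 153, 204, 255


def to_websafe(pixel):
    """Snap an (r, g, b) pixel to the nearest of the 216 web-safe colors."""
    return tuple(round(c / WEBSAFE_STEP) * WEBSAFE_STEP for c in pixel)


def color_histogram(pixels):
    """Return {websafe_color: fraction of pixels}: one title's stacked bar."""
    counts = Counter(to_websafe(p) for p in pixels)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def most_similar(title, histograms):
    """Similarity: the other title whose histogram is closest to `title`'s."""
    others = (t for t in histograms if t != title)
    return min(others, key=lambda t: distance(histograms[title], histograms[t]))


def project_customer(watched, histograms):
    """Assumption: place a customer at the mean histogram of watched titles."""
    profile = defaultdict(float)
    for title in watched:
        for color, fraction in histograms[title].items():
            profile[color] += fraction / len(watched)
    return dict(profile)


def recommend(watched, histograms):
    """Personalization: the unwatched title closest to the customer's profile."""
    profile = project_customer(watched, histograms)
    candidates = (t for t in histograms if t not in watched)
    return min(candidates, key=lambda t: distance(profile, histograms[t]))


# Made-up pixel data for three placeholder titles (not real box art).
histograms = {
    "dark_drama": color_histogram([(10, 10, 10)] * 3 + [(200, 0, 0)]),
    "dark_thriller": color_histogram([(5, 5, 5)] * 2 + [(190, 10, 10)] * 2),
    "orange_comedy": color_histogram([(250, 130, 20)] * 4),
}

print(most_similar("dark_drama", histograms))
print(recommend(["dark_drama"], histograms))
```

Storing each histogram as a sparse dict keeps the 216-dimensional vectors small, since most box art uses only a handful of web-safe colors after quantization.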