netflix-real-time-data-strata-talk

Transcript

  • 1. Real-Time Data Insights In Netflix. Danny Yuan (@g9yuayon), Jae Bae.
  • 2-5. Who Am I? Member of Netflix's Platform Engineering team, working on large-scale data infrastructure (@g9yuayon). Built and operated Netflix's cloud crypto service, which manages pretty much all the keys Netflix uses in the cloud; that translates to billions of requests per day. Worked with Jae Bae on querying multi-dimensional data in real time.
  • 6-8. Use Cases: Real-time Operational Metrics; Business or Product Insights. We're going to discuss two types of use cases today: real-time operational metrics, and business or product insights. By the way, who would have guessed that Canadians' number one search query would be 90210?
  • 9. What Are Log Events? Example fields: ClientApplication "API", ServerApplication "Cryptex", StatusCode 200, ResponseTime 73. Before we dive into the use cases, let me explain what our log data looks like. Lots of Netflix's log data can be represented as "events", and Netflix applications send hundreds of different types of log events every day. A log event is really just a set of fields; a field has a name and a value, and the value itself can be a string, a number, or a set of fields.
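A quick sketch of that structure in code, using a hypothetical LogEvent class (the slides do not show the real platform classes):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A log event is just a named set of fields; a field value can be a
    // string, a number, or another set of fields (a nested map).
    public final class LogEvent {
        private final Map<String, Object> fields = new LinkedHashMap<>();

        public LogEvent with(String name, Object value) {
            fields.put(name, value);
            return this;
        }

        public Map<String, Object> fields() {
            return fields;
        }
    }

The event on the slide would then be built as new LogEvent().with("ClientApplication", "API").with("ServerApplication", "Cryptex").with("StatusCode", 200).with("ResponseTime", 73).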
  • 10-13. The data pipeline: tens of thousands of servers come and go (server farms); highly reliable collectors collect log events from all servers; destinations are dynamically configurable (Hadoop, Kafka, HTTP endpoints). Inside Netflix, hundreds of applications run on tens of thousands of machines. Machines come and go all the time, but they all generate tons of application log events and send them to highly reliable data collectors. The collectors in turn send the data to various destinations.
  • 14. "Netflix is a log generating company that also happens to stream movies" - Adrian Cockcroft. As Adrian used to say, Netflix is a log generating company that also happens to stream movies. Having a vast amount of logs from different applications means we also have a treasure trove; numerous teams (BI, operations, product development, data science) mine such data all the time. To put this into perspective, let me share some numbers.
  • 15. 1,500,000. During peak hours, our data pipeline collects over 1.5 million log events per second.
  • 16. 70,000,000,000. Or 70 billion a day on average.
  • 17. Making Sense of Billions of Events. Making sense of such a vast amount of information is a continuing challenge for Netflix. After all, most of the time it is not feasible to look at individual log events to get anything useful out of them. We've got to have intelligent ways to digest our data.
  • 18-29. We've Got Tools. Over the past couple of years Netflix has built numerous tools to help us. We have Turbine, a real-time dashboard for application metrics on live machines (it is also open sourced, by the way). We have Atlas, our monitoring solution, which handles millions of application metrics every second. We have CSI, which uses a number of machine learning algorithms to identify correlations and trends in monitored data. We have Biopsys, which searches logs on multiple live servers and streams the results back to a user's browser. We also have Hadoop and Hive, of course; the DSE team has built a number of tools to make Hadoop as easy to use as possible, and we even have DSE Sting, which visualizes the results of Hive queries. And we had a log summarization service that alerts people about the top error-generating services. These are, however, static snapshots of data that we can't easily drill into, and they are usually half an hour late.
  • 30. What Is Missing? Why do we need yet another tool, then? The key question is: what is missing?
  • 31. Interactive Exploration. For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly; some data is only useful in a small time window, after all. Sometimes we want to perform lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we won't be able to iterate fast enough. Either way, we need to get query results back in seconds.
  • 32-33. Getting Results Back in Seconds. Because aggregation is out of the way, we can simply de-dup the error messages and index them in a search engine. So you get the best of both worlds: an instant error report and an instant error search engine.
  • 34-37. Getting Results Back in Seconds (150,000). Here is one example: we process more than 150 thousand events per second about device activities. What if we'd like to know, geographically, how many users started playing videos in the past 5 minutes? I submit my query, and in a few seconds the answer comes back. The globe is divided into a 1600x800 grid; each client activity's coordinate is mapped to a grid cell, and the activities are then counted per cell. But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions, in this particular case to find failed attempts on our Silverlight players that run on PCs and Macs?
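A rough sketch of the grid mapping described in the notes, assuming a simple equirectangular projection onto the 1600x800 grid (illustrative only, not the actual implementation):

    // Map a latitude/longitude pair onto a fixed 1600x800 grid cell,
    // so play events can be counted per cell.
    public final class GeoGrid {
        private static final int WIDTH = 1600;
        private static final int HEIGHT = 800;

        // Returns a single cell index in [0, WIDTH * HEIGHT).
        public static int cellOf(double latitude, double longitude) {
            int x = (int) Math.floor((longitude + 180.0) / 360.0 * WIDTH);
            int y = (int) Math.floor((90.0 - latitude) / 180.0 * HEIGHT);
            // Clamp the edge cases (longitude == 180, latitude == -90).
            x = Math.min(Math.max(x, 0), WIDTH - 1);
            y = Math.min(Math.max(y, 0), HEIGHT - 1);
            return y * WIDTH + x;
        }
    }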
  • 38-41. Querying Data Along Different Dimensions. From the same events, we can get answers to different questions, for example: how many people started viewing House of Cards in the past 6 hours?
  • 42-48. Discover Outstanding Data (HTTP 500). There are three fundamental questions we usually want a large amount of data to answer. The first is finding the outstanding data. For a small number of rows we can read a summary table, but for a large amount of data even the summary table itself can be huge, and lots of the information may be noise. A Top N query really helps here: for example, wouldn't you want to know what happened in the last 10 seconds, say which applications generated most of the errors in the last 5 seconds? Now that's timely feedback. Let me share a more complete example, with hundreds of thousands of requests captured.
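To make the Top N idea concrete, a minimal sketch that counts HTTP 500s per application over a window of events and keeps the n worst offenders (purely illustrative, reusing the hypothetical LogEvent class from above; the real system answers this with a query against the backing store rather than in application code):

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public final class TopN {
        // Count HTTP 500s per server application and return the n worst offenders.
        public static List<Map.Entry<String, Long>> topErrorApps(List<LogEvent> window, int n) {
            Map<String, Long> errorCounts = window.stream()
                .filter(e -> Integer.valueOf(500).equals(e.fields().get("StatusCode")))
                .collect(Collectors.groupingBy(
                    e -> (String) e.fields().get("ServerApplication"),
                    Collectors.counting()));
            return errorCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
        }
    }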
  • 49-52. See Trends Over Time. The second fundamental question is: what are the trends over time? Moreover, how does the trend compare to that of the same data in a different time window? Again, slicing and dicing is very important here, because it helps us narrow down our view.
  • 53-54. See Data Distributions. The third fundamental question is: what is the distribution of my data? The average alone is not enough; sometimes it can even be deceiving. Percentiles paint a more accurate picture.
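A small illustration of why percentiles beat averages, using a naive nearest-rank percentile over a sorted copy (assumes the sample fits in memory; a production system would use streaming estimates):

    import java.util.Arrays;

    public final class Percentiles {
        // Nearest-rank percentile, p in (0, 100].
        public static long percentile(long[] valuesMs, double p) {
            long[] sorted = valuesMs.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[Math.max(rank - 1, 0)];
        }

        public static void main(String[] args) {
            // Nine fast responses and one very slow one: the average (~122 ms)
            // hides the outlier, while the 99th percentile (1000 ms) exposes it.
            long[] responseTimes = {20, 21, 22, 23, 24, 25, 26, 27, 28, 1000};
            System.out.println("p50 = " + percentile(responseTimes, 50));
            System.out.println("p99 = " + percentile(responseTimes, 99));
        }
    }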
  • 55. Technical Challenges. I'd like to share some technical challenges we encountered when integrating Druid.
  • 56-58. Problem: minimizing programming effort. Solution: a homogeneous architecture, and separating producing logs from consuming logs. Even though we instrument code to death, people don't want to write more code just for a nascent tool. Luckily for us, we've got a homogeneous architecture in place, and we've already separated producing logs from consuming logs: applications share a common build and continuous integration environment, an identical deployment base, and a shared platform runtime.
  • 59-60. A Single Data Pipeline: log data flows through a log filter and the collector agent to the log collectors; applications just call LogManager.logEvent(anEvent). Every application shares the same design and the same underlying runtime. The logic of delivering log events is completely hidden away from programmers; all they need to do is construct a log event and hand it to LogManager. (Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/)
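Producing an event is then a fire-and-forget call. A sketch of the producer side, reusing the hypothetical LogEvent class from the earlier sketch (LogManager is the platform class named on the slide; its exact signature is not shown):

    // Construct the event and hand it to the platform. Filtering, batching,
    // and delivery to the collector agent and log collectors all happen
    // behind this single call.
    LogManager.logEvent(new LogEvent()
        .with("ServerApplication", "Cryptex")
        .with("StatusCode", 200)
        .with("ResponseTime", 73));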
  • 61-63. Isolated Log Processing. The log dispatcher fans events out through per-destination log filters and sink plugins to Hadoop, Kafka, Druid, and ElasticSearch. Since producing log events is dead simple, we moved all the processing logic to the backend. We introduced a plugin design that is flexible enough to filter, transform, and dispatch log events to different destinations with high throughput. (Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/)
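A minimal sketch of what such a filter/dispatch chain could look like (the interface and method names are hypothetical, not the actual plugin API):

    import java.util.ArrayList;
    import java.util.List;

    // Decides whether an event should reach a particular destination.
    interface LogFilter {
        boolean accept(LogEvent event);
    }

    // Delivers a batch of events to one destination (Hadoop, Kafka, Druid, ...).
    interface SinkPlugin {
        void dispatch(List<LogEvent> batch);
    }

    // The dispatcher runs every event through a destination's filter before
    // handing the survivors to that destination's sink plugin.
    final class FilteredSink {
        private final LogFilter filter;
        private final SinkPlugin sink;

        FilteredSink(LogFilter filter, SinkPlugin sink) {
            this.filter = filter;
            this.sink = sink;
        }

        void process(List<LogEvent> batch) {
            List<LogEvent> accepted = new ArrayList<>();
            for (LogEvent event : batch) {
                if (filter.accept(event)) {
                    accepted.add(event);
                }
            }
            if (!accepted.isEmpty()) {
                sink.dispatch(accepted);
            }
        }
    }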
  • 64-66. Problem: not all logs are worth processing. Solution: dynamic filtering. Storing and processing log events takes time, requires resources, and ultimately costs money, and lots of events are useful only when they are needed. Therefore, we built this filtering capability into our platform.
  • 67. We created both a fluent API and a corresponding infix mini-language to filter any JavaBean-like object.
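The slide doesn't show the API or the mini-language themselves, so the following is only a guess at their flavor, with invented names, built on plain java.util.function.Predicate over bean-like (map-like) objects for brevity:

    import java.util.Map;
    import java.util.function.Predicate;

    // A tiny stand-in for a fluent filter builder.
    final class Filters {
        static FieldRef field(String name) {
            return new FieldRef(name);
        }

        static final class FieldRef {
            private final String name;

            FieldRef(String name) {
                this.name = name;
            }

            Predicate<Map<String, Object>> eq(Object expected) {
                return obj -> expected.equals(obj.get(name));
            }

            Predicate<Map<String, Object>> gt(long threshold) {
                return obj -> obj.get(name) instanceof Number
                        && ((Number) obj.get(name)).longValue() > threshold;
            }
        }
    }

A fluent filter such as Filters.field("ServerApplication").eq("Cryptex").and(Filters.field("ResponseTime").gt(1000)) would then have an infix counterpart along the lines of ServerApplication == 'Cryptex' and ResponseTime > 1000, parsed into the same predicate at runtime; both the syntax and the names here are invented.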
  • 68-70. Problem: the JSON payload is tedious. Solution: build a parser. It's just inhumane to ask people to write the JSON payload directly. Remember, our goal is to let users get query results back in seconds; it doesn't make sense to ask a user to spend half an hour just constructing a query and another half hour debugging it.
  • 71-72. curl -X POST http://druid -d @data. An added benefit of using a parser up front is that it catches semantic errors early.
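The slides don't show the query mini-language itself; the point is that a user writes a short query, the parser validates it and expands it into the Druid JSON payload, and the payload is POSTed as shown above. An illustrative sketch with an invented query string, an invented data source name, and a schematic (not verbatim) Druid topN payload:

    public final class QueryTranslationExample {
        // Hypothetical user-facing query (invented syntax):
        //   top 5 ServerApplication by count where StatusCode = 500 last 10 minutes
        //
        // The parser can reject unknown field names or malformed clauses before
        // anything is sent to Druid, then emit a payload roughly of this shape:
        static final String DRUID_TOPN_QUERY =
            "{\n"
            + "  \"queryType\": \"topN\",\n"
            + "  \"dataSource\": \"device_events\",\n"
            + "  \"dimension\": \"ServerApplication\",\n"
            + "  \"metric\": \"count\",\n"
            + "  \"threshold\": 5,\n"
            + "  \"granularity\": \"all\",\n"
            + "  \"filter\": {\"type\": \"selector\", \"dimension\": \"StatusCode\", \"value\": \"500\"},\n"
            + "  \"aggregations\": [{\"type\": \"count\", \"name\": \"count\"}],\n"
            + "  \"intervals\": [\"2013-03-01T00:00/2013-03-01T00:10\"]\n"
            + "}";
    }

Saved to a file named data, this is the sort of payload the curl command on the slide posts to the Druid endpoint.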
  • 73-75. Problem: managing data sources can be hairy. Solution: use cell-like deployment. This is a nascent system with quite a few moving parts. We needed to add new data sources, remove data sources, update schemas for data sources, or debug individual data sources. Such operations should be easy and should have minimal impact on a production system.
  • 76. The log data pipeline feeds a separate Kafka queue and Druid cluster per data source. We use a cell-like architecture: each data source has its own persistent queue, its own configuration, and its own indexing cluster. Adding a new data source requires only adding a new set of ASGs, and tuning also becomes isolated.
  • 77. Integrating with Netflix's Infrastructure. Integration with Netflix's infrastructure is essential: we need insights to operate this system, and we need smooth operations.
  • 78. For example, the current deployment handles 380,000 messages per second, or close to 2 TB/hour, during peak time. Without integration with our monitoring system, we wouldn't notice system glitches like the one shown in this chart.
  • 79. On the Netflix side: integrating Kafka with the Netflix cloud; a real-time plug-in on Netflix's data pipeline; user-configurable event filtering.
  • 80. On the Druid side: integration with Netflix's monitoring system (Emitter plus Servo); integration with Netflix's platform library; handling of ZooKeeper session interruptions; tuning the sharding spec for linear scalability. There are lots of injection points in Druid where we can introduce our own implementations, which greatly helped our integration.
  • 81-82. Open Source Plan: the Druid integration, event filter, collector agent, log collectors, and rtexplorer. We built our tools on top of many excellent open source tools, and it's our pleasure to contribute back. Therefore, we're going to open source all the tools we built sometime this year.