Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries

Slides from the OSCON talk on the data platform Netflix uses for event collection, aggregation, and analysis. The platform helps Netflix process and analyze billions of events every day. Attendees will learn how to assemble their own large-scale data pipeline and analytics platform using open source software from NetflixOSS and others, such as Kafka, ElasticSearch, Druid from Metamarkets, and Hive.

Published in: Engineering

  1. 1. Sudhir Tonse (@stonse) Danny Yuan (@g9yuayon) Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software
  2. 2. Data Is the Most Important Asset at Netflix
  3. 3. If all the data is easily available to all teams, it can be leveraged in new and exciting ways
  4. 4. • ~1000 Device Types • ~500 Apps/Web Services • ~100 Billion Events/Day • 3.2M messages per second at peak time • 3GB per second at peak time • Dashboard
  5. 5. Types of Events • User Interface Events: Search Event (‘Matrix’ using PS3 …), Star Rating Event (HoC : 5 stars, Xbox, US, …) • Infrastructural Events: RPC Call (API -> Billing Service, ‘/bill/..’, 200, …), Log Errors (NPE, “Movie is null”, …, …) • Other Events …
  6. 6. Making Sense of Billions of Events
  7. 7. http://netflix.github.io +
  8. 8. A Humble Beginning
  9. 9. Evolution …Scale!
  10. 10. [Diagram: many boxes, each labeled “Application”]
  11. 11. We Want to Process App Data in Hadoop
  12. 12. Our Hadoop Ecosystem
  13. 13. @NetflixOSS Big Data Tools
  14. 14. Hadoop as a Service
  15. 15. Pig Scripting on Steroids
  16. 16. Pig Married to Clojure “Map-Reduce for Clojure”
  17. 17. S3MPER S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  18. 18. Efficient ETL with Cassandra
  19. 19. Offline Analysis
  20. 20. Evolution … Speed!
  21. 21. We Want to Aggregate, Index, and Query Data in Real Time
  22. 22. Interactive Exploration
  23. 23. Let’s walk through some use cases
  24. 24. client activity event * /name = “movieStarts”
  25. 25. Pipeline Challenges • App owners: send and forget • Data scientists: validation, ETL, batch processing • DevOps: stream processing, targeted search
  26. 26. Message Routing
  27. 27. We Want to Consume Data Selectively in Different Ways
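As a rough illustration of selective consumption, the sketch below sends each event to every sink whose filter accepts it, in the spirit of the name = “movieStarts” filter on slide 24. The class `RoutingSketch`, the `Route` type, and the sink names `batch` and `realtime` are hypothetical and do not come from any NetflixOSS API; events are modeled as plain string maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of filter-based routing; not the API of any NetflixOSS library.
class RoutingSketch {
    // A route pairs a sink name with a filter over the event's fields.
    static final class Route {
        final String sink;
        final Predicate<Map<String, String>> filter;

        Route(String sink, Predicate<Map<String, String>> filter) {
            this.sink = sink;
            this.filter = filter;
        }
    }

    // Returns the names of all sinks whose filter accepts the event, in route order.
    static List<String> sinksFor(Map<String, String> event, List<Route> routes) {
        List<String> matched = new ArrayList<>();
        for (Route route : routes) {
            if (route.filter.test(event)) {
                matched.add(route.sink);
            }
        }
        return matched;
    }

    // Example routing table (assumed, for illustration only).
    static List<Route> exampleRoutes() {
        List<Route> routes = new ArrayList<>();
        // Every event goes to the batch sink.
        routes.add(new Route("batch", event -> true));
        // Only "movieStarts" events also go to the real-time sink.
        routes.add(new Route("realtime", event -> "movieStarts".equals(event.get("name"))));
        return routes;
    }

    public static void main(String[] args) {
        Map<String, String> event = new HashMap<>();
        event.put("name", "movieStarts");
        event.put("device", "PS3");
        System.out.println(sinksFor(event, exampleRoutes()));
    }
}
```

Keeping the filters in a routing table, rather than in the producing application, is what lets app owners "send and forget" while each consumer picks only the events it needs.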
  28. 28. Kafka • Message broker • High-throughput • Persistent and replicated
  29. 29. There Is More
  30. 30. Intelligent Alerts
  31. 31. Intelligent Alerts
  32. 32. Guided Debugging in the Right Context
  33. 33. Guided Debugging in the Right Context
  34. 34. Guided Debugging in the Right Context
  35. 35. What We Need • Ad-hoc query with different dimensions • Quick aggregations and Top-N queries • Time series with flexible filters • Quick access to raw data using boolean queries
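To make the "Top-N queries" requirement concrete, here is a minimal in-memory sketch that counts events per value of one dimension and returns the N most frequent values. It only illustrates the shape of the query; it is not how Druid or any other store mentioned in the talk executes it, and the class name `TopNSketch` and the `device` dimension are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory Top-N aggregation; real stores use indexes and segments instead.
class TopNSketch {
    // Counts events per value of the given dimension and returns the n most frequent values.
    static List<String> topN(List<Map<String, String>> events, String dimension, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> event : events) {
            String value = event.get(dimension);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        List<String> values = new ArrayList<>(counts.keySet());
        // Highest count first; ties are broken alphabetically so the result is deterministic.
        values.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(values.subList(0, Math.min(n, values.size())));
    }

    // Builds a one-field event for the example.
    static Map<String, String> event(String device) {
        Map<String, String> e = new HashMap<>();
        e.put("device", device);
        return e;
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = new ArrayList<>();
        events.add(event("PS3"));
        events.add(event("Xbox"));
        events.add(event("PS3"));
        events.add(event("Web"));
        System.out.println(topN(events, "device", 2));
    }
}
```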
  36. 36. Druid • Rapid exploration of high dimensional data • Fast ingestion and querying • Time series
  37. 37. Elasticsearch • Real-time indexing of event streams • Killer feature: boolean search • Great UI: Kibana
  38. 38. The Old Pipeline
  39. 39. The New Pipeline
  40. 40. There Is More
  41. 41. It’s Not All About Counters and Time Series
  42. 42. RequestId   Parent Id  Node Id  Service Name  Status
          4965-4a74   0          123      Edge Service  200
          4965-4a74   123        456      Gateway       200
          4965-4a74   456        789      Service A     200
          4965-4a74e  456        abc      Service B     200
          Status:200
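The rows on slide 42 link each node to its caller through the Parent Id column, which is enough to rebuild a call tree. The sketch below groups rows by parent and prints the tree as indented service names. The `Span` class and `TraceTreeSketch` are hypothetical names; the sketch assumes all rows belong to one request, that the root's parent id is `0`, and that the parent links contain no cycles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuilds a call tree from (parentId, nodeId, service) rows.
class TraceTreeSketch {
    static final class Span {
        final String parentId;
        final String nodeId;
        final String service;

        Span(String parentId, String nodeId, String service) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
        }
    }

    // Renders the tree as indented service names, starting from the children of rootParentId.
    static List<String> render(List<Span> spans, String rootParentId) {
        Map<String, List<Span>> childrenByParent = new HashMap<>();
        for (Span span : spans) {
            childrenByParent.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
        }
        List<String> lines = new ArrayList<>();
        append(childrenByParent, rootParentId, 0, lines);
        return lines;
    }

    // Depth-first walk; assumes the parent links are acyclic.
    private static void append(Map<String, List<Span>> childrenByParent, String parentId,
                               int depth, List<String> lines) {
        for (Span child : childrenByParent.getOrDefault(parentId, new ArrayList<>())) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            line.append(child.service);
            lines.add(line.toString());
            append(childrenByParent, child.nodeId, depth + 1, lines);
        }
    }

    // The four rows shown on the slide (request id and status omitted).
    static List<Span> slideRows() {
        List<Span> rows = new ArrayList<>();
        rows.add(new Span("0", "123", "Edge Service"));
        rows.add(new Span("123", "456", "Gateway"));
        rows.add(new Span("456", "789", "Service A"));
        rows.add(new Span("456", "abc", "Service B"));
        return rows;
    }

    public static void main(String[] args) {
        for (String line : render(slideRows(), "0")) {
            System.out.println(line);
        }
    }
}
```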
  43. 43. Distributed Tracing
  44. 44. Distributed Tracing
  45. 45. Distributed Tracing
  46. 46. A System that Supports All These
  47. 47. A Data Pipeline To Glue Them All
  48. 48. Make It Simple
  49. 49. Message Producing • Simple and Uniform API • messageBus.publish(event)
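  50. 50. Consumption Is Simple Too
          // Subscribe to the message stream; onCompleted() and onError() are omitted, as on the slide.
          consumer.observe().subscribe(new Subscriber<Ackable<IncomingMessage>>() {
              @Override
              public void onNext(Ackable<IncomingMessage> ackable) {
                  // Decode the payload, process it, then acknowledge the message.
                  process(ackable.getEntity(MyEventType.class));
                  ackable.ack();
              }
          });
          // Consumption can be paused and resumed on demand.
          consumer.pause();
          consumer.resume();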
  51. 51. RxJava • Functional reactive programming model • Powerful streaming API • Separation of logic and threading model
  52. 52. Design Decisions • Top Priority: app stability and throughput • Asynchronous operations • Aggressive buffering • Drops messages if necessary
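The design decisions on slide 52 (asynchronous operation, aggressive buffering, dropping messages if necessary) can be sketched with a bounded in-memory queue whose `publish` never blocks the application thread. This is a minimal illustration under assumptions, not the talk's actual client code: `DroppingBufferSketch` and its method names are hypothetical, and a real client would drain the buffer from a background sender thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a non-blocking, drop-on-full message buffer.
class DroppingBufferSketch {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBufferSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller: when the buffer is full the message is dropped and counted.
    boolean publish(String message) {
        boolean accepted = queue.offer(message);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Removes and returns up to max buffered messages; a background sender would call this.
    List<String> drain(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingBufferSketch buffer = new DroppingBufferSketch(2);
        buffer.publish("a");
        buffer.publish("b");
        buffer.publish("c"); // buffer is full, so this message is dropped
        System.out.println(buffer.drain(10) + " dropped=" + buffer.droppedCount());
    }
}
```

Counting drops instead of blocking keeps the host application's latency independent of the pipeline's health, which matches the stated top priority of app stability and throughput.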
  53. 53. Anything Can Fail
  54. 54. Cloud Resiliency
  55. 55. Fault Tolerance Features • Write and forward with auto-reattached EBS (Amazon’s Elastic Block Store) • Disk-backed queue: big-queue • Customized scaling down
  56. 56. There’s More to Do • Contribute to @NetflixOSS • Join us :-)
  57. 57. Summary http://netflix.github.io +
  58. 58. You can build your own web-scale data pipeline using open source components
  59. 59. Thank You! • Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse • Danny Yuan http://www.linkedin.com/pub/danny-yuan/4/374/862 Twitter: @g9yuayon
