213 event processingtalk-deviewkorea.key

Published in: Technology, Business

  1. 1. Real-time Insights into Application Events
  2. 2. Who Are We? - Software engineers on Netflix’s Platform Engineering team, working on very large-scale data infrastructure - Building and operating Netflix’s cloud real-time query service
  3. 3. Why Are We Here?
  4. 4. No Monitoring Metrics Today
  5. 5. “Netflix is a log-generating company that also happens to stream movies” - Adrian Cockcroft photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/
  6. 6. 1,500,000
  7. 7. 70,000,000,000
  8. 8. Making Sense of Billions of Events
  9. 9. A Humble Beginning
  10. 10. Things Changed
  11. 11. (Diagram: the label “Application” repeated ten times)
  12. 12. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  13. 13. What Is Missing?
  14. 14. Interactive Exploration
  15. 15. Use Cases - Real-time Operational Metrics - Business or Product Insights
  16. 16. Getting Results Back in Seconds 150,000
  17. 17. Querying Data Along Different Dimensions
  18. 18. Discover Outstanding Data HTTP 500
  19. 19. Discover Outstanding Data
  20. 20. See Trends Over Time
  21. 21. See Data Distributions
  22. 22. It’s All about Extracting Small Data Out of Big Data
  23. 23. But Then What?
  24. 24. Intelligent Alerts
  25. 25. Guided Debugging in the Right Context
  26. 26. Guided Debugging in the Right Context
  27. 27. Guided Debugging in the Right Context
  28. 28. Technical Challenges
  29. 29. Problem: Minimizing programming effort Solution: - Homogeneous architecture - Separating log production from log consumption
  30. 30. Field Name: Field Value - Client: “API” - Server: “Cryptex” - StatusCode: 200 - ResponseTime: 73
  31. 31. A Single Data Pipeline (diagram labels: LogManager.logEvent(anEvent), Log data, Log Filter, Collector Agent, Log Collectors)
  32. 32. Reliable/Flexible Data Pipeline (diagram labels: Server Farm, Log Filter, Sink Plugin, Log Collectors, Hadoop, Kafka, Druid, ElasticSearch) photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/
  33. 33. Problem: Not All Logs Are Worth Processing Solution: Dynamic Filtering
  34. 34. Problem: Realtime Ingestion Solution: Druid & ElasticSearch
  35. 35. ElasticSearch - Distributed RESTful search and analytics - Lucene-based full-text search - High availability - Faceted search, a little slow
  36. 36. Druid - Real-time indexing and querying - Arbitrary slicing and dicing, rolling up and drilling down - Packaged queries - TopN, Time Series, Histograms, Cardinalities
  37. 37. Druid Architecture (diagram labels: Real-time Nodes, Hand off data, Historical Nodes, Deep Storage, Query API, Broker Nodes, Query Rewrite, Scatter/Gather) slide credit: Eric Cheddar @Metamx
  38. 38. Column Compression (slide credit: Eric Cheddar @Metamx)

      timestamp             publisher          advertiser  gender  country  ...
      2011-01-01T00:01:35Z  bieberfever.com    google.com  Male    USA      0.65
      2011-01-01T00:03:63Z  bieberfever.com    google.com  Male    USA      0.62
      2011-01-01T00:04:51Z  bieberfever.com    google.com  Male    USA      0.45
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0.87
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0.99
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1.53
      ...

      Create Ids: bieberfever.com -> 0, ultratrimfast.com -> 1
      Store: publisher -> [0, 0, 0, 1, 1, 1]
             advertiser -> [0, 0, 0, 0, 0, 0]
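
The id-assignment step on slide 38 can be sketched in a few lines of Python. This is only an illustration of dictionary encoding using the slide's sample values, not Druid's actual storage code; the function name `dictionary_encode` is made up for this example.

```python
def dictionary_encode(values):
    """Map each distinct string to an integer id (in order of first
    appearance) and return the id table plus the encoded column."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused id
        encoded.append(ids[value])
    return ids, encoded


# Sample columns taken from the slide's table.
publisher = ["bieberfever.com"] * 3 + ["ultratrimfast.com"] * 3
advertiser = ["google.com"] * 6

publisher_ids, publisher_column = dictionary_encode(publisher)
advertiser_ids, advertiser_column = dictionary_encode(advertiser)

print(publisher_ids)      # {'bieberfever.com': 0, 'ultratrimfast.com': 1}
print(publisher_column)   # [0, 0, 0, 1, 1, 1]
print(advertiser_column)  # [0, 0, 0, 0, 0, 0]
```

Storing small integers instead of repeated strings is what makes the column cheap to keep and scan; the id table is needed only when results are shown to a user.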
  39. 39. Bitmap Index (slide credit: Eric Cheddar @Metamx)

      timestamp             publisher          advertiser  gender  country  ...
      2011-01-01T00:01:35Z  bieberfever.com    google.com  Male    USA      0.65
      2011-01-01T00:03:63Z  bieberfever.com    google.com  Male    USA      0.62
      2011-01-01T00:04:51Z  bieberfever.com    google.com  Male    USA      0.45
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0.87
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0.99
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1.53
      ...

      bieberfever.com -> [0, 1, 2] -> [111000]
      ultratrimfast.com -> [3, 4, 5] -> [000111]
      Compress: CONCISE http://ricerca.mat.uniroma3.it/users/colanton/co
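
A minimal Python sketch of the bitmap index on slide 39, using plain integers as bitmaps. It assumes uncompressed bitmaps and leaves out the CONCISE compression step the slide mentions; the helper names are hypothetical and this is not Druid's implementation.

```python
def build_bitmap_index(values):
    """Return {value: bitmap}, where bit i of the bitmap is set
    when row i of the column holds that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def as_bit_string(bitmap, n_rows):
    """Render a bitmap with row 0 on the left, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(n_rows))


def matching_rows(bitmap):
    """List the row numbers whose bits are set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Sample columns taken from the slide's table.
publisher = ["bieberfever.com"] * 3 + ["ultratrimfast.com"] * 3
gender = ["Male"] * 3 + ["Female"] * 3

publisher_index = build_bitmap_index(publisher)
gender_index = build_bitmap_index(gender)

print(as_bit_string(publisher_index["bieberfever.com"], 6))    # 111000
print(as_bit_string(publisher_index["ultratrimfast.com"], 6))  # 000111

# Filters combine with bitwise operations instead of row scans.
both = publisher_index["ultratrimfast.com"] & gender_index["Female"]
either = publisher_index["bieberfever.com"] | gender_index["Female"]
print(matching_rows(both))    # [3, 4, 5]
print(matching_rows(either))  # [0, 1, 2, 3, 4, 5]
```

Because AND and OR over dimensions reduce to bitwise operations, a filter touches only the bitmaps of the values it names, which is what makes arbitrary slicing and dicing cheap.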
  40. 40. Problem: JSON Payload Is Tedious Solution: Build a parser
  41. 41. curl -X POST http://druid -d @data
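
To give a sense of why slide 40 calls the JSON payload tedious, here is a rough Python sketch that assembles a Druid native topN query. The slides do not show the parser the team built, so this helper is only a stand-in for the idea of generating the payload from a few arguments. The function name `top_n_query`, the data source `app_events`, the metric name `events`, and the time interval are all assumptions; `StatusCode` is borrowed from the field example on slide 30.

```python
import json


def top_n_query(data_source, dimension, threshold, interval):
    """Build a Druid native topN query that ranks the values of one
    dimension by event count. All argument values are caller-supplied."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "threshold": threshold,
        "metric": "events",  # rank by the count aggregation below
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "events"}],
        "intervals": [interval],
    }


# Hypothetical data source and interval, for illustration only.
query = top_n_query(
    "app_events",
    "StatusCode",
    5,
    "2013-10-14T00:00:00Z/2013-10-15T00:00:00Z",
)
payload = json.dumps(query, indent=2)
print(payload)
```

Saved to a file named `data`, a payload like this is the kind of body the `curl -X POST ... -d @data` command on slide 41 would send.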
  42. 42. There’s More
  43. 43. System Monitoring - System Resilience - System Operability
  44. 44. Problem: So many combinations of configurations Solution: Build a flexible load testing tool
  45. 45. Problem: Managing data sources can be hairy Solution: Use cell-like deployment
  46. 46. (Diagram labels: Log Data Pipeline; Kafka, three times; Druid, three times)
  47. 47. Problem: How do we know everything about the new systems? Solution: Extensive instrumentation with Servo and Atlas
  48. 48. Problem: Many open-source solutions assume static configuration Solution: Integrating with the Netflix platform, particularly Eureka
  49. 49. Problem: ZooKeeper goes down, and so do connections for Kafka clients Solution: Replacing zkClient with Apache Curator
  50. 50. Technology Stacks - Netflix OSS: Powerful cloud computation - Suro: Internal main data pipeline - Kafka: High-throughput and durable message queue - Druid: Efficient real-time multi-dimensional database on large-scale data - ElasticSearch: Distributed search engine - Kibana: ElasticSearch UI - Zookeeper: Distributed coordinator
  51. 51. Thank You!
