Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Stream Processing &
Analytics with Flink
Danny Yuan, Engineer @ Uber
@g9yuayon
Four Kinds of Analytics
- On demand aggregation and pattern detection
- Clustering
- Forecasting
- Pattern detection on ge...
Two Ingredients
Geo/Spatial Time
Real-time aggregation and pattern matching
Slide title
Complex Event Processing
How many cars enter and exit a user defined area in past 5 minutes
Examples
It doesn’t have to be within a time or area
CEP with full historical context
Notify me if a partner completed her 100th tr...
Patterns in the future
How many first-time riders will be dropped off in a given area in the next 5
minutes?
Patterns in the future
How many first-time riders will be dropped off in a given area in the next 5
minutes?
Geo: user flexibility is important
Geo: user flexibility is important
It needs to be scalable
It needs to be scalable
It needs to be scalable
- Every hexagon
- Every driver/rider
CEP Pipeline Built on Samza
- No hard-coded CEP rules
- Applying CEP rules per individual entity: topic, driver,
rider, co...
Slide title
Slide title
Slide title
Slide title
Slide title
Slide title
Slide title
Slide title
We need to evolve our architecture for other analytics
Clustering
Slide title Manually Created Cluster
Slide titleCall for algorithmically created clusters
- Clustering based on key performance metrics
Slide titleCall for algorithmically created cluster
- Clustering based on key performance metrics
- Continuously measure t...
Slide titleCall for algorithmically created clusters
- Clustering based on key performance metrics
- Continuously measure ...
Slide titleCall for algorithmically created clusters
- Clustering based on key performance metrics
- Continuously measure ...
Slide titleCall for algorithmically created clusters
- Clustering based on key performance metrics
- Continuously measure ...
Slide title
Home-grown Clustering Service
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Pluggable algorithms and measurements
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Hi...
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Hi...
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Hi...
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Hi...
Slide title
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Hi...
Forecasting
- Every decision is based on forecasting
Forecasting
- Forecasting based on both historical data and stream input
Forecasting
- Forecasting based on both historical data and stream input
Forecasting
- Forecasting based on both historical data and stream input
Anomaly, 

or emerging demand?
Forecasting
- Spatially granular forecasting - down to every hexagon
Forecasting
- Spatially granular forecasting - down to every hexagon
Forecasting
- Temporally granular forecasting - down to every minute
Forecasting
- Temporally granular forecasting - down to every minute
Pattern Detection
- Similarity of different metrics across geolocation and time
- Metric outliers across geolocations and ...
Common Requirements in Pattern Detection
- Not just traditional time series analysis
- Incorporating insights on marketpla...
Example: Anomaly Detection
- Simple time series analysis
- For a single geo area
- Can be noisy
A More Realistic Anomaly Detection
Example: Anomaly Detection
Example: Anomaly Detection
What’s the right architecture to support the analytics use cases?
Shared abstraction: multi-dimensional geo-temporal data
- Time series by event time
Shared abstraction: multi-dimensional geo-temporal data
- Time series by event time
Shared abstraction: multi-dimensional geo-temporal data
https://www.oreilly.com/ideas/the-worl...
- Time series by event time
Shared abstraction: multi-dimensional geo-temporal data
https://www.oreilly.com/ideas/the-worl...
- Time series by event time
Shared abstraction: multi-dimensional geo-temporal data
https://www.oreilly.com/ideas/the-worl...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
Shared abstraction: multi-di...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- e.g. event-based triggers
...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- e.g. event-based triggers
...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing
Shared...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing. E.g.,...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing
- Unif...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing
- Unif...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing
- Unif...
- Time series by event time
- Flexible windowing - tumbling, sliding, conditionally triggered
- Stateful processing
- Unif...
- Ordering by event time
- Flexible windowing with watermark and triggers
- Exactly-once semantics
- Built-in state manage...
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
Mental Picture for Processing Geo-temporal Data
A Simple Example: simple prediction
Sources	
			.fromKafka()	
			.config(config)	
			.cluster(aCluster)	
			.topics(topicL...
A Simple Example
assignTimestampsAndWatermarks
A Simple Example
keyBy(…)
A Simple Example
.timeWindow(…)
A Simple Example
.flatMap(…)
A Simple Example
.keyBy(…)
A Simple Example
.apply(statefulFn)
A Simple Example
.addSink(…)
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
High Level Data Flow
Geotemporal API for efficiency
Geotemporal API for efficiency
Geotemporal API for productivity
Geotemporal API for productivity
Geotemporal API for productivity
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Forecasting as an example
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up, namely single-node optimization matters
-...
Lessons Learned
- Make sure you have robust infrastructure support
Lessons Learned
- Make sure you have robust infrastructure support
Lessons Learned
- Make sure you have robust infrastructure support
Lessons Learned
- Make sure you have robust infrastructure support
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up, namely single-node optimization matters
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up, namely single-node optimization matters
-...
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up, namely single-node optimization matters
-...
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up, namely single-node optimization matters
-...
Choose a Stream Processing Platform
Thank You
Upcoming SlideShare
Loading in …5
×

Streaming Analytics in Uber

788 views

Published on

Streaming Analytics in Uber for QCon SF 2016

Published in: Internet
  • Be the first to comment

Streaming Analytics in Uber

  1. 1. Stream Processing & Analytics with Flink Danny Yuan, Engineer @ Uber @g9yuayon
  2. 2. Four Kinds of Analytics - On demand aggregation and pattern detection - Clustering - Forecasting - Pattern detection on geo-temporal data
  3. 3. Two Ingredients Geo/Spatial Time
  4. 4. Real-time aggregation and pattern matching
  5. 5. Slide title Complex Event Processing
  6. 6. How many cars enter and exit a user defined area in past 5 minutes Examples
  7. 7. It doesn’t have to be within a time or area CEP with full historical context Notify me if a partner completed her 100th trip in a given area just now?
  8. 8. Patterns in the future How many first-time riders will be dropped off in a given area in the next 5 minutes?
  9. 9. Patterns in the future How many first-time riders will be dropped off in a given area in the next 5 minutes?
  10. 10. Geo: user flexibility is important
  11. 11. Geo: user flexibility is important
  12. 12. It needs to be scalable
  13. 13. It needs to be scalable
  14. 14. It needs to be scalable - Every hexagon - Every driver/rider
  15. 15. CEP Pipeline Built on Samza - No hard-coded CEP rules - Applying CEP rules per individual entity: topic, driver, rider, cohorts, and etc - Flexible checkpointing and statement management
  16. 16. Slide title
  17. 17. Slide title
  18. 18. Slide title
  19. 19. Slide title
  20. 20. Slide title
  21. 21. Slide title
  22. 22. Slide title
  23. 23. Slide title
  24. 24. We need to evolve our architecture for other analytics
  25. 25. Clustering
  26. 26. Slide title Manually Created Cluster
  27. 27. Slide titleCall for algorithmically created clusters - Clustering based on key performance metrics
  28. 28. Slide titleCall for algorithmically created cluster - Clustering based on key performance metrics - Continuously measure the clusters
  29. 29. Slide titleCall for algorithmically created clusters - Clustering based on key performance metrics - Continuously measure the clusters - Different clustering for different business needs
  30. 30. Slide titleCall for algorithmically created clusters - Clustering based on key performance metrics - Continuously measure the clusters - Different clustering for different business needs - Create clusters in minutes for all cities
  31. 31. Slide titleCall for algorithmically created clusters - Clustering based on key performance metrics - Continuously measure the clusters - Different clustering for different business needs - Create clusters in minutes for all cities - Foundation for other stream analytics
  32. 32. Slide title Home-grown Clustering Service
  33. 33. Slide title Home-grown Clustering Service - All cities under 3 minutes
  34. 34. Slide title Home-grown Clustering Service - All cities under 3 minutes - Pluggable algorithms and measurements
  35. 35. Slide title Home-grown Clustering Service - All cities under 3 minutes - Easily pluggable algorithms and measurements - Historical geo-temporal data for clustering
  36. 36. Slide title Home-grown Clustering Service - All cities under 3 minutes - Easily pluggable algorithms and measurements - Historical geo-temporal data for clustering - Real-time geo-temporal data for measurement
  37. 37. Slide title Home-grown Clustering Service - All cities under 3 minutes - Easily pluggable algorithms and measurements - Historical geo-temporal data for clustering - Real-time geo-temporal data for measurement - Shared optimizations
  38. 38. Slide title Home-grown Clustering Service - All cities under 3 minutes - Easily pluggable algorithms and measurements - Historical geo-temporal data for clustering - Real-time geo-temporal data for measurement - Shared optimizations. To put things in perspective: - 70,000 hexagons in SF - Naive distance function requires at least 70,000 x 70,000 = 4.9 billion pairs!
  39. 39. Slide title Home-grown Clustering Service - All cities under 3 minutes - Easily pluggable algorithms and measurements - Historical geo-temporal data for clustering - Real-time geo-temporal data for measurement - Shared optimizations - Incremental updates - Compact data representation - Memoization - Avoid anything more complex than O(nlog(n))
  40. 40. Forecasting - Every decision is based on forecasting
  41. 41. Forecasting - Forecasting based on both historical data and stream input
  42. 42. Forecasting - Forecasting based on both historical data and stream input
  43. 43. Forecasting - Forecasting based on both historical data and stream input Anomaly, 
 or emerging demand?
  44. 44. Forecasting - Spatially granular forecasting - down to every hexagon
  45. 45. Forecasting - Spatially granular forecasting - down to every hexagon
  46. 46. Forecasting - Temporally granular forecasting - down to every minute
  47. 47. Forecasting - Temporally granular forecasting - down to every minute
  48. 48. Pattern Detection - Similarity of different metrics across geolocation and time - Metric outliers across geolocations and time - Frequent occurrences of certain patterns - Clustered behavior - Anomalies
  49. 49. Common Requirements in Pattern Detection - Not just traditional time series analysis - Incorporating insights on marketplace data - Required both historical data and real-time input - Spatially granular patterns - down to every hexagon - Temporally granular patterns - down to every minute
  50. 50. Example: Anomaly Detection - Simple time series analysis - For a single geo area - Can be noisy
  51. 51. A More Realistic Anomaly Detection
  52. 52. Example: Anomaly Detection
  53. 53. Example: Anomaly Detection
  54. 54. What’s the right architecture to support the analytics use cases?
  55. 55. Shared abstraction: multi-dimensional geo-temporal data
  56. 56. - Time series by event time Shared abstraction: multi-dimensional geo-temporal data
  57. 57. - Time series by event time Shared abstraction: multi-dimensional geo-temporal data https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  58. 58. - Time series by event time Shared abstraction: multi-dimensional geo-temporal data https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  59. 59. - Time series by event time Shared abstraction: multi-dimensional geo-temporal data https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  60. 60. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered Shared abstraction: multi-dimensional geo-temporal data
  61. 61. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - e.g. event-based triggers Shared abstraction: multi-dimensional geo-temporal data
  62. 62. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - e.g. event-based triggers - e.g., triggers of computation results Shared abstraction: multi-dimensional geo-temporal data
  63. 63. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing Shared abstraction: multi-dimensional geo-temporal data
  64. 64. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., Shared abstraction: multi-dimensional geo-temporal data
  65. 65. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., Shared abstraction: multi-dimensional geo-temporal data
  66. 66. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., Shared abstraction: multi-dimensional geo-temporal data
  67. 67. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., Shared abstraction: multi-dimensional geo-temporal data
  68. 68. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., State Shared abstraction: multi-dimensional geo-temporal data
  69. 69. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing. E.g., State per key Shared abstraction: multi-dimensional geo-temporal data
  70. 70. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing - Unified stream Shared abstraction: multi-dimensional geo-temporal data
  71. 71. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing - Unified stream - Real-time streams: unbounded streams Shared abstraction: multi-dimensional geo-temporal data
  72. 72. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing - Unified stream - Real-time streams: unbounded streams - Batch: bounded streams Shared abstraction: multi-dimensional geo-temporal data
  73. 73. - Time series by event time - Flexible windowing - tumbling, sliding, conditionally triggered - Stateful processing - Unified stream - Real-time streams: unbounded streams - Batch: bounded streams - s/lambda/kappa Shared abstraction: multi-dimensional geo-temporal data
  74. 74. - Ordering by event time - Flexible windowing with watermark and triggers - Exactly-once semantics - Built-in state management and checkpointing - Nice data flow APIs Apache Flink
  75. 75. Mental Picture for Processing Geo-temporal Data
  76. 76. Mental Picture for Processing Geo-temporal Data
  77. 77. Mental Picture for Processing Geo-temporal Data
  78. 78. Mental Picture for Processing Geo-temporal Data
  79. 79. Mental Picture for Processing Geo-temporal Data
  80. 80. Mental Picture for Processing Geo-temporal Data
  81. 81. Mental Picture for Processing Geo-temporal Data
  82. 82. A Simple Example: simple prediction Sources .fromKafka() .config(config) .cluster(aCluster) .topics(topicList)
  83. 83. A Simple Example assignTimestampsAndWatermarks
  84. 84. A Simple Example keyBy(…)
  85. 85. A Simple Example .timeWindow(…)
  86. 86. A Simple Example .flatMap(…)
  87. 87. A Simple Example .keyBy(…)
  88. 88. A Simple Example .apply(statefulFn)
  89. 89. A Simple Example .addSink(…)
  90. 90. High Level Data Flow
  91. 91. High Level Data Flow
  92. 92. High Level Data Flow
  93. 93. High Level Data Flow
  94. 94. High Level Data Flow
  95. 95. High Level Data Flow
  96. 96. High Level Data Flow
  97. 97. High Level Data Flow
  98. 98. High Level Data Flow
  99. 99. High Level Data Flow
  100. 100. High Level Data Flow
  101. 101. Geotemporal API for efficiency
  102. 102. Geotemporal API for efficiency
  103. 103. Geotemporal API for productivity
  104. 104. Geotemporal API for productivity
  105. 105. Geotemporal API for productivity
  106. 106. Forecasting as an example
  107. 107. Forecasting as an example
  108. 108. Forecasting as an example
  109. 109. Forecasting as an example
  110. 110. Forecasting as an example
  111. 111. Forecasting as an example
  112. 112. Forecasting as an example
  113. 113. Forecasting as an example
  114. 114. Forecasting as an example
  115. 115. Forecasting as an example
  116. 116. Forecasting as an example
  117. 117. Forecasting as an example
  118. 118. Forecasting as an example
  119. 119. Forecasting as an example
  120. 120. Forecasting as an example
  121. 121. Forecasting as an example
  122. 122. Forecasting as an example
  123. 123. Lessons Learned - Make sure you have robust infrastructure support - Scaling up, namely single-node optimization matters - Ensure exactly-once by proper data modeling - Use external state store to avoid too much snapshotting - Standardize monitoring and data validation
  124. 124. Lessons Learned - Make sure you have robust infrastructure support
  125. 125. Lessons Learned - Make sure you have robust infrastructure support
  126. 126. Lessons Learned - Make sure you have robust infrastructure support
  127. 127. Lessons Learned - Make sure you have robust infrastructure support
  128. 128. Lessons Learned - Make sure you have robust infrastructure support - Scaling up, namely single-node optimization matters
  129. 129. Lessons Learned - Make sure you have robust infrastructure support - Scaling up, namely single-node optimization matters - Ensure exactly-once by proper data modeling
  130. 130. Lessons Learned - Make sure you have robust infrastructure support - Scaling up, namely single-node optimization matters - Ensure exactly-once by proper data modeling - Use external state store to avoid too much snapshotting - Standardize monitoring and data validation
  131. 131. Lessons Learned - Make sure you have robust infrastructure support - Scaling up, namely single-node optimization matters - Ensure exactly-once by proper data modeling - Standardize monitoring and data validation
  132. 132. Choose a Stream Processing Platform
  133. 133. Thank You

×