- The document discusses Uber's use of stream processing to enable real-time analytics and complex event processing over streaming data from its global ridesharing marketplace.
- Key applications include real-time OLAP, detecting patterns in event streams, and supply positioning to monitor marketplace health.
- The architecture uses Apache Kafka for event collection and Apache Samza for event processing, with downstream storage and visualization applications. It addresses the challenges of processing large-scale, real-time geo-temporal data streams.
44. Natural Choice: Apache Kafka
- Low latency and high throughput
- Persistent events
- Distributes a topic by partitions
- Groups consumers by consumer groups
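
The last two bullets can be illustrated with a small simulation. This is a minimal sketch, not Kafka client code: `partition_for` and `assign_partitions` are hypothetical helper names, the partition count is made up, CRC32 stands in for Kafka's real key hash, and round-robin is only one of several possible assignment strategies.

```python
import zlib
from typing import Dict, List

NUM_PARTITIONS = 4  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers: List[str],
                      num_partitions: int = NUM_PARTITIONS) -> Dict[str, List[int]]:
    """Spread a topic's partitions round-robin over one consumer group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Events with the same key always land in the same partition.
same = partition_for("rider-42") == partition_for("rider-42")

# Each partition is read by exactly one member of a given group.
group = assign_partitions(["consumer-1", "consumer-2"])
```

Within a group every partition has a single reader, so adding consumers (up to the partition count) spreads the load, while a second group with a different name would receive its own full copy of the topic.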
55. Why Apache Samza?
- DAG on Kafka
- Excellent integration with Kafka
- Built-in checkpointing
- Built-in state management
- Excellent support from our data team
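
To make the checkpointing and state management bullets concrete, here is a minimal sketch of the idea in plain Python. It does not use Samza's API: `CountingTask` is a hypothetical class, the input log is an in-memory list standing in for a Kafka partition, and the checkpoint is a plain dictionary rather than a durable store.

```python
class CountingTask:
    """Counts events per key and checkpoints its input offset with its state."""

    def __init__(self, checkpoint=None):
        # Restore from a checkpoint if one exists, otherwise start fresh.
        checkpoint = checkpoint or {"offset": 0, "state": {}}
        self.offset = checkpoint["offset"]      # next log position to read
        self.state = dict(checkpoint["state"])  # local key -> count store

    def process(self, log):
        """Consume every message from the saved offset to the end of the log."""
        for key in log[self.offset:]:
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        """Snapshot offset and state together so a restart resumes consistently."""
        return {"offset": self.offset, "state": dict(self.state)}


log = ["sf", "nyc", "sf"]
task = CountingTask()
task.process(log)
saved = task.checkpoint()

# Simulate a crash and restart: new events arrive, a fresh task resumes.
log += ["sf", "la"]
restarted = CountingTask(saved)
restarted.process(log)
```

Because the offset and the state are saved together, the restarted task neither re-counts the first three events nor skips the two new ones.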
80. We Use Lambda
- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but
- We may need to go way back due to changes in business requirements
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance
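
The Lambda pattern named on this slide can be sketched as a batch view plus a real-time view merged at query time. This is an illustrative toy, assuming events are dictionaries with a hypothetical `city` field; the function names are made up, and in-memory counters stand in for the Spark + HDFS/S3 batch layer and the streaming layer.

```python
from collections import Counter


def batch_view(historical_events):
    """Recompute counts over the full history (the batch layer's role)."""
    return Counter(e["city"] for e in historical_events)


def speed_view(recent_events):
    """Incrementally count events that arrived after the last batch run."""
    view = Counter()
    for e in recent_events:
        view[e["city"]] += 1
    return view


def query(city, batch, speed):
    """Serve a query by merging the batch view with the real-time view."""
    return batch[city] + speed[city]


historical = [{"city": "sf"}, {"city": "sf"}, {"city": "nyc"}]
recent = [{"city": "sf"}]

batch = batch_view(historical)
speed = speed_view(recent)
sf_total = query("sf", batch, speed)
```

If business requirements change, only `batch_view` needs to be rerun over the stored history, which is the "go way back" case the slide mentions. The cost is maintaining the same logic in two places.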
83. Dealing with Limitations of Samza
- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can’t have arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. In-house code-gen and deployment solution
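
The broadcasting workaround in the first bullet can be pictured as a custom grouping of input stream partitions into tasks. The sketch below is plain Python, not Samza's Java interface, and is not the speakers' implementation: `group_with_broadcast`, the stream names, and the task naming are all assumptions made for illustration. The idea is that regular streams are grouped by partition id, while every partition of a designated broadcast stream is handed to every task.

```python
from collections import defaultdict


def group_with_broadcast(stream_partitions, broadcast_streams):
    """Group (stream, partition) pairs into tasks, one task per partition id.

    Partitions of streams listed in broadcast_streams go to every task.
    """
    tasks = defaultdict(set)
    regular = [sp for sp in stream_partitions if sp[0] not in broadcast_streams]
    broadcast = [sp for sp in stream_partitions if sp[0] in broadcast_streams]

    # Regular inputs: partition i of every stream goes to task i.
    for stream, partition in regular:
        tasks[f"Partition {partition}"].add((stream, partition))

    # Broadcast inputs: every task reads all of them.
    for task_inputs in tasks.values():
        task_inputs.update(broadcast)
    return dict(tasks)


# Hypothetical inputs: a partitioned "trips" stream and a "rules" stream
# whose single partition every task should see.
inputs = {("trips", 0), ("trips", 1), ("rules", 0)}
grouping = group_with_broadcast(inputs, {"rules"})
```

With this grouping, a rule change published once on the broadcast stream reaches every task, at the cost of each task consuming that stream in full.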