Athena 
Streaming with Samza 

Goddess of Wisdom
The team
Brief Background
Stream Processing, Samza, Athena
Kafka: Uber uses Kafka as its logging/event system
Message streams
Near-real-time computation
Stream processing framework: Apache Samza
Platform built on top of Samza: Athena
Current use cases
Aggregation
Pricing
Assess supply and demand in real time to calculate accurate surge multiples
[Diagram: Kafka, Samza, Elasticsearch, S3, Spark, HTTP service. Labels: location updates; realtime queries; batch queries.]
[Diagram: Artemis pipeline on Samza. Components: raw events; mapper jobs (event parsing, filtering, classification); reducer jobs (event aggregation, 1m Agg tiles); inserter jobs; Riak. Groupings: Per Contract, Common.]
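The "1m Agg" tiles in the diagram above suggest events are aggregated per tile in one-minute buckets. A minimal sketch of that idea, assuming string tile IDs and epoch-millisecond event times. The class and method names are hypothetical and are not Athena or Samza APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: counts events per (tile, minute) bucket.
// Illustrative only; not the Artemis reducer implementation.
class TileMinuteAggregator {
    // Key format "tileId@minuteStartMillis" -> number of events seen.
    private final Map<String, Long> counts = new HashMap<>();

    /** Truncates an epoch-millisecond timestamp to the start of its minute. */
    static long minuteStart(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, 60_000L);
    }

    /** Records one event for the given tile at the given event time. */
    void add(String tileId, long epochMillis) {
        counts.merge(tileId + "@" + minuteStart(epochMillis), 1L, Long::sum);
    }

    /** Returns the event count for a tile in the minute containing epochMillis. */
    long count(String tileId, long epochMillis) {
        return counts.getOrDefault(tileId + "@" + minuteStart(epochMillis), 0L);
    }
}
```

In a streaming job, a stage like this would typically emit and clear its buckets on a window timer. The sketch keeps everything in memory for brevity.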
Current use cases
Event-driven update engine
Driver Activation
[Diagram: Kafka, Samza, Onboarding Service, RocksDB. Labels: driver status updates; fetch driver info; retry queue; update status.]
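The driver activation flow above includes a retry queue for status updates. A minimal sketch of that pattern, assuming the downstream update call can be modeled as a function that returns true on success. All names are hypothetical. An in-memory deque stands in for durable storage here; the slide shows RocksDB but does not say how it is used:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of an event-driven updater with a retry queue.
// Illustrative only; not the Athena driver activation job.
class RetryingUpdater {
    // Stand-in for the downstream status update; returns true on success.
    private final Predicate<String> updateStatus;
    private final Deque<String> retryQueue = new ArrayDeque<>();
    private final List<String> updated = new ArrayList<>();

    RetryingUpdater(Predicate<String> updateStatus) {
        this.updateStatus = updateStatus;
    }

    /** Handles one driver status event; a failed update is queued for retry. */
    void process(String driverId) {
        if (updateStatus.test(driverId)) {
            updated.add(driverId);
        } else {
            retryQueue.addLast(driverId);
        }
    }

    /** Retries each queued driver once; drivers that still fail stay queued. */
    void retryPending() {
        int pending = retryQueue.size();
        for (int i = 0; i < pending; i++) {
            String driverId = retryQueue.pollFirst();
            if (updateStatus.test(driverId)) {
                updated.add(driverId);
            } else {
                retryQueue.addLast(driverId);
            }
        }
    }

    /** Driver IDs whose status update succeeded, in completion order. */
    List<String> updated() {
        return updated;
    }

    /** Number of drivers still waiting for a successful update. */
    int pendingRetries() {
        return retryQueue.size();
    }
}
```

A periodic timer is one natural place to call `retryPending()`, so transient downstream failures do not block processing of new events.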
Upcoming use cases
Fraud monitoring and alerting
[Diagram: Kafka, partitioner, alerting service, monitoring service. Labels: real time; hourly; UberX metrics.]
Da Vinci: Streaming platform for data science
[Diagram: Kafka; aggregation job (model generation); RocksDB (historical state); timeseries data; model evaluation; model updates; database / Elasticsearch updates.]
Samza Architecture Overview
Samza Architecture
Basic Structure of a Task
Task
Deployment in Uber
Athena Tooling
Tooling
●  Athena manager
●  Job configuration
●  Unit test framework
●  Graphite integration
●  Codahale library support
●  Maven archetype
●  Artifactory support
Athena Manager
Job configuration
[Diagram: reference.conf (athena-core-lib) and application.conf (Samza job), each covering the Dev, Sandbox, Staging, and Production environments.]
Job configuration
projects {
  artemis {
    job_1 {
      mapper {
        # Settings shared by every environment
        envs.common {
          task.inputs = "kafka.topic_1"
          task.class = "com.uber.athena.SampleTaskClass"
        }
        # Each environment starts from another environment's settings, then adds overrides
        envs.local = ${projects.artemis.job_1.mapper.envs.common}
        envs.local = {
          task.window.ms = 30000
        }
        envs.sandbox = ${projects.artemis.job_1.mapper.envs.local}
        envs.sandbox = {
          yarn.package.path = "http://artifactory..../artifactory/libs-snapshot-local/com/uber/athena/.../hello-athena.tar.gz"
        }
        envs.staging = ${projects.artemis.job_1.mapper.envs.sandbox}
        envs.production = ${projects.artemis.job_1.mapper.envs.sandbox}
      }
    }
  }
}
Job configuration
Unit test framework
[Diagram: TaskUnitTestHarness around a stream job (StreamTask, InitableTask, WindowableTask). Labels: message listener; inject data; custom IncomingMessageEnvelope; custom MessageCollector.]
Unit test framework
String classTaskName = "com.uber.athena.test.TestProcessTask";
// Register the job with the test harness
TaskUnitTestHarness<String, Integer> testProcessTask =
    new TaskUnitTestHarness<>(classTaskName, false, true);
// Register a listener that validates the output of every process() call
testProcessTask.registerMessageListenerOnProcess(new KeyAsserterListener());
// Start the job
testProcessTask.start();
// Inject data (key and value are the test message's key and payload)
testProcessTask.inject(key, value);
// Get the full output
testProcessTask.getResult();
Observations
●  YARN is not bad!
●  Offset lag and buffered messages
●  Kafka appender for ELK
●  Incorrect partition count on the checkpoint topic
●  Config validation needs improvement
●  Job restarts are complicated
●  Built-in metrics are insufficient
Upcoming Samza improvements
●  Seamless upgrades
●  Custom built-in serde support
●  Config validation enhancement
●  Auto benchmarking a Samza job
●  Unit test framework enhancement
Q&A

Introducing Athena: 08/19 Big Data Application Meetup, Talk #3
