2. Scale
• Billions of search queries per month
• Hundreds of services power the Bing stack
• Thousands of machines across several data centers
• Tens of TBs of events per hour
• Several data processing frameworks
3. Data Curation
• Events from individual services have little value in isolation
• Need correlation of events & curated datasets
– at scale, on time, with high fidelity
– contributes directly to improving service quality & monetization
4. Data Pipelines
• Traditionally implemented entirely using batch processing in the COSMOS infrastructure
o Storage – DFS (similar to HDFS)
o Execution – Dryad (general purpose, more expressive than MapReduce)
o Query – SCOPE (SQL-style scripting language that supports inline C#)
• Data pipelines are adopting near real-time processing – new issues to address
5. NRT Data Pipelines
Key issues to address in stream-processing applications:
• Events are generated in different DCs at a rapid rate
• Events arrive out of order
• Events are delayed or lost
• Managing state can be very expensive and hard to get right
6. Mobius
NRT Processing Scenario – Event Merge Pipeline
Merge – 10-minute app-time window
[Diagram: G, U, and C event streams are read from Kafka in several data centers, joined by application time over a 10-minute window, checkpointed, and output as merged events. Pain points called out: unbalanced partitions, slow Kafka brokers, expensive offset-range lookup, and the application-time join itself.]
7. Unbalanced Kafka Partitions
• Direct API - Kafka partition maps to RDD partition
• Largest partition is the long pole in processing
• Solution
– Repartition data from one Kafka partition into multiple
RDDs w/o extra shuffling cost of DStream.Repartition()
– Repartition threshold is configurable per topic
– DynamicPartitionKafkaRDD.scala at github.com/Microsoft/Mobius
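The splitting rule can be sketched in a few lines of Python. This is an illustrative model of the idea only, not the Mobius code; the function name and the (topic, partition, from_offset, until_offset) tuple layout are assumptions:

```python
# Hypothetical sketch of per-topic repartitioning: each Kafka
# (topic, partition) offset range is split into chunks of at most
# max_partition_size[topic] offsets, so one oversized Kafka partition
# becomes several RDD partitions without a shuffle.
def split_offset_ranges(offset_ranges, max_partition_size):
    """offset_ranges: list of (topic, partition, from_offset, until_offset)."""
    rdd_partitions = []
    for topic, partition, from_offset, until_offset in offset_ranges:
        size = max_partition_size[topic]
        for start in range(from_offset, until_offset, size):
            end = min(start + size, until_offset)
            rdd_partitions.append((topic, partition, start, end))
    return rdd_partitions
```

For example, a single Kafka partition covering offsets 0–250 with a threshold of 100 becomes three RDD partitions, while smaller partitions pass through unchanged.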
8. Slow Kafka Brokers
• Slow Kafka brokers increase batch time
• Delay in starting the next batch accumulates
• Solution
– Submit Kafka data-fetch job on-time (defined by batch
interval) in a separate thread, even when previous
batch delayed
– CSharpDStream.scala at github.com/Microsoft/Mobius
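The scheduling idea can be sketched as follows. This is a simplified Python model, not the Mobius implementation: `run_batches`, `fetch`, and `process` are hypothetical names, and real Spark job submission is reduced to plain threads and a queue:

```python
# Illustrative sketch: each batch's fetch is started on the batch-interval
# boundary in its own thread, so a slow previous batch cannot delay the
# next fetch; processing drains the fetched batches afterwards.
import queue
import threading
import time

def run_batches(fetch, process, batch_interval, num_batches):
    fetched = queue.Queue()

    def fetch_batch(i):
        fetched.put((i, fetch(i)))       # fetch runs off the main thread

    threads = []
    for i in range(num_batches):
        t = threading.Thread(target=fetch_batch, args=(i,))
        t.start()                        # fetch submitted on time...
        threads.append(t)
        time.sleep(batch_interval)       # ...once per batch interval
    for t in threads:
        t.join()

    results = {}
    while not fetched.empty():
        i, data = fetched.get()
        results[i] = process(data)       # processing may lag behind fetching
    return results
```

The point of the sketch is the separation: fetch submission is driven by the clock, while processing time no longer gates when the next fetch begins.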
9. Find Offset-Range Expensive
• Finding Offset-range for {DC X Topic X Partition}is expensive
– Several DCs – 3 topics each – average of 170 partitions per topic
– {Get metadata + get offset range} took 10 mins for 2 min batch window
• {Metadata refresh + Find Offset} and data processing not
parallel
• Solution
– Move find offset-range to a separate thread
– Materialize and cache Kafka RDD in that thread
– DynamicPartitionKafkaInputDStream.scala at github.com/Microsoft/Mobius
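The pipelining pattern can be modeled with a background refresh loop and a non-blocking dequeue. A minimal sketch under stated assumptions (function names are illustrative; Kafka metadata calls and RDD caching are stand-ins passed as callables):

```python
# Hypothetical sketch: a background thread periodically resolves offset
# ranges and pre-materializes the batch, while the main loop dequeues
# ready batches without blocking, so metadata refresh and processing overlap.
import queue
import threading
import time

ready = queue.Queue()

def refresh_loop(get_offset_ranges, materialize, interval, rounds):
    for _ in range(rounds):
        ranges = get_offset_ranges()    # expensive metadata + offset lookup
        ready.put(materialize(ranges))  # cache the batch before it is needed
        time.sleep(interval)

def compute():
    try:
        return ready.get_nowait()       # non-blocking dequeue, like compute()
    except queue.Empty:
        return None                     # nothing materialized yet
```

In the real pipeline `refresh_loop` would run on its own thread (e.g. `threading.Thread(target=refresh_loop, ...)`), so the expensive lookup never sits on the batch's critical path.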
10. Join By Application-Time
• Application-time based join not available in Spark 1.*
• Solution
– Use custom join function in DStream.UpdateStateByKey()
– Custom join function enforces time window based on
application time
– UpdateStateByKey maintains partially joined events as the
state
– PairDStreamFunctions.cs at github.com/Microsoft/Mobius
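The per-key state transition can be sketched in Python. This is a simplified model of the approach, not the Mobius API: the function name and the (event_time, payload) state layout are assumptions, and "join" is reduced to accumulating the key's events:

```python
# Illustrative sketch of joining by application time with per-key state:
# new events are merged into the key's partial state; once the app-time
# window expires, the joined record is emitted instead of kept as state.
def update_state(new_events, old_state, window):
    """new_events / old_state: lists of (event_time, payload) tuples.
    Returns (emitted_events, new_state)."""
    state = (old_state or []) + new_events
    if not state:
        return [], None
    current_time = max(t for t, _ in state)
    if min(t for t, _ in state) + window < current_time:
        return state, None      # window expired: emit joined events
    return [], state            # keep partially joined events as state
```

Events for a key accumulate across batches until the spread of application times exceeds the window (10 minutes in the merge pipeline), at which point the joined group is output to external storage.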
13. Dynamic Repartition
[Diagram: batch processing times at a 2-minute interval, before vs. after dynamic repartition.]
Pseudo Code
class DynamicPartitionKafkaRDD(kafkaPartitionOffsetRanges)
  override def getPartitions {
    // repartition threshold per topic, loaded from config
    val maxRddPartitionSize = Map<topic, partitionSize>
    // split each Kafka offset range at the threshold
    kafkaPartitionOffsetRanges.flatMap { case o =>
      val rddPartitionSize = maxRddPartitionSize(o.topic)
      (o.fromOffset until o.untilOffset by rddPartitionSize).map(
        s => (o.topic, o.partition, s, math.min(o.untilOffset, s + rddPartitionSize)))
    }
  }
Source Code
DynamicPartitionKafkaRDD.scala – https://github.com/Microsoft/Mobius
14. On-time Kafka fetch job submission
[Diagram, driver perspective: for each batch A, B, and C, Job#1 (fetch data) is submitted from a new thread, followed by Job#2 (UpdateStateByKey) and Job#3 (checkpoint); the main thread submits jobs on the batch interval.]
Pseudo Code
class CSharpStateDStream
  override def compute {
    val lastState = getOrCompute(validTime - batchInterval)
    val rdd = parent.getOrCompute(validTime)
    if (!lastBatchCompleted) {
      // last batch not complete yet:
      // run the fetch-data job to materialize rdd in a separate thread
      rdd.cache()
      ThreadPool.execute(sc.runJob(rdd))
      // wait for the job to complete
    }
    <compute the UpdateStateByKey DStream>
  }
Source Code
CSharpDStream.scala - https://github.com/Microsoft/Mobius
15. Parallel Kafka metadata refresh + RDD materialization
[Diagram, driver perspective: a new thread refreshes Kafka offset ranges 1–3 and caches each batch's RDD (RDD.cache) ahead of times t1–t3, while the main thread submits batch jobs 1–3; Kafka metadata refresh runs in parallel with batch job submission.]
Pseudo Code
class DynamicPartitionKafkaInputDStream
  // a separate scheduler thread runs every refreshOffsetsInterval
  refreshOffsetsScheduler.scheduleAtFixedRate(
    <get offset ranges>
    <generate Kafka RDD>
    // materialize and cache
    sc.runJob(kafkaRdd.cache)
    <enqueue Kafka RDD>
  )
  override def compute {
    <dequeue Kafka RDD non-blockingly>
  }
Source Code
DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
16. Use UpdateStateByKey to join DStreams
[Diagram: the G/W DStream and Click DStream feed batch jobs 1–3 (RDD @ time 1, 2, 3); UpdateStateByKey combines them into the State DStream.]
Pseudo Code
Iterator[(K, S)] JoinFunction(
  int pid, Iterator[(K key, Iterator[V] newEvents, S oldState)] events)
{
  foreach (var e in events) {
    val currentTime = e.newEvents.max(ev => ev.eventTime)
    val newState = <e.oldState join e.newEvents>
    if (e.oldState.min(s => s.eventTime) + TimeWindow < currentTime)  // TimeWindow = 10 minutes
      <output newState to external storage>
    else
      yield (e.key, newState)
  }
}
UpdateStateByKey C# API
PairDStreamFunctions.cs, https://github.com/Microsoft/Mobius