SlideShare a Scribd company logo
1 of 9
Download to read offline
Top Five Lessons Learned in
Building Streaming Applications at
Microsoft Bing Scale
Renyi Xiong
Microsoft
Bing Scale Problem – Log Merging
Edge
3.2 GB/s
UI-Layout
2.8 GB/s
Click
69 MB/s
Kafka
in
every
DC
E E E E
U U U U
C C C C
Merged
Log
Raw Logs Databus Event Merge Pipeline
10-minute App-Time Window10
Kafka
Databus
1 2 3 4
• Merge Bing query events with click events
• Lambda architecture: batch- and stream-processing shares the same C#
library
• Spark Streaming in C#
Mobius Talk
• Tomorrow afternoon 3pm at Imperial
• Mobius: C# Language Binding for Spark
Lesson 1: Use UpdateStateByKey to join DStreams
• Problem
• Application time is not supported in Spark 1.6.
• Solution – UpdateStateByKey
• UpdateStateByKey takes a custom JoinFunction as input parameter;
• Custom JoinFunction enforces time window based on Application Time;
• UpdateStateByKey maintains partially joined events as the state
Edge DStream
Click DStream
Batch job 1
RDD @ time 1
Batch job 2
RDD @ time 2
State DStream
UpdateStateByKey
Batch job 3
RDD @ time 3
Pseudo Code
Iterator[(K, S)] JoinFunction(
int pid, Iterator[(K:key, Iterator[V]:newEvents, S:oldState)] events)
{
val currentTime = events.newEvents.max(e => e.eventTime);
foreach (var e in events) {
val newState = <oldState join newEvents>
if (oldState.min(s => s.eventTime) + TimeWindow < currentTime) // TimeWindow 10minutes
<output to external storage>
else
return (key, newState)
}
}
UpdateStateByKey C# API
PairDStreamFunctions.cs, https://github.com/Microsoft/Mobius
Lesson 2: Dynamic Repartition with Kafka Direct Approach
• Problem
• Unbalanced Kafka partitions caused delay in the pipeline
• Solution – Dynamic Repartition
1. Repartition data from one Kafka partition into multiple RDDs without extra
shuffling cost of DStream.Repartition
2. Repartition threshold is configurable per topic
After Dynamic Repartition
Pseudo Code
class DynamicPartitionKafkaRDD(kafkaPartitionOffsetRanges)
override def getPartitions {
// repartition threshold per topic loaded from config
val maxRddPartitionSize = Map<topic, partitionSize>
// apply max repartition threshold
kafkaPartitionOffsetRanges.flatMap { case o =>
val rddPartitionSize = maxRddPartitionSize(o.topic)
(o.fromOffset until o.untilOffset by rddPartitionSize).map(
s => (o.topic, o.partition, s, (o.untilOffset, s + rddPartitionSize)))
}
}
Source Code
DynamicPartitionKafkaRDD.scala - https://github.com/Microsoft/Mobius
2-minute interval Before Dynamic Repartition
Lesson 3: On-time Kafka fetch job submission
• Problem
• Overall job delay accumulates due to transient slow/hot Kafka
broker issue.
• Solution
• Submit Kafka Fetch job on batch interval in a separate thread,
even when previous batch delayed.
Job#A2
UpdateStateByKey
Job#A1
Fetch data
Batch-A
New Thread
Job#A3
Checkpoint
Job#B2
UpdateStateByKey
Job#B1
Fetch data
Batch-B
Job#B3
Checkpoint
Job#C2
UpdateStateByKey
Job#C1
Fetch data
Batch-C
Job#C3
checkpoint
Main Thread
Submit Job
Pseudo Code
class CSharpStateDStream
override def compute {
val lastState = getOrCompute (validTime - batchInterval)
val rdd = parent.getOrCompute(validTime)
if (!lastBatchCompleted) {
// if last batch not complete yet
// run Fetch data job to materialize rdd in a separate thread
rdd.cache()
ThreadPool.execute(sc.runJob(rdd))
// wait for job to complele
}
<compute UpdateStateByKey Dstream>
}
Source Code
CSharpDStream.scala - https://github.com/Microsoft/Mobius
Driver perspective
Lesson 4: Parallel Kafka metadata refresh
• Problem
• Fetching Kafka metadata from multiple data centers often takes more time than expected.
• Solution
• Customize DirectKafkaInputDStream, move metadata refresh for each {topic, data-center} to a
separate thread
Kafka
offset range 1
t3
Main Thread
t2
New Thread
t1
Kafka
offset range 2
Kafka
offset range 3
batch 1 batch 2 batch 3
Batch Job Submission
Kafka Metadata refresh
Pseudo Code
class DynamicPartitionKafkaInputDStream {
// starts a separate schedule thread at refreshOffsetsInterval
refreshOffsetsScheduler.scheduleAtFixedRate(
<get offset ranges>
<enqueue offset ranges>
)
override def compute {
<dequeue offset ranges nonblockingly>
<generate kafka RDD>
}
}
Source Code
DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
Driver perspective
Lesson 5: Parallel Kafka metadata refresh + RDD materialization
• Problem
• Kafka data fetch and data processing not in parallel
• Solution
• Take Lesson 4 further
• In metadata refresh thread, materialize and cache
Kafka RDD
Kafka
offset range 1
RDD.Cache
t3
Main Thread
t2
New Thread
t1
Kafka
offset range 2
RDD.Cache
Kafka
offset range 3
RDD.Cache
batch 1 batch 2 batch 3
Batch Job Submission
Kafka Metadata refresh
Driver perspective
Pseudo Code
class DynamicPartitionKafkaInputDStream
// starts a separate schedule thread at refreshOffsetsInterval
refreshOffsetsScheduler.scheduleAtFixedRate(
<get offset ranges>
<generate kafka RDD>
// materialize and cache
sc.runJob(kafkaRdd.cache)
<enqueue kafka RDD>
)
override def compute {
<dequeue kafka RDD nonblockingly>
}
Source Code
DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
THANK YOU.
• Special thanks to TD and Ram from Databricks for all the support
• Contact us
• Renyi Xiong, renyix@microsoft.com
• Kaarthik Sivashanmugam, ksivas@microsoft.com
• mobiuscore@microsoft.com

More Related Content

Viewers also liked

Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudJen Aman
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingJen Aman
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web ServiceSpark Summit
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleJen Aman
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkSpark Summit
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Jen Aman
 
Vskills certified html5 developer Notes
Vskills certified html5 developer NotesVskills certified html5 developer Notes
Vskills certified html5 developer NotesVskills
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
Developing apache spark jobs in .net using mobius
Developing apache spark jobs in .net using mobiusDeveloping apache spark jobs in .net using mobius
Developing apache spark jobs in .net using mobiusshareddatamsft
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browsergethue
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 

Viewers also liked (20)

Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
Vskills certified html5 developer Notes
Vskills certified html5 developer NotesVskills certified html5 developer Notes
Vskills certified html5 developer Notes
 
Solution Architecture Cassandra
Solution Architecture CassandraSolution Architecture Cassandra
Solution Architecture Cassandra
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Developing apache spark jobs in .net using mobius
Developing apache spark jobs in .net using mobiusDeveloping apache spark jobs in .net using mobius
Developing apache spark jobs in .net using mobius
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your BrowserBig Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Amazon Redshift Analytical functions
Amazon Redshift Analytical functionsAmazon Redshift Analytical functions
Amazon Redshift Analytical functions
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 

More from Jen Aman (20)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

Top 5 Lessons Learned in Building Streaming Applications at Microsoft Bing Scale

  • 1. Top Five Lessons Learned in Building Streaming Applications at Microsoft Bing Scale Renyi Xiong Microsoft
  • 2. Bing Scale Problem – Log Merging Edge 3.2 GB/s UI-Layout 2.8 GB/s Click 69 MB/s Kafka in every DC E E E E U U U U C C C C Merged Log Raw Logs Databus Event Merge Pipeline 10-minute App-Time Window10 Kafka Databus 1 2 3 4 • Merge Bing query events with click events • Lambda architecture: batch- and stream-processing shares the same C# library • Spark Streaming in C#
  • 3. Mobius Talk • Tomorrow afternoon 3pm at Imperial • Mobius: C# Language Binding for Spark
  • 4. Lesson 1: Use UpdateStateByKey to join DStreams • Problem • Application time is not supported in Spark 1.6. • Solution – UpdateStateByKey • UpdateStateByKey takes a custom JoinFunction as input parameter; • Custom JoinFunction enforces time window based on Application Time; • UpdateStateByKey maintains partially joined events as the state Edge DStream Click DStream Batch job 1 RDD @ time 1 Batch job 2 RDD @ time 2 State DStream UpdateStateByKey Batch job 3 RDD @ time 3 Pseudo Code Iterator[(K, S)] JoinFunction( int pid, Iterator[(K:key, Iterator[V]:newEvents, S:oldState)] events) { val currentTime = events.newEvents.max(e => e.eventTime); foreach (var e in events) { val newState = <oldState join newEvents> if (oldState.min(s => s.eventTime) + TimeWindow < currentTime) // TimeWindow 10minutes <output to external storage> else return (key, newState) } } UpdateStateByKey C# API PairDStreamFunctions.cs, https://github.com/Microsoft/Mobius
  • 5. Lesson 2: Dynamic Repartition with Kafka Direct Approach • Problem • Unbalanced Kafka partitions caused delay in the pipeline • Solution – Dynamic Repartition 1. Repartition data from one Kafka partition into multiple RDDs without extra shuffling cost of DStream.Repartition 2. Repartition threshold is configurable per topic After Dynamic Repartition Pseudo Code class DynamicPartitionKafkaRDD(kafkaPartitionOffsetRanges) override def getPartitions { // repartition threshold per topic loaded from config val maxRddPartitionSize = Map<topic, partitionSize> // apply max repartition threshold kafkaPartitionOffsetRanges.flatMap { case o => val rddPartitionSize = maxRddPartitionSize(o.topic) (o.fromOffset until o.untilOffset by rddPartitionSize).map( s => (o.topic, o.partition, s, (o.untilOffset, s + rddPartitionSize))) } } Source Code DynamicPartitionKafkaRDD.scala - https://github.com/Microsoft/Mobius 2-minute interval Before Dynamic Repartition
  • 6. Lesson 3: On-time Kafka fetch job submission • Problem • Overall job delay accumulates due to transient slow/hot Kafka broker issue. • Solution • Submit Kafka Fetch job on batch interval in a separate thread, even when previous batch delayed. Job#A2 UpdateStateByKey Job#A1 Fetch data Batch-A New Thread Job#A3 Checkpoint Job#B2 UpdateStateByKey Job#B1 Fetch data Batch-B Job#B3 Checkpoint Job#C2 UpdateStateByKey Job#C1 Fetch data Batch-C Job#C3 checkpoint Main Thread Submit Job Pseudo Code class CSharpStateDStream override def compute { val lastState = getOrCompute (validTime - batchInterval) val rdd = parent.getOrCompute(validTime) if (!lastBatchCompleted) { // if last batch not complete yet // run Fetch data job to materialize rdd in a separate thread rdd.cache() ThreadPool.execute(sc.runJob(rdd)) // wait for job to complele } <compute UpdateStateByKey Dstream> } Source Code CSharpDStream.scala - https://github.com/Microsoft/Mobius Driver perspective
  • 7. Lesson 4: Parallel Kafka metadata refresh • Problem • Fetching Kafka metadata from multiple data centers often takes more time than expected. • Solution • Customize DirectKafkaInputDStream, move metadata refresh for each {topic, data-center} to a separate thread Kafka offset range 1 t3 Main Thread t2 New Thread t1 Kafka offset range 2 Kafka offset range 3 batch 1 batch 2 batch 3 Batch Job Submission Kafka Metadata refresh Pseudo Code class DynamicPartitionKafkaInputDStream { // starts a separate schedule thread at refreshOffsetsInterval refreshOffsetsScheduler.scheduleAtFixedRate( <get offset ranges> <enqueue offset ranges> ) override def compute { <dequeue offset ranges nonblockingly> <generate kafka RDD> } } Source Code DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius Driver perspective
  • 8. Lesson 5: Parallel Kafka metadata refresh + RDD materialization • Problem • Kafka data fetch and data processing not in parallel • Solution • Take Lesson 4 further • In metadata refresh thread, materialize and cache Kafka RDD Kafka offset range 1 RDD.Cache t3 Main Thread t2 New Thread t1 Kafka offset range 2 RDD.Cache Kafka offset range 3 RDD.Cache batch 1 batch 2 batch 3 Batch Job Submission Kafka Metadata refresh Driver perspective Pseudo Code class DynamicPartitionKafkaInputDStream // starts a separate schedule thread at refreshOffsetsInterval refreshOffsetsScheduler.scheduleAtFixedRate( <get offset ranges> <generate kafka RDD> // materialize and cache sc.runJob(kafkaRdd.cache) <enqueue kafka RDD> ) override def compute { <dequeue kafka RDD nonblockingly> } Source Code DynamicPartitionKafkaInputDStream.scala - https://github.com/Microsoft/Mobius
  • 9. THANK YOU. • Special thanks to TD and Ram from Databricks for all the support • Contact us • Renyi Xiong, renyix@microsoft.com • Kaarthik Sivashanmugam, ksivas@microsoft.com • mobiuscore@microsoft.com