Siphon is a highly available, reliable distributed pub/sub system built on Apache Kafka. It is used to publish, discover and subscribe to near real-time data streams for operational and product intelligence. Siphon serves as a “Databus” for a variety of producers and subscribers at Microsoft, and it complies with security and privacy requirements. It has built-in auditing and quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, its scale, and real-world production experience from operating the service in the Microsoft cloud environment.
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
1. Thursday, April 14, 2016
Siphon – Near Real Time Databus Using Kafka
Eric Boyd – CVP Engineering – Microsoft
Nitin Kumar – Principal Eng Manager - Microsoft
9. Bing Ads Execution
• Shipped once every 6 months
• Averaged 3 marketplace experiments per month
• Big bets on marketplace features that didn’t work.
• Focused teams on 6 tracks with independent metrics.
• Pushed teams to ship as quickly as they could, focusing only on moving their metric.
• Built/borrowed infrastructure to enable much more rapid experimentation.
• Over 3 years got to a rate of >1000 experiments a month
11. What drove the turnaround?
• Focus on small teams, each driving a clear metric.
• Pushing each team to experiment and iterate as fast as possible. Data alone determines what gets shipped.
• Iterated on key metrics until we found the ones with the most impact.
• Commitment that we would get 1.5-2% better each month, and ship a package of experimentally tested improvements each month.
12. Relationship with Open Source
• From “Linux is a cancer…”
• To contributing to open source
• Storm with C# - SCP.NET (http://www.nuget.org/packages/Microsoft.SCP.Net.SDK/)
• Spark with C# - Mobius (https://github.com/Microsoft/Mobius)
• Kafka with C# - C# Client for Kafka (https://github.com/Microsoft/Kafkanet)
• BOND (https://github.com/Microsoft/bond)
• Across MSFT
• C#
• VSCode
• Hyper-V drivers for Linux
• https://github.com/Microsoft/ with 18 pages of repositories!
13. Microsoft Big Data History
• Massive batch oriented systems
• Hundreds of thousands of machines
• Exabytes of storage
• SQL-like language with C# extensions
16. Vision
• A Databus for all Near Real Time (NRT) data in an organization.
• Quick and easy publication, discovery and subscription of NRT datasets.
• Compatibility with various stream processing systems like Storm, Spark, Splunk.
18. Usage
Bing Ads Campaign perf
Bing Live site telemetry
Cortana
Office 365
[Chart: Siphon Data Volume (Ingress and Egress). Y-axis: throughput in GBps. Series: volume published, volume subscribed, total volume.]
[Chart: Siphon Events per Second (Ingress and Egress). Y-axis: throughput in millions of events per second. Series: EPS in, EPS out, total EPS.]
• 1.3 million events per second ingress at peak
• ~1 trillion events per day processed at peak
• 3.5 petabytes processed per day
• 100 thousand unique devices and machines
• 1,300 production Kafka brokers
19. Scale: Kafka at Microsoft (Ads, Bing, Office)
• Kafka brokers: 1300+ across 5 datacenters
• Operating system: Windows Server 2012 R2
• Hardware spec: 12 cores, 32 GB RAM, 4x2 TB HDD (JBOD), 10 Gb network
• Incoming events: 1.3 million per sec (112 billion per day, 500 TB per day)
• Outgoing events: 5 million per sec (~1 trillion per day, 3.5 PB per day)
• Kafka topics/partitions: 50+/5000+
• Kafka version: 0.8.1.1 (3-way replication)
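The ingress figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the 1.3 million events per second rate is sustained for a full day and that "TB" means decimal terabytes; the average event size it derives is an inference from the slide's numbers, not a figure the speakers state.

```python
# Back-of-the-envelope check of the ingress numbers quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

ingress_eps = 1.3e6              # events per second (from the slide)
ingress_bytes_per_day = 500e12   # 500 TB per day (assumes decimal terabytes)

# Events per day if the quoted rate held for 24 hours.
events_per_day = ingress_eps * SECONDS_PER_DAY

# Implied average event size; derived here, not stated in the talk.
avg_event_bytes = ingress_bytes_per_day / events_per_day

print(f"{events_per_day / 1e9:.1f} billion events per day")
print(f"{avg_event_bytes / 1e3:.2f} KB average event size")
```

The first result agrees with the slide's 112 billion events per day, and the implied average event is a few kilobytes.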
20. Siphon Architecture
[Diagram: Siphon architecture. Labels recovered from the figure:]
• Data centers: Asia DC, Europe DC, US DC, each with Zookeeper, Kafka and a Canary
• Ingestion components: Collector, Agent, Services Data Pull (Agent), Services Data Push, Device Proxy Services
• Consumption components: Consumer API (Push/Pull), Streaming, Batch
• Audit Trail
• Legend: Open Source, Microsoft Internal, Siphon
21. Multiple sources and schemas
[Diagram: CSV, XML and JSON sources wrapped in the Siphon Bond schema]
Siphon Bond Schema
• PartA (Main Header): MessageId, AuditId, TimeStamp
• PartB (Extended Header): Key-Value[]
• PartC (Payload)
Bond (https://github.com/Microsoft/bond)
• Cross-platform framework for working with schematized data.
• Cross-language (de)serialization.
• Similar to Protobuf, Thrift and Avro.
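To make the three-part envelope concrete, here is a minimal sketch in plain Python dataclasses. It is not Bond and not the actual Siphon schema: the field types, the `wrap` helper, and the sample values are assumptions for illustration. Only the part names (PartA, PartB, PartC) and the header fields (MessageId, AuditId, TimeStamp, Key-Value[]) come from the slide.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MainHeader:
    """PartA: fields named on the slide; the types are assumed."""
    message_id: str
    audit_id: str
    timestamp: float


@dataclass
class Envelope:
    """Hypothetical stand-in for the Siphon Bond schema."""
    main_header: MainHeader                                         # PartA
    extended_header: Dict[str, str] = field(default_factory=dict)   # PartB: key-value pairs
    payload: bytes = b""                                            # PartC: opaque CSV/XML/JSON bytes


def wrap(payload: bytes, audit_id: str, **extended: str) -> Envelope:
    """Wrap a source record in the envelope without parsing the payload."""
    header = MainHeader(
        message_id=str(uuid.uuid4()),
        audit_id=audit_id,
        timestamp=time.time(),
    )
    return Envelope(main_header=header, extended_header=dict(extended), payload=payload)


# A JSON record travels as opaque bytes; its format is noted in the extended header.
event = wrap(json.dumps({"clicks": 3}).encode("utf-8"), audit_id="batch-0001", format="JSON")
```

Keeping the payload opaque is what lets one envelope carry CSV, XML and JSON sources alike; only the headers need a shared schema.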
22. Collector – Data Ingestion (Producer)
• HTTP(S) server
• RESTful API with SSL support.
• Abstraction from Kafka internals (partition, Kafka version)
• Throttling, QPS monitoring
• PII scrubbing
• Load balancing/failover to multiple DCs
• Supported on both Windows and Linux servers.
[Diagram: Device Proxy Services, Services Data Push, and Agent (Services Data Pull) connect through a LoadBalancer to multiple Collectors, which write to four Kafka Brokers holding partitions P0 to P11. Legend: Open Source, Microsoft Internal, Siphon.]
URL : http://localhost/produce/<version>?topic=<topic>
Method : POST
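As an illustration of the endpoint shape above, the following sketch builds (but does not send) a produce request with Python's standard library. The version string `v1`, the topic name, and the JSON body are hypothetical placeholders; the slide does not specify the request body format or any headers.

```python
import urllib.parse
import urllib.request


def build_produce_request(topic: str, body: bytes, version: str = "v1",
                          base_url: str = "http://localhost") -> urllib.request.Request:
    """Build a POST to /produce/<version>?topic=<topic>.

    "v1" is a placeholder; the real version values are not given in the talk.
    """
    query = urllib.parse.urlencode({"topic": topic})
    url = f"{base_url}/produce/{version}?{query}"
    return urllib.request.Request(url, data=body, method="POST")


# Hypothetical topic and payload, for illustration only.
request = build_produce_request("example-topic", b'{"clicks": 3}')

# Sending would be: urllib.request.urlopen(request)
# It is left out here because it needs a running Collector.
```

Note that the caller names only a topic. Partition selection and the Kafka protocol version stay behind the Collector, which is the abstraction the slide describes.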
23. Pull & Push Consumers
[Diagram: in Virtual Network A, an HLC pulls directly from four Kafka Brokers holding partitions P0 to P11; consumers in Virtual Network B pull through a REST API. Collectors are shown on the ingest side.]
Pull
• RESTful API with SSL support
• Works for out-of-network consumers
• Supports metadata and data operations
• Implements simple consumer APIs
• Spark streaming receiver for Kafka REST
Push
• Configurable push to destinations like HDFS, Cosmos, Kafka.
• Utilizes KafkaNet - .NET High Level Consumer (https://github.com/Microsoft/Kafkanet)
25. Data completeness – Audit Trail
[Diagram: Device Proxy Services, Services Data Push, and Agent (Services Data Pull) connect through a LoadBalancer to multiple Collectors, four Kafka Brokers holding partitions P0 to P11, and a High Level Consumer, with an Audit Trail alongside the pipeline.]
• Audit Trail
• Auditing support: sampled vs full
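One way to think about data completeness is to compare, per audit ID, how many events were published with how many were consumed. The sketch below is an assumption about how such a check could work, not a description of Siphon's audit implementation; the function name and the sample audit IDs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def completeness(produced: Iterable[str], consumed: Iterable[str]) -> Dict[str, float]:
    """Return consumed/produced per audit ID.

    Inputs are the audit IDs seen at each end of the pipeline.
    A ratio below 1.0 suggests loss; above 1.0 suggests duplicates.
    """
    sent = Counter(produced)
    received = Counter(consumed)
    return {audit_id: received[audit_id] / count for audit_id, count in sent.items()}


# Hypothetical audit IDs: four events in batch "a" were produced, three arrived.
produced_ids = ["a", "a", "a", "a", "b", "b"]
consumed_ids = ["a", "a", "a", "b", "b"]
report = completeness(produced_ids, consumed_ids)
```

A sampled audit would run the same comparison over a subset of audit IDs, trading coverage for lower overhead.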
26. Production Experience – Telemetry Charts
• Monitoring using ELK
• E2E Latency
• Data Completeness
• Processing Lag
• EPS breakdown by data center.
27. Key Takeaways
• Scale out with Kafka (50K -> 1M -> multi-million events per second)
• Ability to build tunable auditing/monitoring
• Producer/consumer RESTful API provides a nice abstraction
• Config-driven pub/sub system