Seattle Kafka Meetup, Nov 2015: Siphon
1. Siphon - Kafka as a DataBus at Microsoft
Nitin Kumar (nitin.kumar@Microsoft.com)
Dev Manager, Microsoft
https://www.linkedin.com/in/nikuma
2. Agenda
• Scale: Kafka at Microsoft (Bing, Ads, Office)
• Use Case: Near-real-time (NRT) customer-facing reports
• Kafka based Streaming Solution
• Collector
• Consumer RESTful APIs
• Monitoring: Canary/Audit Trail
• Production Experience
• Key Takeaways
3. Scale: Kafka at Microsoft (Ads, Bing, Office)
Kafka brokers: 1000+ across 5 datacenters
Operating system: Windows Server 2012 R2
Hardware spec: 12 cores, 32 GB RAM, 4 x 2 TB HDD (JBOD), 10 Gbps network
Incoming events: 1 million/sec (90 billion/day, 100 TB/day)
Outgoing events: 5 million/sec (1 trillion/day, 500 TB/day)
Kafka topics/partitions: 50+ / 5000+
Kafka version: 0.8.1.1 (3-way replication)
4. Problem
[Diagram: the existing batch pipeline. Logs from the serving system pass through log collection and sorting/partitioning (~1.5 hours) into feature extraction (200+ features, ~300 GB/h), then ML classification and online fraud detection, and finally aggregation into the reporting DB (~2.5 hours more); keyword/advertiser stats total ~25 TB.]
What is the click-through rate of my ad that launched at 5 pm?
5. Goals / Design Considerations
• Reduce latency from 4 hours to 15 minutes
• 99.8% log-completeness guarantee
• Checkpointing and failure recovery
• Exactly-once semantics
• Highly available, scalable, with rolling upgrades
• Reuse of existing C# libraries
6. Siphon DataBus
Solution
[Diagram: the pipeline rebuilt on Siphon. The serving system publishes logs into Kafka at ~100 MB/s with 1-2 second ingestion latency; StreamScope handles feature extraction (200+ features), ML classification, online fraud detection, and aggregation into the reporting DB, bringing end-to-end keyword/advertiser latency under 15 minutes (~25 TB of stats). Kafka Audit tracks completeness.]
• Kafka as a distributed queue
• StreamScope as a distributed processing system
7. Siphon
[Diagram: Siphon spans three datacenters (Asia, Europe, US), each running a Kafka cluster with ZooKeeper and a Canary. Data enters through Collectors via services pushing data, agent-based pulls, and device proxy services; a Consumer API (push/pull) feeds both streaming and batch consumers, with an Audit Trail across the pipeline. Kafka and ZooKeeper are open source; the surrounding Siphon components are Microsoft-internal.]
8. Collector – Data Ingestion (Producers)
• HTTP(S) server
• RESTful API with SSL support (see the producer sketch below)
• Abstraction over Kafka internals (partitioning, Kafka version)
• Throttling, QPS monitoring
• PII scrubbing
• Load balancing/failover
[Diagram: producers (services pushing data, agents pulling data, device proxy services) send events through a load balancer to a pool of Collectors, which write to four Kafka brokers hosting partitions P0-P11.]
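Below is a minimal sketch, in C# (the deck mentions reusing existing C# libraries), of what producing through a Collector-style REST endpoint could look like. The URL, topic route, and payload shape are all hypothetical; the deck states only that the Collector exposes an HTTP(S) RESTful API that hides Kafka partitioning and versioning from clients.

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ProducerSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // A batch of events serialized as JSON. The Collector, not the
        // client, decides which Kafka partition each event lands on.
        string batch = "[{\"adId\":42,\"event\":\"click\"}]";

        // Hypothetical Collector endpoint and topic route.
        var response = await http.PostAsync(
            "https://collector.contoso.example/v1/topics/ad-clicks",
            new StringContent(batch, Encoding.UTF8, "application/json"));

        // The Collector throttles and monitors QPS; a failure status here
        // (e.g. 429) would mean the client should back off and retry.
        response.EnsureSuccessStatusCode();
    }
}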
9. Consumer API (Push/Pull)
• RESTful pull API – simple consumer (see the sketch below)
• Config-driven subscriptions for preconfigured sinks (HDFS, Cosmos, ELK)
[Diagram: an Executor reads subscription config from ZooKeeper and consumes from Kafka via the Kafka .NET library, pushing events to the configured sink.]
Supported destinations:
• Cosmos
• Elasticsearch
• Kafka
• HDFS
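A minimal sketch of what a client of the pull API might look like. The endpoint shape, query parameters, and offset handling are assumptions; the deck says only that a RESTful pull API wraps a simple consumer and hides Kafka internals from the caller.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class PullConsumerSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // Fetch up to 500 events starting at a client-tracked offset; the
        // service maps this onto Kafka partitions and broker addresses.
        long offset = 0; // a real client would checkpoint this durably
        string url =
            $"https://consumer.contoso.example/v1/topics/ad-clicks?offset={offset}&max=500"; // hypothetical
        string batch = await http.GetStringAsync(url);

        // Process the batch, then checkpoint the new offset before the
        // next poll so a restart resumes without data loss.
        Console.WriteLine(batch);
    }
}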
10. Monitoring using Canary and Audit Trail
[Diagram: the same ingestion topology as slide 8 (producers → load balancer → Collectors → Kafka brokers with partitions P0-P11). A Canary injects a synthetic message at the head of the pipeline, a high-level consumer reads it back at the tail, and the Audit Trail reconciles produced vs. consumed counts.]
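A minimal sketch of a canary probe, reusing the hypothetical endpoints from the earlier sketches: inject a uniquely tagged synthetic message at the head of the pipeline, poll the consumer end until it appears, and measure the round trip.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class CanarySketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        string marker = Guid.NewGuid().ToString("N"); // unique canary id
        var clock = Stopwatch.StartNew();

        // 1. Inject the synthetic message through the normal ingestion path.
        await http.PostAsync(
            "https://collector.contoso.example/v1/topics/canary", // hypothetical
            new StringContent($"[{{\"canaryId\":\"{marker}\"}}]",
                              Encoding.UTF8, "application/json"));

        // 2. Poll the consumer API until the marker shows up end to end.
        while (true)
        {
            string body = await http.GetStringAsync(
                "https://consumer.contoso.example/v1/topics/canary?max=500"); // hypothetical
            if (body.Contains(marker))
                break;
            await Task.Delay(TimeSpan.FromSeconds(1));
        }

        // An alert would fire if this exceeded the pipeline's latency SLA.
        Console.WriteLine($"Canary round trip: {clock.Elapsed}");
    }
}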
11. Production Experience
• System in production for 15 months
• End-to-end advertiser report latency of 12+ minutes
• Other use cases from Office and Bing
• Integration with other streaming systems: Storm, Spark
• Monitoring using ELK
12. Key Takeaways
• Scale out with Kafka (50K -> 1M -> multi-million events per second)
• Ability to build tunable auditing/monitoring
• Producer/consumer RESTful APIs provide a clean abstraction
• Config-driven pub/sub system
13. “We are Hiring.”
Thank You
Nitin Kumar (nitin.kumar@Microsoft.com)
https://www.linkedin.com/in/nikuma
Editor's Notes
Client Agent generates a unique BatchId for every batch and appends it to the Extended Header.
Client Agent sends a “produced” audit message (BatchId, DateTime, Number of Records) to the audit system.
Each consumer, upon receiving a batch, deserializes the header to extract the BatchId and sends a “consumed” audit message to the audit system.
The audit system compares produced vs. consumed audits every 5 minutes and raises alerts on mismatch.
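The notes above describe the audit protocol concretely enough for a sketch: producers and consumers each report (BatchId, time, record count), and a periodic job reconciles the two sides. A minimal sketch follows; all type and member names are hypothetical.

using System;
using System.Collections.Concurrent;

// One audit report: which batch, when, and how many records.
record AuditMessage(string BatchId, DateTime Time, long RecordCount);

class AuditStore
{
    readonly ConcurrentDictionary<string, long> produced = new();
    readonly ConcurrentDictionary<string, long> consumed = new();

    // Producers call this when a batch is sent.
    public void ReportProduced(AuditMessage m) =>
        produced.AddOrUpdate(m.BatchId, m.RecordCount, (_, n) => n + m.RecordCount);

    // Consumers call this after extracting the BatchId from the header.
    public void ReportConsumed(AuditMessage m) =>
        consumed.AddOrUpdate(m.BatchId, m.RecordCount, (_, n) => n + m.RecordCount);

    // Run every 5 minutes: any batch whose consumed count lags its
    // produced count signals data loss (or delay) and raises an alert.
    public void Reconcile()
    {
        foreach (var (batchId, sent) in produced)
        {
            consumed.TryGetValue(batchId, out long received);
            if (received < sent)
                Console.WriteLine($"ALERT: batch {batchId} produced={sent} consumed={received}");
        }
    }
}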