#TwitterRealTime - Real time processing @twitter

Who: Karthik Ramasamy (@karthikz)
Date: September 20, 2016
Event: #TwitterRealTime

This slide deck consists of presentations from various teams about Twitter's real time infrastructure, the components it uses, and how they function. It includes presentations from David Rusek (@davidrusek), Maosong Fu (@Louis_Fumaosong), Sandy Strong (@st5are), and Yimin Tan (@YiminTan_Kevin).



  1. 1. TWITTER IS REAL TIME
  2. 2. WHAT IS REAL TIME?
  3. 3. REAL TIME PIPELINE
  4. 4. REAL TIME COMPONENTS
  5. 5. REAL TIME USE CASES ETL BI PRODUCT SAFETY TRENDS ML MEDIA OPS ADS
  6. 6. 20 PB 2 Trillion Events/Day 100 ms e2e latency 400 Real Time Jobs DLOG & HERON are Open Sourced
  7. 7. WE ARE HIRING! Messaging Data Infrastructure Core Services Search Infrastructure Traffic Real Time Compute Compute Platform Platform Engineering Kernel #LoveWhereYouWork Learn more at careers.twitter.com Hadoop Core Data Libraries Data Applications Core Metrics
  8. 8. - Easy operations - Small technology portfolio - Quick development iteration - Diverse use cases
  9. 9. Bookkeeper Write Proxy Read Proxy client client
  10. 10. Bookkeeper Write Proxy Read Proxy Publisher Subscriber Read Write DistributedLog Metadata Self Serve
  11. 11. 20 PB 2 Trillion Events 100 ms e2e latency
  12. 12. - Event A discrete, self-contained piece of data - Stream A persistent, unordered collection of events with a time - Partition A portion of a stream with a proportional amount of the overall capacity - Subscriber A collection of processes collectively consuming a copy of the stream
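For orientation, these four concepts can be written down as plain types. The sketch below is illustrative only; the names and fields are assumptions, not EventBus's or DistributedLog's actual classes.

```scala
// Illustrative data model only; names and fields are assumptions, not the real EventBus/DistributedLog types.
final case class Event(payload: Array[Byte], timestamp: Long)        // a discrete, self-contained piece of data
final case class Partition(id: Int)                                   // a slice of a stream's overall capacity
final case class Stream(name: String, partitions: Seq[Partition])     // a persistent collection of events
final case class Subscriber(group: String, processes: Seq[String])    // processes jointly consuming one copy of the stream
```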
  13. 13. Bookkeeper Write Proxy Read Proxy Publisher Subscriber Read Write DistributedLog Metadata Self Serve
  14. 14. Flow Control Stream Configuration Partition Ownership DistributedLog (E => Future[Unit]) Offset Tracking Offset Store Metadata DL Read Proxy
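The read path on slide 14 is built around a per-event callback of type E => Future[Unit]: the application returns a future for each event, and offsets are tracked and committed as those futures complete. Below is a minimal sketch of that contract in Scala, with a hypothetical Record type, process callback, and commitOffset hook; it is not the actual DistributedLog client API.

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Hypothetical event type; not the real DistributedLog/EventBus record class.
case class Record(offset: Long, payload: Array[Byte])

// The application supplies a callback of type Record => Future[Unit].
def process(r: Record): Future[Unit] =
  Future(println(s"handled event at offset ${r.offset}"))

// Simplified consume loop: process events one at a time and commit each offset
// to an (assumed) offset store only after its future completes.
def consume(events: List[Record], commitOffset: Long => Unit): Future[Unit] =
  events.foldLeft(Future.unit) { (acc, r) =>
    acc.flatMap(_ => process(r)).map(_ => commitOffset(r.offset))
  }
```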
  15. 15. @DistributedLog http://distributedlog.io Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming> Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar
  16. 16. STORM/HERON TERMINOLOGY - TOPOLOGY Directed acyclic graph Vertices=computation, and edges=streams of data tuples - SPOUTS Sources of data tuples for the topology Examples - Kafka/Distributed Log/MySQL/Postgres - BOLTS Process incoming tuples and emit outgoing tuples Examples - filtering/aggregation/join/arbitrary function
  17. 17. STORM/HERON TOPOLOGY BOLT 1 BOLT 2 BOLT 3 BOLT 4 BOLT 5 SPOUT 1 SPOUT 2
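Slides 16 and 17 describe a topology as a DAG whose vertices are spouts and bolts and whose edges are streams of tuples. The sketch below wires such a DAG with the Storm TopologyBuilder API, which Heron also accepts; the spout and bolt implementations are assumed to exist elsewhere and are not shown.

```scala
import org.apache.storm.generated.StormTopology
import org.apache.storm.topology.{IRichBolt, IRichSpout, TopologyBuilder}
import org.apache.storm.tuple.Fields

// Wires a two-stage word-count style DAG: spout -> split bolt -> count bolt.
def buildTopology(spout: IRichSpout, split: IRichBolt, count: IRichBolt): StormTopology = {
  val builder = new TopologyBuilder()
  builder.setSpout("source", spout, 2)                           // vertex: source of data tuples
  builder.setBolt("split", split, 4).shuffleGrouping("source")   // edge: source -> split (random shuffle)
  builder.setBolt("count", count, 4)
    .fieldsGrouping("split", new Fields("word"))                 // edge: split -> count, keyed by the "word" field
  builder.createTopology()
}
```

The groupings encode the edges: shuffleGrouping spreads tuples randomly across downstream instances, while fieldsGrouping routes tuples with the same key to the same instance.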
  18. 18. WHY HERON? ● SCALABILITY and PERFORMANCE PREDICTABILITY ● IMPROVE DEVELOPER PRODUCTIVITY ● EASE OF MANAGEABILITY
  19. 19. TOPOLOGY ARCHITECTURE Topology Master ZK CLUSTER Stream Manager I1 I2 I3 I4 Stream Manager I1 I2 I3 I4 Logical Plan, Physical Plan and Execution State Sync Physical Plan CONTAINER CONTAINER Metrics Manager Metrics Manager
  20. 20. HERON ARCHITECTURE Topology 1 TOPOLOGY SUBMISSION Scheduler Topology 2 Topology 3 Topology N
  21. 21. HERON SAMPLE TOPOLOGIES
  22. 22. Large amount of data produced every day Large cluster Several hundred topologies deployed Several million messages every second HERON @TWITTER 1 stage 10 stages 3x reduction in cores and memory Heron has been in production for 2 years
  23. 23. STRAGGLERS Stragglers are the norm in multi-tenant distributed systems ● BAD/SLOW HOST ● EXECUTION SKEW ● INADEQUATE PROVISIONING
  24. 24. APPROACHES TO HANDLE STRAGGLERS ● SENDERS TO STRAGGLER DROP DATA ● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER ● DETECT STRAGGLERS AND RESCHEDULE THEM
  25. 25. S1 B2 B3 SLOW DOWN SENDERS STRATEGY Stream Manager Stream Manager Stream Manager Stream Manager S1 B2 B3 B4 S1 B2 B3 S1 B2 B3 B4 B4 S1 S1 S1S1
  26. 26. BACK PRESSURE IN PRACTICE ● IN MOST SCENARIOS BACK PRESSURE RECOVERS Without any manual intervention ● SOMETIMES USERS PREFER DROPPING DATA Care about only the latest data ● SUSTAINED BACK PRESSURE Irrecoverable GC cycles Bad or faulty host
  27. 27. ENVIRONMENTS SUPPORTED STORM API PRE-1.0.0 POST 1.0.0 SUMMINGBIRD FOR HERON
  28. 28. CURIOUS TO LEARN MORE…
  29. 29. INTERESTED IN HERON? CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron http://heronstreaming.io HERON IS OPEN SOURCED FOLLOW US @HERONSTREAMING
  30. 30. ● 100K+ Advertisers, $2B+ revenue/year ● 300M+ Users ● Impressions/Engagements ○ Tens of billions of events daily
  31. 31. Use Heron & EventBus: ● Prediction ● Serving ● Analytics
  32. 32. ● Online learning: models require real-time data ○ On-going training for existing ads ■ CTR, conversions, RTs, Likes ○ On-going training for user data ■ Interests change, targeting must stay relevant ○ New ads arrive constantly ● Consumes 150 GB/second from EventBus streams
  33. 33. Ad Server ● Reads Prediction models ● Finalizes Ad selection ● Writes 56GB/second to EventBus ○ Served impressions ○ Spend events Callback Service ● Receives engagements from clients ● Writes engagements to EventBus ○ Consumed by Prediction and Analytics
  34. 34. Advertiser Dashboard keeps advertisers informed in real-time For Ads: ● Impressions ● Engagements ● Spend rate ● Uniques For Users: ● Geolocation ● Gender ● Age ● Followers ● Keywords ● Interests
  35. 35. Offline layer (hours) ● Engagement log ● Billing pipeline ● 14TB/hour Online layer (seconds) ● Heron topologies read 1M events/sec From EventBus, provide real-time analytics Advertiser Dashboard ● Ad-hoc queries for desired time range ● View performance of ads in real-time http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
  36. 36. (~6 hrs)
  37. 37. #RealTime processing helps us scale our Ads business: ● Prediction - Online learning ○ Ads ○ Users ● Analytics - Advertisers get real-time visibility into ad performance This enables us to provide high ROI for Advertisers. Image Credits: http://images.clipartpanda.com/cycle-clipart-bike_red.png http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
  38. 38. Observation ● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter ● Spam campaigns come in large batches ● Despite randomized tweaks, enough similarity among spammy entities is preserved Requirement ● Real-time: a competition game with spammers, i.e. “detect” vs “mutate” ● Generic: need to support all common feature representations
  39. 39. Crest is a generic online similarity clustering system ● Inputs are a stream of entities ● The clustering system groups similar entities together (according to a predefined similarity metric) ● Outputs are the clusters and their entity members “Built on top of Heron” https://github.com/twitter/heron
  40. 40. ● Locality-sensitive hashing: a probabilistic, similarity-preserving random projection method Entity1 => hashValue1 (010010001110010100101001000011) Entity2 => hashValue2 (000111001110010101100110100100) Sim(Entity1, Entity2) ~ Sim(hash1, hash2) ● No pair-wise similarity calculation; similarity match is based on “signature bands”
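A compact way to see the random-projection idea: each signature bit records the sign of the entity's dot product with a random hyperplane, so similar feature vectors agree on most bits. The sketch below is a generic illustration in Scala, not Crest's actual code.

```scala
import scala.util.Random

// Generic random-hyperplane LSH sketch; not Crest's implementation.
def randomHyperplanes(numBits: Int, dim: Int, seed: Long = 42L): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(numBits, dim)(rng.nextGaussian())
}

// One bit per hyperplane: 1 if the feature vector lies on its positive side, else 0.
def signature(features: Array[Double], hyperplanes: Array[Array[Double]]): Vector[Int] =
  hyperplanes.toVector.map { h =>
    val dot = features.zip(h).map { case (f, w) => f * w }.sum
    if (dot >= 0) 1 else 0
  }

// Fraction of matching bits approximates the angular similarity of the original vectors.
def sim(a: Vector[Int], b: Vector[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
```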
  41. 41. Similarity match is based on “signature band” collision. Cut signatures into bands: 01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band) Two entities become similarity candidates if they collide on at least one band (i.e. all signatures within some band must match).
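Following the slide's numbers (30 signature bits cut into 6 bands of 5), banding can be sketched as below; two signatures are candidates when at least one band matches exactly. A generic illustration, not Crest's code.

```scala
// Split a signature into fixed-size bands, e.g. 30 bits -> 6 bands of 5 bits.
def bands(sig: Vector[Int], numBands: Int = 6): Vector[Vector[Int]] =
  sig.grouped(sig.length / numBands).toVector

// Two entities are similarity candidates if any band matches exactly.
def collide(a: Vector[Int], b: Vector[Int], numBands: Int = 6): Boolean =
  bands(a, numBands).zip(bands(b, numBands)).exists { case (ba, bb) => ba == bb }
```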
  42. 42. 1. Given entity features, calculate signatures and cut them into bands 2. Match against all existing clusters in the cluster store that collide on at least one band 3. Find the closest cluster Incoming Entity: 01001 00011 10010 10010 10010 00011 Known Cluster1: 01011 00011 01010 10111 11110 10011 Known Cluster2: 01101 01011 01000 10010 10010 01111
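Step 2 avoids pair-wise comparison by indexing clusters under (band position, band bits), so only clusters that share some band with the incoming entity are fetched. A generic sketch of that lookup; the cluster-store shape and names are assumptions.

```scala
// Key each band by its position, so band 0 of one signature never matches band 3 of another.
def bandKeys(sig: Vector[Int], numBands: Int = 6): Seq[(Int, Vector[Int])] =
  sig.grouped(sig.length / numBands).toVector.zipWithIndex.map { case (b, i) => (i, b) }

// index: (band position, band bits) -> ids of clusters whose representative signature has those bits.
def candidateClusters(sig: Vector[Int],
                      index: Map[(Int, Vector[Int]), Set[String]]): Set[String] =
  bandKeys(sig).flatMap(k => index.getOrElse(k, Set.empty[String])).toSet
```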
  43. 43. 1. Count occurrences of each band signature 2. Use Count-Min Sketch to find the hot signatures 3. Send entities with hot signatures for clustering
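A Count-Min Sketch keeps a few rows of hashed counters; the minimum over rows never under-counts, which makes it a cheap way to flag hot band signatures. A minimal, generic implementation for illustration, not Crest's code.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: `estimate` may over-count but never under-counts,
// so it can safely gate a hotness threshold.
class CountMinSketch(width: Int = 1024, depth: Int = 4) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, row), width)

  def add(item: String): Unit =
    (0 until depth).foreach(r => table(r)(bucket(item, r)) += 1)

  def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(bucket(item, r))).min
}

// Usage: only forward entities whose band signature is already "hot".
val cms = new CountMinSketch()
def isHot(bandSignature: String, threshold: Long): Boolean = {
  cms.add(bandSignature)
  cms.estimate(bandSignature) >= threshold
}
```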
  44. 44. 1. Group entities by band signatures 2. Run an in-memory clustering algorithm when the group is big enough 3. Save the cluster in the cluster key-value store
  45. 45. 1. Real-time: streamlined data processing flow 2. Scalability: flexible grouping and shuffling (Application / Signature) 3. Maintenance: separate bolts for system optimizations (memory, GC, CPU, etc.)
  46. 46. ● Crest: a similarity clustering system based on locality-sensitive hashing ● Detects spam in real time, built on top of a Heron topology ● Generic interface, clustering “everything” happening in Twitter
