TWITTER IS REAL TIME
WHAT IS REAL TIME?
REAL TIME PIPELINE
REAL TIME COMPONENTS
REAL TIME USE CASES
ETL BI
PRODUCT
SAFETY
TRENDS
ML MEDIA OPS ADS
20 PB
2 Trillion
Events/Day
100 ms
e2e
latency
400 Real
Time Jobs
DLOG &
HERON are
Open
Sourced
WE ARE HIRING!
Messaging
Data Infrastructure
Core Services
Search Infrastructure
Traffic
Real Time Compute
Compute Platform
Platform Engineering
Kernel
#LoveWhereYouWork
Learn more at careers.twitter.com
Hadoop
Core Data Libraries
Data Applications
Core Metrics
- Easy operations
- Small technology portfolio
- Quick development Iteration
- Diverse use cases
Bookkeeper
Write
Proxy
Read
Proxy
client
client
Bookkeeper
Write
Proxy
Read
Proxy
Publisher Subscriber
Read Write
DistributedLog
Metadata
Self Serve
20 PB
2 Trillion Events
100 ms
e2e
latency
- Event
A discrete, self-contained, piece of data
- Stream
A persistent, unordered collection of events with a time
- Partition
A portion of a stream with a proportional share of the overall capacity
- Subscriber
A collection of processes collectively consuming a copy of the stream
Flow Control
Stream
Configuration
Partition
Ownership
DistributedLog
(E => Future[Unit])
Offset
Tracking
Offset
Store
Metadata
DL Read
Proxy
@DistributedLog
http://distributedlog.io
Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny
<@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar
<@mahakp>, Philip Su <@philipsu522>, Yiming Zang
<@zang_yiming>
Messaging Alumni: David Helder, Aniruddha Laud, Robin
Dhamankar
STORM/HERON TERMINOLOGY
- TOPOLOGY
Directed acyclic graph
Vertices=computation, and edges=streams of data tuples
- SPOUTS
Sources of data tuples for the topology
Examples - Kafka/Distributed Log/MySQL/Postgres
- BOLTS
Process incoming tuples and emit outgoing tuples
Examples - filtering/aggregation/join/arbitrary function
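The terminology above can be illustrated with a toy, in-process model. This is a minimal Python sketch and not the Storm or Heron API: `word_spout`, `split_bolt`, `filter_bolt`, and `run_topology` are hypothetical names, and a real topology runs spouts and bolts as parallel instances rather than as nested function calls.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List

# A bolt takes one incoming tuple and emits zero or more outgoing tuples.
Bolt = Callable[[tuple], Iterable[tuple]]


def word_spout() -> Iterator[tuple]:
    """Hypothetical spout: a source of data tuples for the topology."""
    for line in ["heron is real time", "storm is real time too"]:
        yield (line,)


def split_bolt(tup: tuple) -> Iterable[tuple]:
    """Arbitrary-function bolt: one line in, one tuple per word out."""
    for word in tup[0].split():
        yield (word,)


def filter_bolt(tup: tuple) -> Iterable[tuple]:
    """Filtering bolt: drop words of two characters or fewer."""
    if len(tup[0]) > 2:
        yield tup


def run_topology(spout: Callable[[], Iterator[tuple]],
                 edges: Dict[str, List[str]],
                 bolts: Dict[str, Bolt],
                 sink: str) -> Counter:
    """Push every spout tuple along the DAG edges; aggregate counts at the sink."""
    counts: Counter = Counter()

    def emit(node: str, tup: tuple) -> None:
        if node == sink:
            counts[tup[0]] += 1  # aggregation step
            return
        for out in bolts[node](tup):
            for nxt in edges[node]:
                emit(nxt, out)

    for tup in spout():
        for first in edges["spout"]:
            emit(first, tup)
    return counts


# Vertices are computations, edges are streams of tuples.
edges = {"spout": ["split"], "split": ["filter"], "filter": ["count"]}
bolts = {"split": split_bolt, "filter": filter_bolt}
word_counts = run_topology(word_spout, edges, bolts, sink="count")
```

Here the spout feeds a split bolt, a filter bolt, and a counting sink, mirroring the filtering and aggregation examples listed above.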
STORM/HERON TOPOLOGY
BOLT 1
BOLT 2
BOLT 3
BOLT 4
BOLT 5
SPOUT 1
SPOUT 2
WHY HERON?
● SCALABILITY and PERFORMANCE PREDICTABILITY
● IMPROVE DEVELOPER PRODUCTIVITY
● EASE OF MANAGEABILITY
TOPOLOGY ARCHITECTURE
Topology
Master
ZK
CLUSTER
Stream
Manager
I1 I2 I3 I4
Stream
Manager
I1 I2 I3 I4
Logical Plan,
Physical Plan and
Execution State
Sync Physical Plan
CONTAINER CONTAINER
Metrics
Manager
Metrics
Manager
HERON ARCHITECTURE
Topology 1
TOPOLOGY
SUBMISSION
Scheduler
Topology 2
Topology 3
Topology N
HERON SAMPLE TOPOLOGIES
Large amount of data
produced every day
Large cluster Several hundred
topologies deployed
Several million
messages every second
HERON @TWITTER
1 stage 10 stages
3x reduction in cores and memory
Heron has been in production for 2 years
STRAGGLERS
Stragglers are the norm in multi-tenant distributed systems
● BAD/SLOW HOST
● EXECUTION SKEW
● INADEQUATE PROVISIONING
APPROACHES TO HANDLE STRAGGLERS

● SENDERS TO STRAGGLER DROP DATA
● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
● DETECT STRAGGLERS AND RESCHEDULE THEM
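The first two approaches can be compared with a small discrete-time simulation. This is a sketch under stated assumptions, not Heron's stream manager: the function name `simulate`, the rates, and the buffer capacity are all made up for illustration, and the third approach (detecting and rescheduling stragglers) is not modeled.

```python
from collections import deque


def simulate(strategy: str, ticks: int = 100, send_rate: int = 5,
             drain_rate: int = 2, capacity: int = 10) -> dict:
    """Simulate one sender feeding a slow receiver through a bounded buffer.

    strategy "drop": tuples that do not fit in the buffer are discarded.
    strategy "slow_down": the sender emits only what the buffer can accept.
    """
    buffer: deque = deque()
    sent = dropped = processed = 0
    for _ in range(ticks):
        free = capacity - len(buffer)
        if strategy == "slow_down":
            emit = min(send_rate, free)      # sender throttled to the straggler's pace
        else:
            emit = send_rate                 # sender ignores the straggler
            dropped += max(0, emit - free)   # overflow is lost
        accepted = min(emit, free)
        buffer.extend(range(accepted))
        sent += emit
        # The straggler drains the buffer at its own, slower rate.
        for _ in range(min(drain_rate, len(buffer))):
            buffer.popleft()
            processed += 1
    return {"sent": sent, "dropped": dropped,
            "processed": processed, "buffered": len(buffer)}


drop_stats = simulate("drop")
slow_stats = simulate("slow_down")
```

In the "slow_down" run nothing is lost, but the sender's throughput falls to the straggler's drain rate. In the "drop" run the sender keeps its full rate and the overflow is discarded.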
S1 B2
B3
SLOW DOWN SENDERS STRATEGY
Stream
Manager
Stream
Manager
Stream
Manager
Stream
Manager
S1 B2
B3 B4
S1 B2
B3
S1 B2
B3 B4
B4
S1 S1
S1S1
BACK PRESSURE IN PRACTICE
● IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
● SOMETIMES USERS PREFER DROPPING DATA
Care about only latest data
● SUSTAINED BACK PRESSURE
Irrecoverable GC cycles
Bad or faulty host
ENVIRONMENTS SUPPORTED
STORM API
PRE- 1.0.0
POST 1.0.0

SUMMINGBIRD FOR HERON
CURIOUS TO LEARN MORE…
INTERESTED IN HERON?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
● 100K+ Advertisers, $2B+ revenue/year
● 300M+ Users
● Impressions/Engagements
○ Tens of billions of events daily
Use Heron & EventBus:
● Prediction
● Serving
● Analytics
● Online learning: models require real-time data
○ On-going training for existing ads
■ CTR, conversions, RTs, Likes
○ On-going training for user data
■ Interests change, targeting must stay relevant
○ New ads arrive constantly
● Consumes 150 GB/second from EventBus streams
Ad Server
● Reads Prediction models
● Finalizes Ad selection
● Writes 56GB/second to EventBus
○ Served impressions
○ Spend events
Callback Service
● Receives engagements from clients
● Writes engagements to EventBus
○ Consumed by Prediction
and Analytics
Advertiser Dashboard keeps advertisers informed in real-time
For Ads:
● Impressions
● Engagements
● Spend rate
● Uniques
For Users:
● Geolocation
● Gender
● Age
● Followers
● Keywords
● Interests
Offline layer (hours)
● Engagement log
● Billing pipeline
● 14TB/hour
Online layer (seconds)
● Heron topologies read 1M events/sec
From EventBus, provide real-time analytics
Advertiser Dashboard
● Ad-hoc queries for desired time range
● View performance of ads in real-time
http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
(~6 hrs)
#RealTime processing helps us scale our Ads
business:
● Prediction - Online learning
○ Ads
○ Users
● Analytics - Advertisers get real-time
visibility into ad performance
This enables us to provide high ROI for
Advertisers.
Image Credits:
http://images.clipartpanda.com/cycle-clipart-bike_red.png
http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png
http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
Observation
● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter
● Spam campaigns come in large batches
● Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
● Real-time: a competition with spammers, i.e. “detect” vs “mutate”
● Generic : need to support all common feature representations
Crest is a generic online similarity clustering system
● Inputs are a stream of entities
● Similarity clustering groups similar entities together (according to a predefined
similarity metric)
● Outputs are the clusters and their member entities
“Built on top of Heron” https://github.com/twitter/heron
● Locality sensitive hashing
probabilistic similarity-preserving random projection method
Entity1 => hashValue1 (010010001110010100101001000011)
Entity2 => hashValue2 (000111001110010101100110100100)
Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
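The slides do not say which locality-sensitive hash family Crest uses. As one illustration, the sketch below assumes random-hyperplane hashing (SimHash), where each signature bit records which side of a random hyperplane the feature vector falls on, so that bit agreement between signatures tracks cosine similarity between vectors. The function names and the 30-bit length are assumptions chosen to match the example above.

```python
import random
from typing import List, Sequence


def make_hyperplanes(num_bits: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(features: Sequence[float], planes: List[List[float]]) -> str:
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in planes:
        dot = sum(f * p for f, p in zip(features, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def bit_agreement(a: str, b: str) -> float:
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


planes = make_hyperplanes(num_bits=30, dim=4)
entity = [0.9, 0.1, 0.4, 0.7]
sig_entity = signature(entity, planes)
sig_scaled = signature([2 * x for x in entity], planes)   # same direction
sig_opposite = signature([-x for x in entity], planes)    # opposite direction
```

A vector pointing in the same direction gets an identical signature, and the opposite vector disagrees on every bit. Vectors in between agree on a fraction of bits that shrinks as the angle between them grows.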
● No “Pair-wise” similarity calculation
Similarity match based on “signature band” collision
Cut signatures into bands:
01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band)
Two entities become similarity candidates if they collide on at least one band.
(i.e. need to match all signatures within some band)
1. Given entity features, calculate signatures, and cut into bands
2. Match against all existing clusters in the cluster store that collide on at least one band
3. Find the closest cluster
Incoming Entity: 01001 00011 10010 10010 10010 00011
Known Cluster1: 01011 00011 01010 10111 11110 10011
Known Cluster2: 01101 01011 01000 10010 10010 01111
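The matching steps can be traced on the three signatures above. This is a minimal sketch: the slides do not define "closest", so Hamming distance over the full signature is used here as an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def bands(signature: str, band_size: int = 5) -> List[str]:
    """Cut a signature into fixed-size bands, ignoring the spaces."""
    bits = signature.replace(" ", "")
    return [bits[i:i + band_size] for i in range(0, len(bits), band_size)]


def colliding_bands(a: str, b: str) -> List[int]:
    """Indices (0-based) of bands where every bit matches."""
    return [i for i, (x, y) in enumerate(zip(bands(a), bands(b))) if x == y]


def hamming(a: str, b: str) -> int:
    """Number of differing bits across the whole signature."""
    return sum(x != y for x, y in zip(a.replace(" ", ""), b.replace(" ", "")))


def closest_cluster(entity: str, clusters: Dict[str, str]) -> Optional[str]:
    """Among clusters colliding on at least one band, pick the nearest one."""
    candidates = {name: sig for name, sig in clusters.items()
                  if colliding_bands(entity, sig)}
    if not candidates:
        return None
    return min(candidates, key=lambda name: hamming(entity, candidates[name]))


incoming = "01001 00011 10010 10010 10010 00011"
known = {
    "Cluster1": "01011 00011 01010 10111 11110 10011",
    "Cluster2": "01101 01011 01000 10010 10010 01111",
}
best = closest_cluster(incoming, known)
```

Cluster1 collides with the incoming entity on the second band and Cluster2 on the fourth and fifth, so both are candidates. Under the Hamming assumption, Cluster2 is the closer of the two (7 differing bits versus 8).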
1. Count occurrences of each band signature
2. Use Count-Min Sketch to find the hot signatures
3. Send entities with hot signatures for clustering
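A Count-Min Sketch keeps approximate counts in fixed memory, which is why it suits counting band signatures on an unbounded stream. The sketch below is illustrative only: the width, depth, hashing scheme, threshold, and the `hot_signatures` helper are assumptions, not details from the slides.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


class CountMinSketch:
    """Fixed-size counter table; estimates never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str) -> int:
        """Increment the key in every row and return its updated estimate."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += 1
        return self.estimate(key)

    def estimate(self, key: str) -> int:
        """Minimum over rows: the least collision-inflated counter."""
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


def hot_signatures(stream: Iterable[Tuple[int, str]], threshold: int) -> Set[str]:
    """Return the (band index, band bits) keys whose estimate reaches the threshold."""
    sketch = CountMinSketch()
    hot: Set[str] = set()
    for band_index, band in stream:
        key = f"{band_index}:{band}"
        if sketch.add(key) >= threshold:
            hot.add(key)
    return hot


stream = [(0, "01001")] * 5 + [(1, "00011")] * 2
hot = hot_signatures(stream, threshold=5)
```

Because hash collisions can only inflate a counter, a signature that is truly hot is always detected. The cost is an occasional false positive, which the downstream clustering step can tolerate.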
1. Group entities by band signatures
2. Run in-memory clustering algorithm when the group is big enough
3. Save the cluster in cluster key-value store
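The grouping steps above can be sketched as a small stateful component. The slides name neither the in-memory clustering algorithm nor the key-value store, so `single_cluster` is a placeholder that treats the whole group as one cluster and a plain dict stands in for the store. `BandGrouper` and the size threshold are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def single_cluster(entity_ids: List[str]) -> List[List[str]]:
    """Placeholder clustering: the whole group becomes one cluster."""
    return [sorted(entity_ids)]


class BandGrouper:
    """Buffer entities per band signature; cluster once a group is big enough."""

    def __init__(self, min_group_size: int,
                 cluster_fn: Callable[[List[str]], List[List[str]]],
                 store: Dict[str, List[List[str]]]) -> None:
        self.min_group_size = min_group_size
        self.cluster_fn = cluster_fn
        self.store = store  # stand-in for the cluster key-value store
        self.groups: Dict[str, List[str]] = defaultdict(list)

    def add(self, entity_id: str, band_key: str) -> bool:
        """Add an entity to its group; return True if a cluster was saved."""
        group = self.groups[band_key]
        group.append(entity_id)
        if len(group) < self.min_group_size:
            return False
        self.store[band_key] = self.cluster_fn(group)
        del self.groups[band_key]  # release memory once the cluster is persisted
        return True


cluster_store: Dict[str, List[List[str]]] = {}
grouper = BandGrouper(min_group_size=3, cluster_fn=single_cluster,
                      store=cluster_store)
for entity_id, band_key in [("e1", "3:10010"), ("e2", "3:10010"),
                            ("e3", "0:01001"), ("e4", "3:10010")]:
    grouper.add(entity_id, band_key)
```

After these four entities, the group for band key `3:10010` has reached the threshold and been saved, while `e3` is still buffered under `0:01001`.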
1. Real-time: streamlined data processing flow
2. Scalability: flexible grouping and shuffling (Application / Signature)
3. Maintenance: separate bolts for system optimizations (Memory, GC, CPU, etc.)
● Crest: similarity clustering system, based on locality-sensitive
hashing
● Detects spam in real-time, built on top of a Heron topology
● Generic interface, clustering “everything” happening on Twitter
#TwitterRealTime - Real time processing @twitter

Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

#TwitterRealTime - Real time processing @twitter

  • 4. WHAT IS REAL TIME?
  • 7. REAL TIME USE CASES ETL BI PRODUCT SAFETY TRENDS ML MEDIA OPS ADS
  • 8. 20 PB 2 Trillion Events/Day 100 ms e2e latency 400 Real Time Jobs DLOG & HERON are Open Sourced
  • 13. WE ARE HIRING! Messaging Data Infrastructure Core Services Search Infrastructure Traffic Real Time Compute Compute Platform Platform Engineering Kernel #LoveWhereYouWork Learn more at careers.twitter.com Hadoop Core Data Libraries Data Applications Core Metrics
  • 15. - Easy operations - Small technology portfolio - Quick development Iteration - Diverse use cases
  • 18. 20 PB 2 Trillion Events 100 ms e2e latency
  • 19. - Event A discrete, self-contained piece of data - Stream A persistent, unordered collection of events with a time - Partition A portion of a stream with a proportional amount of the overall capacity - Subscriber A collection of processes collectively consuming a copy of the stream
  • 21. Flow Control Stream Configuration Partition Ownership DistributedLog (E => Future[Unit]) Offset Tracking Offset Store Metadata DL Read Proxy
  • 22. @DistributedLog http://distributedlog.io Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming> Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar
  • 25. STORM/HERON TERMINOLOGY - TOPOLOGY Directed acyclic graph Vertices=computation, and edges=streams of data tuples - SPOUTS Sources of data tuples for the topology Examples - Kafka/Distributed Log/MySQL/Postgres - BOLTS Process incoming tuples and emit outgoing tuples Examples - filtering/aggregation/join/arbitrary function
  • 26. STORM/HERON TOPOLOGY BOLT 1 BOLT 2 BOLT 3 BOLT 4 BOLT 5 SPOUT 1 SPOUT 2
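The spout and bolt roles described above can be illustrated with a toy pipeline. This is only a minimal sketch in plain Python, not the Storm or Heron API: the names `word_spout`, `split_bolt`, `filter_bolt`, `count_bolt`, and `run_topology` are hypothetical, and a real topology would run each component in parallel instances rather than as chained generators.

```python
from collections import Counter

def word_spout(lines):
    """Spout: a source of data tuples (here, lines of text)."""
    for line in lines:
        yield line

def split_bolt(stream):
    """Bolt (arbitrary function): turn each line into lowercase word tuples."""
    for line in stream:
        for word in line.split():
            yield word.lower()

def filter_bolt(stream, min_len=3):
    """Bolt (filtering): drop words shorter than min_len."""
    for word in stream:
        if len(word) >= min_len:
            yield word

def count_bolt(stream):
    """Bolt (aggregation): count occurrences of each word."""
    return Counter(stream)

def run_topology(lines):
    # Edges of the graph: spout -> split -> filter -> count.
    return count_bolt(filter_bolt(split_bolt(word_spout(lines))))

counts = run_topology(["Heron is real time", "real time at Twitter"])
```

Each function only sees the tuples emitted by its upstream neighbour, which is the property that lets a real system place vertices on different machines.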
  • 27. WHY HERON? ● SCALABILITY and PERFORMANCE PREDICTABILITY ● IMPROVE DEVELOPER PRODUCTIVITY ● EASE OF MANAGEABILITY
  • 28. TOPOLOGY ARCHITECTURE Topology Master ZK CLUSTER Stream Manager I1 I2 I3 I4 Stream Manager I1 I2 I3 I4 Logical Plan, Physical Plan and Execution State Sync Physical Plan CONTAINER CONTAINER Metrics Manager Metrics Manager
  • 32. Large amount of data produced every day Large cluster Several hundred topologies deployed Several million messages every second HERON @TWITTER 1 stage 10 stages 3x reduction in cores and memory Heron has been in production for 2 years
  • 34. STRAGGLERS Stragglers are the norm in multi-tenant distributed systems ● BAD/SLOW HOST ● EXECUTION SKEW ● INADEQUATE PROVISIONING
  • 35. APPROACHES TO HANDLE STRAGGLERS ● SENDERS TO STRAGGLER DROP DATA ● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER ● DETECT STRAGGLERS AND RESCHEDULE THEM
  • 36. SLOW DOWN SENDERS STRATEGY [Diagram: four Stream Managers with spout S1 and bolts B2, B3, B4]
  • 37. BACK PRESSURE IN PRACTICE ● IN MOST SCENARIOS BACK PRESSURE RECOVERS Without any manual intervention ● SOMETIMES USERS PREFER DROPPING DATA Care about only latest data ● SUSTAINED BACK PRESSURE Irrecoverable GC cycles Bad or faulty host
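The trade-off between dropping data and slowing senders down can be seen in a small tick-based simulation. This is a hedged sketch, not how Heron's Stream Manager is implemented: the function `simulate` and its parameters `capacity` and `consume_every` are assumptions made for illustration. The sender tries to emit one tuple per tick, while the straggler consumes one tuple every `consume_every` ticks from a bounded buffer.

```python
from collections import deque

def simulate(strategy, n_tuples=20, capacity=4, consume_every=2):
    """Simulate a fast sender and a slow receiver.

    strategy is "drop" (sender discards tuples when the buffer is full)
    or "backpressure" (sender waits until the buffer has room).
    Returns (delivered, dropped, ticks).
    """
    buffer = deque()
    pending = list(range(n_tuples))  # tuples the sender still has to emit
    delivered, dropped = [], []
    tick = 0
    while pending or buffer:
        tick += 1
        if pending:
            if len(buffer) < capacity:
                buffer.append(pending.pop(0))
            elif strategy == "drop":
                dropped.append(pending.pop(0))
            # Under back pressure the tuple stays pending: the sender
            # is slowed down to the speed of the straggler.
        if tick % consume_every == 0 and buffer:
            delivered.append(buffer.popleft())
    return delivered, dropped, tick

bp_delivered, bp_dropped, bp_ticks = simulate("backpressure")
drop_delivered, drop_dropped, drop_ticks = simulate("drop")
```

With back pressure every tuple arrives in order at the cost of a longer run. With dropping the run finishes sooner but some tuples are lost, which is acceptable only when the consumer cares about the latest data.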
  • 38. ENVIRONMENTS SUPPORTED STORM API PRE-1.0.0 POST 1.0.0 SUMMINGBIRD FOR HERON
  • 39. CURIOUS TO LEARN MORE…
  • 40. INTERESTED IN HERON? CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron http://heronstreaming.io HERON IS OPEN SOURCED FOLLOW US @HERONSTREAMING
  • 43. ● 100K+ Advertisers, $2B+ revenue/year ● 300M+ Users ● Impressions/Engagements ○ Tens of billions of events daily
  • 44. Use Heron & EventBus: ● Prediction ● Serving ● Analytics
  • 46. ● Online learning: models require real-time data ○ On-going training for existing ads ■ CTR, conversions, RTs, Likes ○ On-going training for user data ■ Interests change, targeting must stay relevant ○ New ads arrive constantly ● Consumes 150 GB/second from EventBus streams
  • 47. Ad Server ● Reads Prediction models ● Finalizes Ad selection ● Writes 56GB/second to EventBus ○ Served impressions ○ Spend events Callback Service ● Receives engagements from clients ● Writes engagements to EventBus ○ Consumed by Prediction and Analytics
  • 48. Advertiser Dashboard keeps advertisers informed in real-time For Ads: ● Impressions ● Engagements ● Spend rate ● Uniques For Users: ● Geolocation ● Gender ● Age ● Followers ● Keywords ● Interests
  • 49. Offline layer (hours) ● Engagement log ● Billing pipeline ● 14TB/hour Online layer (seconds) ● Heron topologies read 1M events/sec From EventBus, provide real-time analytics Advertiser Dashboard ● Ad-hoc queries for desired time range ● View performance of ads in real-time http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
  • 51. #RealTime processing helps us scale our Ads business: ● Prediction - Online learning ○ Ads ○ Users ● Analytics - Advertisers get real-time visibility into ad performance This enables us to provide high ROI for Advertisers. Image Credits: http://images.clipartpanda.com/cycle-clipart-bike_red.png http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
  • 53. Observation ● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter ● Spam campaigns come in large batches ● Despite randomized tweaks, enough similarity among spammy entities is preserved Requirement ● Real-time : a competition game with spammers i.e. “detect” vs “mutate” ● Generic : need to support all common feature representations
  • 54. Crest is a generic online similarity clustering system ● Inputs are a stream of entities ● Similarity clustering system groups similar entities together (according to a predefined similarity metric) ● Outputs are the clusters and the cluster entity members. “Built on top of Heron” https://github.com/twitter/heron
  • 56. ● Locality sensitive hashing probabilistic similarity-preserving random projection method Entity1 => hashValue1 (010010001110010100101001000011) Entity2 => hashValue2 (000111001110010101100110100100) Sim(Entity1, Entity2) ~ Sim(hash1, hash2) ● No “Pair-wise” similarity calculation Similarity match based on “signature band”
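One common way to obtain such similarity-preserving bit signatures is random hyperplane hashing, where each bit records which side of a random hyperplane the feature vector falls on. The slides do not say which LSH family Crest uses, so the sketch below is an assumption for illustration; `make_hyperplanes`, `signature`, and `bit_similarity` are hypothetical names.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    bits = []
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def bit_similarity(sig_a, sig_b):
    """Fraction of matching bits; approximates angular similarity of the inputs."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

planes = make_hyperplanes(n_bits=30, dim=4)
v = [0.3, -1.2, 2.0, 0.5]
sig_v = signature(v, planes)
sig_scaled = signature([2 * x for x in v], planes)   # same direction as v
sig_opposite = signature([-x for x in v], planes)    # opposite direction
```

Vectors pointing in the same direction get identical signatures, and vectors pointing in opposite directions disagree on every bit, so comparing signatures stands in for comparing the original entities.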
  • 57. Similarity match based on “signature band” collision Cut signatures into bands: 01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band) Two entities become similarity candidates if they collide on at least one band (i.e. they need to match all signatures within some band)
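The band collision rule can be sketched in a few lines. The two 30-bit values are the `hashValue1` and `hashValue2` examples from the locality sensitive hashing slide; the helper names `bands` and `colliding_bands` are hypothetical and not taken from Crest.

```python
def bands(signature, sigs_per_band=5):
    """Cut a signature string into consecutive bands of sigs_per_band bits."""
    return [signature[i:i + sigs_per_band]
            for i in range(0, len(signature), sigs_per_band)]

def colliding_bands(sig_a, sig_b, sigs_per_band=5):
    """Return the indices of bands on which the two signatures match exactly."""
    pairs = zip(bands(sig_a, sigs_per_band), bands(sig_b, sigs_per_band))
    return [i for i, (a, b) in enumerate(pairs) if a == b]

def are_candidates(sig_a, sig_b, sigs_per_band=5):
    """Two entities are similarity candidates if at least one band collides."""
    return len(colliding_bands(sig_a, sig_b, sigs_per_band)) > 0

hash_value_1 = "010010001110010100101001000011"
hash_value_2 = "000111001110010101100110100100"
collisions = colliding_bands(hash_value_1, hash_value_2)
```

Here the two signatures agree only on the third band (`10010`), which is enough to make the entities candidates without any pair-wise comparison of the full feature vectors.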
  • 58. 1. Given entity features, calculate signatures, and cut into bands 2. Match with all existing clusters in cluster store, which collide with at least one band 3. Find the closest cluster Incoming Entity: 01001 00011 10010 10010 10010 00011 Known Cluster1: 01011 00011 01010 10111 11110 10011 Known Cluster2: 01101 01011 01000 10010 10010 01111
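The matching steps above can be traced on the slide's own example values. This sketch assumes that "closest" means the smallest Hamming distance between full signatures, which the slides do not state; `closest_cluster` and the in-memory dictionary standing in for the cluster store are likewise hypothetical.

```python
def bands(signature, sigs_per_band=5):
    """Cut a signature string into consecutive bands."""
    return [signature[i:i + sigs_per_band]
            for i in range(0, len(signature), sigs_per_band)]

def hamming(sig_a, sig_b):
    """Number of bit positions where the two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def closest_cluster(entity_sig, clusters):
    """Return (name, distance) of the nearest cluster sharing a band, or None."""
    entity_bands = bands(entity_sig)
    # Step 2: keep only clusters that collide with the entity on some band.
    candidates = {
        name: sig for name, sig in clusters.items()
        if any(a == b for a, b in zip(entity_bands, bands(sig)))
    }
    if not candidates:
        return None
    # Step 3: pick the candidate with the smallest Hamming distance (assumed metric).
    name = min(candidates, key=lambda n: hamming(entity_sig, candidates[n]))
    return name, hamming(entity_sig, candidates[name])

incoming = "01001 00011 10010 10010 10010 00011".replace(" ", "")
cluster_store = {
    "cluster1": "01011 00011 01010 10111 11110 10011".replace(" ", ""),
    "cluster2": "01101 01011 01000 10010 10010 01111".replace(" ", ""),
}
match = closest_cluster(incoming, cluster_store)
```

Both known clusters collide with the incoming entity on at least one band (cluster 1 on the second band, cluster 2 on the fourth and fifth), so both are candidates, and under the assumed metric cluster 2 is the nearer one.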
  • 59. 1. Count occurrences of each band signature 2. Use Count-Min Sketch to find the hot signatures 3. Send entities with hot signatures for clustering
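A Count-Min Sketch keeps approximate counts in fixed memory and never underestimates, which makes it a natural fit for spotting hot band signatures in a stream. The following is a minimal sketch under stated assumptions: the table size, the SHA-256 based row hashes, the `"band_index:bits"` key format, and the `hot_threshold` value are illustrative choices, not details from the slides.

```python
import hashlib

class CountMinSketch:
    """Fixed-size approximate counter: estimates are >= the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # The minimum across rows is the estimate least inflated by collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
hot_threshold = 50  # hypothetical cut-off for a "hot" signature

# A spam-like burst: the same band signature arrives many times.
for _ in range(60):
    sketch.add("2:10010")
sketch.add("0:01001")
sketch.add("5:00011")

is_hot = sketch.estimate("2:10010") >= hot_threshold
```

Only entities carrying a signature whose estimate crosses the threshold would be forwarded to the clustering stage, so rare signatures never consume clustering capacity.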
  • 60. 1. Group entities by band signatures 2. Run in-memory clustering algorithm when the group is big enough 3. Save the cluster in cluster key-value store
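The grouping step can be sketched as follows. This is an illustration under assumptions, not Crest's implementation: `BandGrouper` and `min_group_size` are hypothetical, a plain dictionary stands in for the cluster key-value store, and the in-memory clustering algorithm is replaced by a placeholder that simply saves the whole group.

```python
from collections import defaultdict

class BandGrouper:
    """Group entity ids by (band index, band bits) and save big groups as clusters."""

    def __init__(self, min_group_size=3, sigs_per_band=5):
        self.min_group_size = min_group_size
        self.sigs_per_band = sigs_per_band
        self.groups = defaultdict(list)
        self.cluster_store = {}  # stand-in for the cluster key-value store

    def add(self, entity_id, signature):
        step = self.sigs_per_band
        for start in range(0, len(signature), step):
            key = (start // step, signature[start:start + step])
            group = self.groups[key]
            group.append(entity_id)
            if len(group) >= self.min_group_size:
                # Placeholder for the real clustering algorithm:
                # treat the whole group as one cluster.
                self.cluster_store[key] = list(group)

grouper = BandGrouper()
grouper.add("e1", "010010001110010100101001000011")
grouper.add("e2", "000111001110010101100110100100")
grouper.add("e3", "111111111110010000000000000000")
```

All three entities share the band `10010` at index 2, so that group reaches the size threshold and is saved, while every other group stays below it.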
  • 61. 1. Real-time : streamline data processing flow 2. Scalability : flexible grouping and shuffling ( Application / Signature ) 3. Maintenance : separated bolts for system optimizations ( Memory, GC, CPU, etc )
  • 62. ● Crest : similarity clustering system, based on locality-sensitive hashing ● Detects spam in real-time, built on top of a Heron topology ● Generic interface, clustering “everything” happening in Twitter