Event Data Processing with Streamlio

Engineering Real Time
Event Driven Processing
Karthik Ramasamy, Co-founder
Streamlio

Information Age
Event data is key

Increasingly Connected World
• Internet of Things: 30 B connected devices by 2020
• Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020)
• Machine Data: 40% of digital universe by 2020
• Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
• Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1] (Siri/Cortana/Google Now)
• Augmented/Virtual Reality: $150B by 2020 [2] (Oculus/HoloLens/Magic Leap)

Product Safety
Observations
• Fight spammy content, engagements, and behaviors on Twitter
• Spam campaigns come in large batches
• Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirements
• Real time - a competition with spammers (i.e., “detect” vs. “mutate”)
• Generic - need to support all common feature representations
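The observation that spam variants stay similar despite randomized tweaks can be illustrated with a small near-duplicate grouping sketch. This is not the system described in the slides: the character-shingle features, the Jaccard measure, the 0.5 threshold, the greedy single-pass grouping, and the sample tweets are all assumptions chosen for illustration.

```python
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of k-character shingles of a whitespace/case-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(texts: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy single pass: join the first cluster whose representative is similar enough."""
    representatives: List[Set[str]] = []
    clusters: List[List[int]] = []
    for index, text in enumerate(texts):
        features = shingles(text)
        for cluster_id, representative in enumerate(representatives):
            if jaccard(features, representative) >= threshold:
                clusters[cluster_id].append(index)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(features)
            clusters.append([index])
    return clusters


# Made-up sample data: three tweaked copies of one campaign and one normal tweet.
tweets = [
    "Win a FREE phone now!!! click http://spam.example/a1",
    "Win a FREE phone now!!  click http://spam.example/b7",
    "Lunch at the new ramen place was great",
    "win a free phone NOW!!! click http://spam.example/zz",
]
clusters = cluster(tweets)
```

In a streaming setting the representatives would live in a shared store rather than in a local list, which is presumably the role of a KV store for clustering; that mapping is an inference, not something the slides state.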

Product Safety - System Overview
Diagram components: Messaging System (Event Bus); Similarity Clustering (Heron); KV store for clustering

Real Time Ads
Diagram components: Messaging System (Event Bus); Ads Serving (Heron); Ads Prediction (Heron); Ads Analytics (Heron); KV store
Data flows: Ads Requests, Ads Responses, Impressions, Engagements, Spend

Connected Cars
Diagram components: Data Capture/Filter; Messaging System (two instances); Traffic Patterns; Fuel Efficiency; KV store for clustering

Recurring Pattern
Diagram stages: Data Ingestion, Data Processing, Results Storage, Data Storage, Data Serving
Building blocks: Messaging, Process, Storage

State of the World
Diagram components: Messaging Systems, Aggregation Systems, Result Engine, Queryable Engines, HDFS

Towards Unification and Simplification
Diagram labels: Interactive Querying; Storm API; Streamlets; SQL; Application Builder; Pulsar API; Kafka API; BK/HDFS API; Metadata Management; Operational Monitoring; Chargeback; Security; Authentication; Quota Management

Apache Pulsar highlights
• Stream-Native Functions (*NEW*): apply processing functions on data, fully managed by Pulsar
• Multi-tenancy: a single cluster can support many tenants and use cases
• High throughput: millions of messages/s in a single partition
• Durability: data replicated and synced to disk
• Geo-replication: out-of-the-box support for geographically distributed applications
• Unified messaging model: supports both topic and queue semantics in a single model
• Delivery guarantees: at least once, at most once, and effectively once
• Low latency: publish latency of 5 ms at the 99th percentile
• Scalability: supports millions of topics in a single cluster
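As a rough illustration of a stream-native function, the sketch below uses the plain-function style that the Pulsar Functions Python runtime accepts, where a module-level function named `process` receives each message payload and its return value is published to the output topic. The normalization logic is a made-up example, and deployment details (topics, tenant, namespace) are deliberately left out.

```python
def process(input):
    """Normalize an incoming event payload: trim whitespace and lowercase it."""
    return input.strip().lower()
```

Because the function has no dependency on Pulsar itself, it can be unit tested as ordinary Python before it is handed to the platform.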

Pulsar Architecture
Serving
• Brokers can be added independently
• Traffic can be shifted quickly across brokers
Storage
• Bookies can be added independently
• New bookies will ramp up traffic quickly

Segment Centric Architecture

Pulsar multi-datacenter replication
Geo-replication
• Asynchronous replication
• Integrated in the broker message flow
• Simple configuration to add/remove regions
Diagram: Topic (T1) is replicated across Data Center A, Data Center B, and Data Center C. Producers (P1), (P2), and (P3) publish to the topic, and consumers read through Subscription (S1).
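The asynchronous replication idea can be sketched as a toy model: a publish is stored locally and queued for each remote region, and a separate step drains those queues later. This is an in-memory illustration only, not Pulsar's implementation; the `Region` class and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Message = Tuple[str, str]  # (origin region, payload)


class Region:
    """One data center holding a local copy of topic T1 (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.topic: List[Message] = []
        # peer name -> (peer region, messages waiting to be replicated to it)
        self.peers: Dict[str, Tuple["Region", Deque[Message]]] = {}

    def add_region(self, peer: "Region") -> None:
        self.peers[peer.name] = (peer, deque())

    def remove_region(self, name: str) -> None:
        self.peers.pop(name, None)

    def publish(self, payload: str) -> None:
        """Store locally and queue for remotes; does not wait for remote regions."""
        message = (self.name, payload)
        self.topic.append(message)
        for _, pending in self.peers.values():
            pending.append(message)

    def replicate(self) -> None:
        """Drain pending queues. Replicated messages are not forwarded again."""
        for peer, pending in self.peers.values():
            while pending:
                peer.topic.append(pending.popleft())


a, b, c = Region("A"), Region("B"), Region("C")
for source in (a, b, c):
    for target in (a, b, c):
        if source is not target:
            source.add_region(target)

a.publish("m1")
c.publish("m2")
before_replication = list(b.topic)  # B has seen nothing yet
for region in (a, b, c):
    region.replicate()
```

After the replication step every region holds both messages, although the order can differ per region because each one appends its own publishes first.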

Apache Heron design goals
• Efficiency: reduce resource consumption
• Support for diverse workloads: throughput vs latency sensitive
• Support for multiple semantics: at most once, at least once, effectively once
• Native multi-language support: C++, Java, Python
• Task isolation: ease of debug-ability/isolation/profiling
• Support for back pressure: topologies should be self adjusting
• Use of containers: runs in schedulers - Kubernetes & DCOS & many more
• Multi-level APIs: Procedural, Functional and Declarative for diverse applications
• Diverse deployment models: run as a service or pure library
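The back pressure goal ("topologies should be self adjusting") can be illustrated with a single-threaded simulation in which a fast source is paused whenever the buffer in front of a slow consumer fills up. The slides do not describe Heron's actual mechanism, so the high/low water marks, the tick-based loop, and the `run` function are assumptions for illustration.

```python
from collections import deque
from typing import Tuple


def run(ticks: int, high_water: int = 5, low_water: int = 2) -> Tuple[int, int, int]:
    """Simulate a fast spout feeding a slow bolt through a buffer.

    Returns (tuples emitted, tuples processed, maximum buffer depth observed).
    """
    buffer: deque = deque()
    back_pressure = False
    emitted = processed = max_depth = 0
    for tick in range(ticks):
        # Toggle back pressure from buffer depth (hysteresis between the two marks).
        if len(buffer) >= high_water:
            back_pressure = True
        elif len(buffer) <= low_water:
            back_pressure = False
        # The spout emits one tuple per tick unless it is being throttled.
        if not back_pressure:
            buffer.append(emitted)
            emitted += 1
        # The slow bolt handles one tuple on every second tick.
        if tick % 2 == 1 and buffer:
            buffer.popleft()
            processed += 1
        max_depth = max(max_depth, len(buffer))
    return emitted, processed, max_depth


emitted, processed, max_depth = run(100)
```

Without the throttle the buffer would grow for as long as the source outpaces the consumer; with it, the depth stays at or below the high water mark and the source's effective rate settles near the consumer's.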

Heron data model
Diagram: an example topology with Spout 1 and Spout 2 as sources and Bolt 1 through Bolt 5 as processing steps

Writing Heron topologies
Procedural - Low Level API
• Directly write your spouts and bolts
Functional - Mid Level API
• Use maps, flat maps, transforms, and windows
Declarative - SQL (in the works)
• Use a declarative language: specify what you want, and the system figures out how to run it
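The difference between the procedural and functional levels can be sketched with a word count written both ways. This is plain Python, not the Heron API: `SentenceSpout`, `SplitBolt`, `CountBolt`, and both `run_*` functions are hypothetical names that only mimic the shape of each style.

```python
from collections import Counter
from typing import Dict, Iterator, List


# Procedural style: explicit spout and bolt objects, wired together by hand.
class SentenceSpout:
    def __init__(self, sentences: List[str]) -> None:
        self.sentences = sentences

    def next_tuples(self) -> Iterator[str]:
        yield from self.sentences


class SplitBolt:
    def execute(self, sentence: str) -> List[str]:
        return sentence.lower().split()


class CountBolt:
    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, word: str) -> None:
        self.counts[word] += 1


def run_procedural(sentences: List[str]) -> Dict[str, int]:
    spout, split, count = SentenceSpout(sentences), SplitBolt(), CountBolt()
    for sentence in spout.next_tuples():
        for word in split.execute(sentence):
            count.execute(word)
    return dict(count.counts)


# Functional style: the same computation as a flat map followed by a count.
def run_functional(sentences: List[str]) -> Dict[str, int]:
    words = (word for sentence in sentences for word in sentence.lower().split())
    return dict(Counter(words))


sample = ["the quick fox", "The lazy dog"]
procedural_result = run_procedural(sample)
functional_result = run_functional(sample)
```

Both versions compute the same counts; the functional one leaves the wiring between steps to the framework, which is the trade-off the mid-level API is aiming at.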

Topology execution
Diagram components:
• ZK Cluster: holds the Logical Plan, Physical Plan and Execution State
• MASTER CONTAINER: Topology Master, Metrics Manager, Health Manager
• DATA CONTAINER (two shown): Stream Manager, instances I1 I2 I3 I4, Metrics Manager
• The Topology Master syncs the Physical Plan with the Stream Managers

Heron topology scale
CONTAINERS - 1 TO 600
INSTANCES - 10 TO 6000

Heron impact at Twitter
• No more midnight pages for the Heron team
• Very rare incidents for Heron customer teams
• Easy to debug during incidents, for quick turnaround
• Reduced resource utilization, saving cost

Observations
Computation across batch/streaming is similar
• Expressed as DAGs
• Run in parallel on the cluster
• Intermediate results need not be materialized
• Functional/Declarative APIs
Storage is the key
• Messaging/Storage are two sides of the same coin
• They serve the same data
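The point that batch and streaming computations look alike can be shown with one pipeline definition applied to both a finite list and an unbounded source. The sketch uses Python generators, so intermediate results are never materialized; the record format and the stage names (`parse`, `large_only`, `running_total`) are made up for illustration.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def parse(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Turn 'user,amount' records into (user, amount) pairs."""
    for line in lines:
        user, amount = line.split(",")
        yield user, int(amount)


def large_only(events: Iterable[Tuple[str, int]], minimum: int) -> Iterator[Tuple[str, int]]:
    """Keep only events whose amount is at least `minimum`."""
    return (event for event in events if event[1] >= minimum)


def running_total(events: Iterable[Tuple[str, int]]) -> Iterator[int]:
    """Emit the cumulative amount after each event."""
    total = 0
    for _, amount in events:
        total += amount
        yield total


def pipeline(source: Iterable[str]) -> Iterator[int]:
    """The same three-stage chain, whatever the source is."""
    return running_total(large_only(parse(source), minimum=10))


# Batch: a finite, already stored data set.
batch_result = list(pipeline(["a,5", "b,20", "c,15"]))


# Streaming: an unbounded source, consumed lazily.
def endless() -> Iterator[str]:
    for i in itertools.count():
        yield f"user{i},{i * 10}"


stream_head = list(itertools.islice(pipeline(endless()), 3))
```

Only the source differs between the two runs, which is why the storage layer that serves that source becomes the deciding factor.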

Storage Requirements
• Write and read streams of records with low latency and durable storage
• Data storage should be durable, consistent and fault tolerant
• Enable clients to stream or tail ledgers to propagate data as it is written
• Store and provide access to both historical and real-time data
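The requirement to serve both historical reads and tailing readers from the same storage can be sketched with an append-only log and a reader that remembers its position. This is an in-memory toy, not a durable or replicated store, and the `Ledger` and `TailReader` classes are hypothetical names rather than any real client API.

```python
from typing import List


class Ledger:
    """Append-only sequence of records, addressed by entry id (in memory only)."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, record: bytes) -> int:
        """Add a record and return its entry id."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, start: int = 0) -> List[bytes]:
        """Return all records from entry id `start` up to the current end."""
        return self._entries[start:]


class TailReader:
    """Follows a ledger and returns only the records added since the last poll."""

    def __init__(self, ledger: Ledger, start: int = 0) -> None:
        self.ledger = ledger
        self.position = start

    def poll(self) -> List[bytes]:
        new_records = self.ledger.read(self.position)
        self.position += len(new_records)
        return new_records


ledger = Ledger()
ledger.append(b"e0")
ledger.append(b"e1")

reader = TailReader(ledger)
first_poll = reader.poll()    # catches up on existing entries
ledger.append(b"e2")
second_poll = reader.poll()   # sees only the newly written entry
third_poll = reader.poll()    # nothing new yet
history = ledger.read(0)      # historical read of the full ledger
```

The same `read` call backs both access patterns, so real-time consumers and historical queries see identical data.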

Companies using the technology
