Invited guest lecture at UCSC for the MSc. Distributed Systems course. The talk includes a recap of stream processing buzzwords and an introduction to dynamic graph streams.
Special thanks go to Martin Kleppmann (LinkedIn) and Vasia Kalavri (KTH) for the knowledge hub.
4. 07/10/16 MSc. Distributed Systems 4
The Streaming Era
● Today, most data is continuously produced
- user activity logs, web logs, sensors, database transactions, …
● The common approach to analyzing such data so far
- Record the data stream to stable storage (DBMS, HDFS, …)
- Periodically analyze the data with a batch processing engine (DBMS, MapReduce, …)
● Stream processing engines analyze data as it arrives
The Streaming Era (Contd.)
● Decreases the overall latency to obtain results
- No need to persist data in stable storage
- No periodic batch analysis jobs
● Simplifies the data infrastructure
- Fewer moving parts to be maintained and coordinated
● Makes time dimension of data explicit
- Each event has a timestamp
- Data can be processed based on timestamps
Event Streams
[Immutable]
Web Page Event
Wikipedia Page Update Event
LinkedIn User Update Event
Point-to-point communication
● Direct coupling
- Strict identity
● Time coupling
● Not good for volatile environments
● Not a good way to communicate with several participants
Indirect communication
● Space uncoupling
- Anonymity
● Time uncoupling
- Independent lifetimes between parties
● Through a persistent communication channel
Taxonomy
Indirect Communication
● Communication based
- Group communication
- Message Queues
- Publish/subscribe
● State based
- Tuple spaces
- Distributed Shared Memory
Pub/Sub Messaging Pattern
Topic-based
- Each event belongs to a number of topics (e.g. “music”, “sport”)
- Users subscribe to topics and receive all relevant events
Content-based
- Users subscribe to the actual content of the events, or to a structured summary of it
- More expressive
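The difference between the two subscription styles can be sketched in a few lines of Python. This is a minimal illustration and not any particular broker's API; the event fields (`topics`, `artist`, `price`) and the function names are hypothetical.

```python
# Hypothetical event: a dict with a set of topics plus arbitrary content fields.
event = {"topics": {"music", "sport"}, "artist": "Muse", "price": 35}


def topic_match(subscribed_topics, event):
    """Topic-based: the subscriber names topics; any overlap is a match."""
    return bool(subscribed_topics & event["topics"])


def content_match(predicate, event):
    """Content-based: the subscriber supplies a predicate over event fields."""
    return predicate(event)


# A topic subscriber receives every "music" event, cheap or expensive.
wants_music = topic_match({"music"}, event)

# A content subscriber can be more selective, which is what makes
# content-based subscriptions more expressive.
cheap_muse = content_match(
    lambda e: e.get("artist") == "Muse" and e.get("price", 0) < 30, event
)
```

Here the topic subscription matches, while the content predicate rejects the event because of its price.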
Pub/Sub Activities
● Subscription processing
- Indexing and storing subscriptions
● Event Stream Processing (ESP)
- Pub/sub approach: upon arrival of an event, access the subscription index and identify all matching subscriptions
● Event delivery
- Deliver the event to clients with matching subscriptions
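The three activities can be sketched as a toy in-memory, topic-based broker. The class and method names below are illustrative assumptions, and the per-client inbox stands in for a real network connection.

```python
from collections import defaultdict


class Broker:
    """Toy topic-based broker; names are illustrative, not a real API."""

    def __init__(self):
        # Subscription index: topic -> set of client ids.
        self.index = defaultdict(set)
        # Per-client inbox standing in for a network connection.
        self.inbox = defaultdict(list)

    def subscribe(self, client, topic):
        # Subscription processing: index and store the subscription.
        self.index[topic].add(client)

    def publish(self, event, topics):
        # Matching: look up the index for every topic of the event.
        matched = set()
        for topic in topics:
            matched |= self.index.get(topic, set())
        # Event delivery: hand the event to each matched client once.
        for client in sorted(matched):
            self.inbox[client].append(event)
        return matched


broker = Broker()
broker.subscribe("alice", "music")
broker.subscribe("bob", "sport")
broker.subscribe("alice", "sport")
delivered_to = broker.publish("derby tonight", ["sport", "news"])
```

Keeping the index keyed by topic means matching costs one lookup per topic of the event, independent of the total number of subscribers.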
“Buzzwords”
https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html
Complex Event Processing (CEP)
● A set of event processing principles
● Match patterns of events
– Comparable to SQL queries
– High-level query language
● Cloud of causally related events
– POSET (Partially Ordered Set of Events)
Complex Event Processing (CEP)
● Some CEP Examples:
– When 2 transactions happen on an account from radically different geographic locations within a certain time window, then report as potential fraud.
– When a gold customer's trouble ticket is not resolved within 1 hour, then escalate.
– When a team meeting request overlaps with my lunch break, then deny the team meeting and demote the meeting organizer.
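The first example (two transactions from distant locations within a time window) can be sketched as a small pattern matcher over an ordered event stream. This is a hand-rolled illustration rather than a CEP engine's query language; the 600-second window and the use of "different country" as a proxy for "radically different location" are assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # assumed size of the time window


def detect_fraud(transactions, window=WINDOW_SECONDS):
    """Yield (account, earlier_txn, later_txn) for suspicious pairs.

    `transactions` is an iterable of (timestamp, account, country) tuples
    in timestamp order.
    """
    recent = defaultdict(deque)  # account -> transactions inside the window
    for ts, account, country in transactions:
        history = recent[account]
        # Expire transactions that fell out of the time window.
        while history and ts - history[0][0] > window:
            history.popleft()
        # Pattern: same account, different location, both inside the window.
        for old_ts, old_country in history:
            if old_country != country:
                yield account, (old_ts, old_country), (ts, country)
        history.append((ts, country))


stream = [
    (0, "acc-1", "SE"),
    (100, "acc-2", "US"),
    (200, "acc-1", "BR"),   # 200 s after the SE transaction: suspicious
    (2000, "acc-1", "SE"),  # earlier transactions have expired: fine
]
alerts = list(detect_fraud(stream))
```

Only the state needed for the pattern (recent transactions per account) is retained, so memory is bounded by the window rather than by the length of the stream.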
ESP and CEP
[Timeline]
<2000 (1989 - 1995): Rapide
2002: Aurora
2003: Medusa
2005: Borealis
Also shown (through 2016): STREAM, TelegraphCQ, Esper, Apama, StreamBase, SQLStream, WSO2 CEP
ESP vs. CEP
http://www.slideshare.net/TimBassCEP/mythbusters-event-stream-processing-v-complex-event-processing-presentation
Laundry List of “Buzzwords”
● Actor Frameworks
– Better mechanism to handle concurrency
– E.g. Akka, Orleans and Erlang OTP
● “Reactive”
– Language semantics for bringing event streams to the user interface
– Responsive, Resilient, Elastic and Message Driven
– E.g. Data flow languages, Functional reactive programming
● Event Sourcing
● Change Data Capture (CDC)
Target
● Better Scalability
● High Throughput
● Low latency
● Powerful semantics
● Easy integration
via Low Level Stream Processing Frameworks!
Spark Streaming
● General purpose computing engine to run batch, interactive and streaming jobs
● Based on Resilient Distributed Datasets (RDD)
– Restricted form of distributed shared memory
– Immutable
– Can only be built through deterministic transformations
● Efficient fault recovery using lineage graph
– Recompute lost partitions on failure
– No cost if nothing fails
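Lineage-based recovery can be illustrated with a toy dataset class in plain Python. `ToyRDD` is a hypothetical stand-in and not Spark's API: it only records the parent dataset and the deterministic function that produced each partition, so that a lost partition can be recomputed on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable partitions plus lineage."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = [tuple(p) for p in partitions]  # immutable data
        self._parent = parent  # lineage: the dataset this one was built from
        self._fn = fn          # lineage: the deterministic transformation

    def map(self, fn):
        # A transformation builds a new dataset and never mutates the old one.
        new_parts = [[fn(x) for x in p] for p in self._partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def lose_partition(self, i):
        # Simulate a node failure that drops one partition.
        self._partitions[i] = None

    def get_partition(self, i):
        if self._partitions[i] is None:
            # Recompute only the lost partition from its parent.
            source = self._parent.get_partition(i)
            self._partitions[i] = tuple(self._fn(x) for x in source)
        return self._partitions[i]


base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
squared.lose_partition(1)
recovered = squared.get_partition(1)
```

When nothing fails, `get_partition` just returns the stored tuple, which mirrors the "no cost if nothing fails" point: the lineage is only metadata until it is needed.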
Spark Streaming (Contd.)
[Key concepts]
● DStream – sequence of RDDs representing a stream of data
– Sources: HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
● Transformations – modify data from one DStream to another
– Standard RDD operations – map, countByValue, reduce, join, …
– Stateful operations – window, countByValueAndWindow, …
● Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
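To make the stateful window operations concrete, here is a plain-Python sketch of what a windowed count computes over a sequence of batches. It does not use Spark; the function name only echoes `countByValueAndWindow`, and the window length is expressed in number of batches as a simplifying assumption.

```python
from collections import Counter, deque


def count_by_value_and_window(batches, window_length):
    """For each new batch, count values over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # state carried across batches
    for batch in batches:
        window.append(Counter(batch))     # per-batch count of each value
        total = Counter()
        for counts in window:
            total += counts
        yield dict(total)


batches = [["a", "b"], ["a"], ["c", "a"]]
windowed = list(count_by_value_and_window(batches, window_length=2))
```

The operation is stateful because each result depends on earlier batches, unlike `map` or a per-batch `countByValue`.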
Spark Streaming (Contd.)
● Run a streaming computation as a series of very small, deterministic batch jobs
– Chop up the live stream into batches of X seconds
– Spark treats each batch of data as RDDs and processes them using RDD operations
– Finally, the processed results of the RDD operations are returned in batches
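The micro-batching idea can be sketched without Spark: cut a time-ordered stream into fixed intervals and run an ordinary batch operation on each piece. The `discretize` helper and the sample events are hypothetical.

```python
from collections import Counter


def discretize(events, batch_seconds):
    """Chop (arrival_time, value) pairs into consecutive batches of X seconds.

    Events are assumed to arrive in time order; an interval without events
    yields an empty batch.
    """
    batch, batch_end = [], batch_seconds
    for arrival_time, value in events:
        while arrival_time >= batch_end:
            yield batch
            batch, batch_end = [], batch_end + batch_seconds
        batch.append(value)
    yield batch


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "c")]
# Each batch is processed by an ordinary, deterministic batch operation.
results = [dict(Counter(b)) for b in discretize(events, batch_seconds=1)]
```

The batch interval sets a lower bound on latency: a result for an event is only available once its batch has been closed and processed.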
Apache Storm
[Key concepts]
● Tuple
– Core Unit of Data
– Immutable Set of Key/Value Pairs
● Spouts
– Source of Streams
– Wraps a streaming data source and emits Tuples
● Bolts
– Core functions of a streaming computation
– Receive tuples and process them
– Optionally emit additional tuples
Apache Storm
[Key concepts]
● Topology
– DAG of Spouts and Bolts
– Data Flow Representation
– Streaming Computation
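The spout, bolt and topology concepts can be mimicked with Python generators. This is a conceptual sketch and not Storm's API: tuples are modelled as dicts (which, unlike Storm tuples, are mutable), and the word-count pipeline is an assumed example.

```python
from collections import Counter


def sentence_spout():
    """Spout: wraps a data source and emits tuples."""
    for line in ["the cat", "the dog"]:
        yield {"sentence": line}


def split_bolt(tuples):
    """Bolt: receives sentence tuples and emits one tuple per word."""
    for t in tuples:
        for word in t["sentence"].split():
            yield {"word": word}


def count_bolt(tuples):
    """Bolt: keeps a running count and emits the updated count per word."""
    counts = Counter()
    for t in tuples:
        counts[t["word"]] += 1
        yield {"word": t["word"], "count": counts[t["word"]]}


# Topology: spout -> split -> count (a linear DAG).
output = list(count_bolt(split_bolt(sentence_spout())))
```

In a real deployment each stage would run as parallel tasks on different machines; chaining generators only shows the data flow.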
Streaming Machine Learning
● By using a programming abstraction for distributed streaming
– Apache SAMOA
Graph Stream Processing
Referred Author: Vasia Kalavri, KTH
https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/buzzwords-kalavri.pdf
Static Graph Processing
● Load: read the graph from disk and partition it in memory
● Compute: read and mutate the graph state
● Store: write the final graph state back to disk
Static Graph Processing
[Drawbacks]
● It is slow
– wait until the computation is over before you see any result
– pre-processing and partitioning
● It is expensive
– lots of memory and CPU required in order to scale
● It requires re-computation for graph changes
– no efficient way to deal with updates
Streaming Graph Processing
We consume events in real-time
● Get results faster
– No need to wait for the job to finish
– Sometimes, early approximations are better than late exact answers
● Get results continuously
– Process unbounded number of events
Real-world scenarios
● Targeted Advertisement
– Finding Strongly Connected Components in a social network graph
– Targeted chain of advertisement on detected communities
[Figure: social graph with nodes Jane, Joe, #Tesla, Peter, Taphouse and John; edge labels knows, posts, likes, checks-in and subscribes; attached ads: Self driving cars, Dinner Offer]
Streaming Graph Processing
[Challenges]
● Maintain the graph structure
– How to apply state updates efficiently?
● Result updates
– Re-run the analysis for each event?
– Design an incremental algorithm?
– Run separate instances on multiple snapshots?
● How to preserve graph properties?
– Natural behavior?
Streaming Graph Processing
[Current Research]
Each event is an edge addition
(Jane, knows, Joe)
(Jane, likes, #Tesla)
(Joe, posts, #Tesla)
(Peter, checks-in, TapHouse)
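Treating every event as an edge addition means a result can be refreshed after each event rather than after a full load-compute-store cycle. The sketch below maintains vertex degrees over such a stream; the (source, label, target) encoding and the `degree_stream` name are assumptions, and edges are counted as undirected.

```python
from collections import defaultdict


def degree_stream(edge_events):
    """Consume edge additions one at a time and emit updated vertex degrees."""
    degree = defaultdict(int)  # the only graph state that is kept
    for src, label, dst in edge_events:
        degree[src] += 1
        degree[dst] += 1
        # Emit a result after every event instead of waiting for the end.
        yield (src, degree[src]), (dst, degree[dst])


edges = [
    ("Jane", "knows", "Joe"),
    ("Jane", "likes", "#Tesla"),
    ("Joe", "posts", "#Tesla"),
    ("Peter", "checks-in", "TapHouse"),
]
updates = list(degree_stream(edges))
```

Only a per-vertex counter is stored, so the edges themselves never need to be kept in memory.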
Dynamic Graph Processing
● Instead of analyzing the whole graph
– Analyze its properties by preserving them continuously
● Connectivity or Distance (spanners)
● Graph cut estimation (sparsifiers)
● Neighborhood or homomorphic properties (sketches)
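As one illustration of preserving a property with a summary, the sketch below builds a spanner from an edge stream with a simple greedy rule: keep an edge only if its endpoints are currently more than `stretch` hops apart in the summary. The slides do not specify an algorithm, so this rule, the function names and the sample graph are assumptions chosen for brevity.

```python
from collections import defaultdict, deque


def within_distance(adj, src, dst, limit):
    """Breadth-first search: is dst reachable from src in at most `limit` hops?"""
    if src == dst:
        return True
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == limit:
            continue  # do not expand beyond the hop limit
        for nxt in adj[node]:
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def stream_spanner(edges, stretch):
    """Keep an edge only if its endpoints are more than `stretch` hops apart."""
    adj = defaultdict(set)  # adjacency of the summary graph only
    kept = []
    for u, v in edges:
        if not within_distance(adj, u, v, stretch):
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept


# A 4-cycle plus a chord: the last two edges close short cycles and are dropped.
kept = stream_spanner([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)], stretch=3)
```

Every dropped edge has a path of at most `stretch` hops in the summary, so connectivity is preserved exactly and hop distances grow by at most a factor of `stretch`.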
Dynamic Graph Processing (Contd.)
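[Figure: social graph with nodes Jane, Joe, #Tesla, Peter, Taphouse and John; edge labels knows, posts, likes, checks-in and subscribes; ads Self driving cars and Dinner Offer; additional "loves" edges involving Peter and Jane with a further Self driving cars ad]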
Stream Connected Components
● State: a disjoint set data structure for the components
● Computation: For each edge
– if both endpoints are seen for the 1st time, create a component with ID the min of the vertex IDs
– if in different components, merge them and update the component ID to the min of the component IDs
– if only one of the endpoints belongs to a component, add the other one to the same component
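The per-edge rules above map naturally onto a union-find structure in which the root of each set is its smallest vertex ID. The following is a minimal single-machine sketch; the class name and the integer vertex IDs are assumptions, and a distributed version would also have to merge partial states across workers.

```python
class StreamingCC:
    """Disjoint-set state for connected components over an edge stream."""

    def __init__(self):
        self.parent = {}  # vertex -> parent; a root is its component ID

    def find(self, v):
        # Follow parents to the root, then compress the path.
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        for x in (u, v):
            # A vertex seen for the first time starts as its own component.
            self.parent.setdefault(x, x)
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            # Merge and keep the smaller ID as the component ID.
            self.parent[max(ru, rv)] = min(ru, rv)
        return min(ru, rv)

    def components(self):
        groups = {}
        for v in self.parent:
            groups.setdefault(self.find(v), set()).add(v)
        return groups


cc = StreamingCC()
for edge in [(1, 2), (4, 5), (2, 3), (5, 3)]:
    cc.add_edge(*edge)
```

The three cases on the slide collapse into one code path: new vertices become singleton components, and every edge then merges the components of its endpoints if they differ. The state grows with the number of vertices, not the number of edges.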
Streaming Graph Processing
[Current Work]
● We're working with Gelly-Streams on
– Preserving natural properties in large scale real-world evolving graphs
– Joining multiple streams to detect graph causality/bipartiteness
– Efficient graph partitioning mechanisms to on-board with popular data-stores like Cassandra, HDFS
– Producing a platform to benchmark NP-complete (NPC) problems in real-world graphs