Data Streaming Technology Overview

Data Streaming Technology
Overview
Dan Lynn, CEO
CodeFutures
www.codefutures.com
@danklynn

Agenda
¨  What is stream processing?
¨  Why should you care?
¨  Architecture Patterns
¨  Technologies
¨  Q&A

What is Stream Processing?
Streaming
Complex Event Processing
Event Sourcing
Change Data Capture
Reactive programming
Oh my!

What is Stream Processing?
Generally speaking, streaming is…
processing events
in the order they occur.

Because time happens in
order.

Databases already know this.
Master SlaveReplicationBinary log
files
Binary log
files
Binary log
files

ter Slave
Binary log
files
Binary log
files
Binary log
files

Binary log
files
Binary log
files
Binary log
files

Binary log
files
Binary log
files
Binary log
files
0 1 2 3 4 5 6 7 8 9 . . .
INSERT INTO employees …
UPDATE employees
SET salary = 1 MILLION DOLLARS

Replication is streaming
Master Slave
Binary log
files
Binary log
files
Binary log
files

What if you could leverage this?
Master
Binary log
files
Binary log
files
Binary log
files
Message
Queues
Web Servers
Devices
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files

What if you could leverage this?
Master
Binary log
files
Binary log
files
Binary log
files
Message
Queues
Web Servers
Devices
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Stream Processing

Lambda Architecture
Binary log
files
Binary log
files
events
Batch Computation
(hadoop)
Speed (real-time) Layer
Serving Layer

Kappa Architecture
Binary log
files
Binary log
files
events
Batch Computation
(hadoop)
Speed (real-time) Layer
Serving Layer
Replay-able Log

RabbitMQ
¨  http://www.rabbitmq.com/
¨  History
¤  Long history of successful deployments
¨  How it works
¤  Erlang-based implementation of the Advanced Message
Queuing Protocol (AMQP).
¤  Scales up well, but difficult to scale-out.

Apache Kafka
¨  http://kafka.apache.org/
¨  History
¤  Developed at LinkedIn. Healthy developer community.
¤  Built on the log-oriented processing concept.
¤  Partitions topics using a leader-follower model,
coordinated by Zookeeper.
¤  Low level and mid-level processing API.

Apache Storm
¨  https://storm.apache.org/
¨  History
¤  Created by Nathan Marz et al. while working at
Twitter.
¤  Processes events in parallel from a variety of sources.
¤  “At-least-once” reliability semantics.
¤  Structures processing as a graph of incremental
processing called a Topology.
¤  Blend of Java and Clojure

Apache Storm
¨  Gotchas
¤  Debugging can become complex.
¤  Finicky configuration
¨  Neat integrations
¤  Summingbird – Advanced analytics
¤  Trident – Cleaner API, stateful processing
¨  See also:
¤  http://www.slideshare.net/DanLynn1/storm-as-deep-into-
realtime-data-processing-as-you-can-get-in-30-minutes

Apache Samza
¨  http://samza.apache.org/
¨  History
¤  Originated at LinkedIn. Active development by
Confluent.
¤  Built on top of Kafka and YARN
¤  Provides a clean streaming API to logically connect
different streams of data
¤  Blend of Java and Scala

Apache Spark (Streaming)
¨  http://spark.apache.org/
¨  History
¤  Originated at UC Berkley AMPLAB.
¤  Very active community.
¤  Structures processing as a DAG of parallel computation.
¤  Can run on YARN or Mesos
¤  Stream processing is supported using small “micro-
batches” (~1 second)
¤  Developed in Scala. Java and Python supported.

LinkedIn Databus
¨  https://github.com/linkedin/databus
¨  History
¤  Originated at LinkedIn. Precursor to Kafka.
¤  Based of of “Database Log mining” concept.
¤  Follows the replication log of source database(s)
¤  Uses “Data Relays” to replicate and transform data into
other repositories.

AgilData (obligatory sales slide)
¨  http://www.agildata.com/
¨  History
¤  Successor to dbShards, CodeFutures successful relational
sharding product.
¤  Extends dbShards high performance replication &
partitioning infrastructure to handle stream processing.
¤  Very simple SQL-based API (w/ pluggable UDFs)
¤  Advanced execution planner.

Hailstorm
¨  https://github.com/hailstorm-hs/hailstorm
¨  History
¤  Haskell-based improvement upon Storm.
¤  Improves the implementation of exactly-once semantics in
Storm by leveraging Kafka and Zookeeper.
¨  Very early but promising.

dan.lynn@codefutures.com
@danklynn
Thanks!

Data Streaming Technology Overview

More Related Content

What's hot

Viewers also liked

Similar to Data Streaming Technology Overview

More from Dan Lynn

Recently uploaded

Data Streaming Technology Overview