Data Streaming Technology
Overview
Dan Lynn, CEO
CodeFutures
www.codefutures.com
@danklynn
Agenda
¨  What is stream processing?
¨  Why should you care?
¨  Architecture Patterns
¨  Technologies
¨  Q&A
What is Stream Processing?
Streaming
Complex Event Processing
Event Sourcing
Change Data Capture
Reactive programming
Oh my!
What is Stream Processing?
Generally speaking, streaming is…
processing events
in the order they occur.
Ok, why should you care?
Because time happens in
order.
Databases already know this.
Master SlaveReplicationBinary log
files
Binary log
files
Binary log
files
ter Slave
Binary log
files
Binary log
files
Binary log
files
Databases already know this.
Databases already know this.
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
0 1 2 3 4 5 6 7 8 9 . . .
INSERT INTO employees …
UPDATE employees
SET salary = 1 MILLION DOLLARS
Databases already know this.
Replication is streaming
Master Slave
Binary log
files
Binary log
files
Binary log
files
What if you could leverage this?
Master
Binary log
files
Binary log
files
Binary log
files
Message
Queues
Web Servers
Devices
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
What if you could leverage this?
Master
Binary log
files
Binary log
files
Binary log
files
Message
Queues
Web Servers
Devices
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
What if you could leverage this?
Master
Binary log
files
Binary log
files
Binary log
files
Message
Queues
Web Servers
Devices
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Binary log
files
Stream Processing
Architecture Patterns
Lambda Architecture
Binary log
files
Binary log
files
events
Batch Computation
(hadoop)
Speed (real-time) Layer
Serving Layer
Kappa Architecture
Binary log
files
Binary log
files
events
Batch Computation
(hadoop)
Speed (real-time) Layer
Serving Layer
Replay-able Log
Technologies
RabbitMQ
¨  http://www.rabbitmq.com/
¨  History
¤  Long history of successful deployments
¨  How it works
¤  Erlang-based implementation of the Advanced Message
Queuing Protocol (AMQP).
¤  Scales up well, but difficult to scale-out.
Apache Kafka
¨  http://kafka.apache.org/
¨  History
¤  Developed at LinkedIn. Healthy developer community.
¨  How it works
¤  Built on the log-oriented processing concept.
¤  Partitions topics using a leader-follower model,
coordinated by Zookeeper.
¤  Low level and mid-level processing API.
Apache Storm
¨  https://storm.apache.org/
¨  History
¤  Created by Nathan Marz et al. while working at
Twitter.
¨  How it works
¤  Processes events in parallel from a variety of sources.
¤  “At-least-once” reliability semantics.
¤  Structures processing as a graph of incremental
processing called a Topology.
¤  Blend of Java and Clojure
Apache Storm
¨  Gotchas
¤  Debugging can become complex.
¤  Finicky configuration
¨  Neat integrations
¤  Summingbird – Advanced analytics
¤  Trident – Cleaner API, stateful processing
¨  See also:
¤  http://www.slideshare.net/DanLynn1/storm-as-deep-into-
realtime-data-processing-as-you-can-get-in-30-minutes
Apache Samza
¨  http://samza.apache.org/
¨  History
¤  Originated at LinkedIn. Active development by
Confluent.
¨  How it works
¤  Built on top of Kafka and YARN
¤  Provides a clean streaming API to logically connect
different streams of data
¤  Blend of Java and Scala
Apache Spark (Streaming)
¨  http://spark.apache.org/
¨  History
¤  Originated at UC Berkley AMPLAB.
¤  Very active community.
¨  How it works
¤  Structures processing as a DAG of parallel computation.
¤  Can run on YARN or Mesos
¤  Stream processing is supported using small “micro-
batches” (~1 second)
¤  Developed in Scala. Java and Python supported.
LinkedIn Databus
¨  https://github.com/linkedin/databus
¨  History
¤  Originated at LinkedIn. Precursor to Kafka.
¨  How it works
¤  Based of of “Database Log mining” concept.
¤  Follows the replication log of source database(s)
¤  Uses “Data Relays” to replicate and transform data into
other repositories.
AgilData (obligatory sales slide)
¨  http://www.agildata.com/
¨  History
¤  Successor to dbShards, CodeFutures successful relational
sharding product.
¨  How it works
¤  Extends dbShards high performance replication &
partitioning infrastructure to handle stream processing.
¤  Very simple SQL-based API (w/ pluggable UDFs)
¤  Advanced execution planner.
Hailstorm
¨  https://github.com/hailstorm-hs/hailstorm
¨  History
¤  Haskell-based improvement upon Storm.
¨  How it works
¤  Improves the implementation of exactly-once semantics in
Storm by leveraging Kafka and Zookeeper.
¨  Very early but promising.
dan.lynn@codefutures.com
@danklynn
Thanks!

Data Streaming Technology Overview

  • 1.
    Data Streaming Technology Overview DanLynn, CEO CodeFutures www.codefutures.com @danklynn
  • 2.
    Agenda ¨  What isstream processing? ¨  Why should you care? ¨  Architecture Patterns ¨  Technologies ¨  Q&A
  • 3.
    What is StreamProcessing? Streaming Complex Event Processing Event Sourcing Change Data Capture Reactive programming Oh my!
  • 4.
    What is StreamProcessing? Generally speaking, streaming is… processing events in the order they occur.
  • 5.
    Ok, why shouldyou care?
  • 6.
  • 7.
    Databases already knowthis. Master SlaveReplicationBinary log files Binary log files Binary log files
  • 8.
    ter Slave Binary log files Binarylog files Binary log files Databases already know this.
  • 9.
    Databases already knowthis. Binary log files Binary log files Binary log files
  • 10.
    Binary log files Binary log files Binarylog files 0 1 2 3 4 5 6 7 8 9 . . . INSERT INTO employees … UPDATE employees SET salary = 1 MILLION DOLLARS Databases already know this.
  • 11.
    Replication is streaming MasterSlave Binary log files Binary log files Binary log files
  • 12.
    What if youcould leverage this? Master Binary log files Binary log files Binary log files Message Queues Web Servers Devices Binary log files Binary log files Binary log files Binary log files Binary log files Binary log files
  • 13.
    What if youcould leverage this? Master Binary log files Binary log files Binary log files Message Queues Web Servers Devices Binary log files Binary log files Binary log files Binary log files Binary log files Binary log files
  • 14.
    What if youcould leverage this? Master Binary log files Binary log files Binary log files Message Queues Web Servers Devices Binary log files Binary log files Binary log files Binary log files Binary log files Binary log files Stream Processing
  • 15.
  • 16.
    Lambda Architecture Binary log files Binarylog files events Batch Computation (hadoop) Speed (real-time) Layer Serving Layer
  • 17.
    Kappa Architecture Binary log files Binarylog files events Batch Computation (hadoop) Speed (real-time) Layer Serving Layer Replay-able Log
  • 18.
  • 19.
    RabbitMQ ¨  http://www.rabbitmq.com/ ¨  History ¤ Long history of successful deployments ¨  How it works ¤  Erlang-based implementation of the Advanced Message Queuing Protocol (AMQP). ¤  Scales up well, but difficult to scale-out.
  • 20.
    Apache Kafka ¨  http://kafka.apache.org/ ¨ History ¤  Developed at LinkedIn. Healthy developer community. ¨  How it works ¤  Built on the log-oriented processing concept. ¤  Partitions topics using a leader-follower model, coordinated by Zookeeper. ¤  Low level and mid-level processing API.
  • 21.
    Apache Storm ¨  https://storm.apache.org/ ¨ History ¤  Created by Nathan Marz et al. while working at Twitter. ¨  How it works ¤  Processes events in parallel from a variety of sources. ¤  “At-least-once” reliability semantics. ¤  Structures processing as a graph of incremental processing called a Topology. ¤  Blend of Java and Clojure
  • 22.
    Apache Storm ¨  Gotchas ¤ Debugging can become complex. ¤  Finicky configuration ¨  Neat integrations ¤  Summingbird – Advanced analytics ¤  Trident – Cleaner API, stateful processing ¨  See also: ¤  http://www.slideshare.net/DanLynn1/storm-as-deep-into- realtime-data-processing-as-you-can-get-in-30-minutes
  • 23.
    Apache Samza ¨  http://samza.apache.org/ ¨ History ¤  Originated at LinkedIn. Active development by Confluent. ¨  How it works ¤  Built on top of Kafka and YARN ¤  Provides a clean streaming API to logically connect different streams of data ¤  Blend of Java and Scala
  • 24.
    Apache Spark (Streaming) ¨ http://spark.apache.org/ ¨  History ¤  Originated at UC Berkley AMPLAB. ¤  Very active community. ¨  How it works ¤  Structures processing as a DAG of parallel computation. ¤  Can run on YARN or Mesos ¤  Stream processing is supported using small “micro- batches” (~1 second) ¤  Developed in Scala. Java and Python supported.
  • 25.
    LinkedIn Databus ¨  https://github.com/linkedin/databus ¨ History ¤  Originated at LinkedIn. Precursor to Kafka. ¨  How it works ¤  Based of of “Database Log mining” concept. ¤  Follows the replication log of source database(s) ¤  Uses “Data Relays” to replicate and transform data into other repositories.
  • 26.
    AgilData (obligatory salesslide) ¨  http://www.agildata.com/ ¨  History ¤  Successor to dbShards, CodeFutures successful relational sharding product. ¨  How it works ¤  Extends dbShards high performance replication & partitioning infrastructure to handle stream processing. ¤  Very simple SQL-based API (w/ pluggable UDFs) ¤  Advanced execution planner.
  • 27.
    Hailstorm ¨  https://github.com/hailstorm-hs/hailstorm ¨  History ¤ Haskell-based improvement upon Storm. ¨  How it works ¤  Improves the implementation of exactly-once semantics in Storm by leveraging Kafka and Zookeeper. ¨  Very early but promising.
  • 28.