Jennifer Rawlins
Real Time Streaming with Kafka - for the data scientist
About Me
Jenn Rawlins has been creating software solutions for 19 years. She began her
career at Microsoft as an engineer in test, and as an international program
manager. This was followed by management and consultant roles, working with
VPs and Directors across multiple industries to create custom software solutions.
She then changed her focus to software engineering roles.
Recently Jenn has created Big Data solutions using Hadoop, Yarn, Kafka, and
Cassandra, writing real time streaming solutions in Java and Scala. Her current
focus is a solution in AWS for IoT devices.
AGENDA
❖ Messaging Systems
❖ Kafka
❖ SparkR
❖ Data Processing Pipelines
What is a message queueing system
Messages are sent to a queue. Messages are read from a queue. The queue is independent of the
senders or receivers (Publishers/Subscribers or Producers/Consumers). Fast, Predictable, easy to scale.
Cloud solutions
Amazon SQS - Simple Queue Service
Azure Service Bus
Server Solutions
Kafka
IBM WebSphere MQ
RabbitMQ
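The decoupling described above (senders and receivers only know about the queue, not about each other) can be sketched with Python's standard library. This is a toy in-process illustration, not Kafka or any of the products listed; the `run_demo` function and the `SENTINEL` end-of-stream marker are assumptions made for the example.

```python
import queue
import threading


def run_demo(num_messages=5):
    """Send messages through a queue from one thread and read them in another."""
    q = queue.Queue()
    received = []
    SENTINEL = None  # hypothetical end-of-stream marker for this sketch

    def producer():
        for i in range(num_messages):
            q.put(f"message-{i}")  # the producer only talks to the queue
        q.put(SENTINEL)

    def consumer():
        while True:
            msg = q.get()  # the consumer only talks to the queue
            if msg is SENTINEL:
                break
            received.append(msg)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo(3))
```

Neither function holds a reference to the other; swapping in a different producer or adding a second queue would not require changing the consumer.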
Kafka
LinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data
between applications, stream processing, and Hadoop data ingestion.
REAL-TIME STREAMING
1. Data pipelines that reliably get data between systems or applications.
2. Applications to transform or react to streams of data.
Real Time: Process streams of records as they occur. Data in, data out.
Fault Tolerant: Store streams of records in a fault-tolerant way.
Highly Scalable (Horizontal): Nodes can be added and removed from a
Kafka cluster and the cluster will rebalance itself. High availability begins at 5 nodes.
Ordering is guaranteed within a partition: records are kept in the order they were received.
Parallel processing of partitioned topics.
Multi-publisher (producer): Kafka writes each message, as received, to a specific topic,
balancing across that topic's partitions.
Multi-subscriber (consumer): each partition is assigned to a specific subscriber.
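A minimal sketch of the three ideas above: records are spread across partitions, order is kept inside each partition, and each partition is handed to one consumer. It is plain Python, not the Kafka client API. The partition count, the CRC32 hash (a stand-in; Kafka's own partitioner differs), and the round-robin assignment are all assumptions for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition from the record key with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(records, num_partitions=NUM_PARTITIONS):
    """Append each (key, value) record to its partition in arrival order."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def assign(num_partitions, consumers):
    """Give every partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


if __name__ == "__main__":
    recs = [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]
    print(dict(publish(recs)))
    print(assign(NUM_PARTITIONS, ["consumer-A", "consumer-B"]))
```

Because the same key always hashes to the same partition, all records for one key are read back in the order they arrived, even though different partitions are processed in parallel.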
[Diagram: multiple Producers write to a Kafka Cluster; multiple Consumers read from it.]
Record: Consists of a key, a value, and a timestamp (a message).
Topic: Kafka stores streams of records in categories called topics.
Cluster: Kafka is run as a cluster on one or more servers.
Broker: The actual server, and the synchronization layer between server instances.
Node: The logical Kafka entity or ‘worker’ on each server.
Publish and subscribe to streams of records. Similar to a message queue or
enterprise messaging system.
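The Record and Topic definitions above can be modeled in a few lines. This is a single-process toy, assuming one partition per topic; the `TopicLog` class, its method names, and the use of a list index as the record's position ("offset") are illustrative choices, not Kafka's API.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """A record: key, value, and timestamp (milliseconds since the epoch)."""
    key: str
    value: str
    timestamp: int


@dataclass
class TopicLog:
    """Toy single-partition topic: an append-only list addressed by offset."""
    name: str
    records: List[Record] = field(default_factory=list)

    def append(self, key: str, value: str, timestamp: Optional[int] = None) -> int:
        ts = int(time.time() * 1000) if timestamp is None else timestamp
        self.records.append(Record(key, value, ts))
        return len(self.records) - 1  # offset of the record just written

    def read_from(self, offset: int) -> List[Record]:
        """Return every record at or after the given offset."""
        return self.records[offset:]
```

Reading does not remove anything from the log, so a subscriber can re-read from an earlier offset; that is the main difference from a queue that deletes messages on delivery.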
Publish and Consume streams of records.
Process streams of records efficiently and in real time.
Store streams of records safely in a distributed, replicated cluster. Fault Tolerant.
A Stream is an unbounded, continuously updating data set. A stream is an
ordered, replayable, and fault-tolerant sequence of immutable data records.
A Stream DSL is stateful, and is a processor topology.
# Example: a record stream for page view events
1 => {"time":1440557383335, "user_id":1, "url":"/home?user=1"}
5 => {"time":1440557383345, "user_id":5, "url":"/home?user=5"}
2 => {"time":1440557383456, "user_id":2, "url":"/profile?user=2"}
1 => {"time":1440557385365, "user_id":1, "url":"/profile?user=1"}
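As a rough sketch of stateful processing over the page view stream above, the lines can be parsed into (key, value) pairs and counted per user. The `key => JSON` text format is taken from the example; the function names and the running-count logic are assumptions for illustration, written in plain Python rather than the Streams DSL.

```python
import json
from collections import Counter

RAW_STREAM = [
    '1 => {"time":1440557383335, "user_id":1, "url":"/home?user=1"}',
    '5 => {"time":1440557383345, "user_id":5, "url":"/home?user=5"}',
    '2 => {"time":1440557383456, "user_id":2, "url":"/profile?user=2"}',
    '1 => {"time":1440557385365, "user_id":1, "url":"/profile?user=1"}',
]


def parse(line):
    """Split one 'key => json' line into an integer key and a dict value."""
    key, _, payload = line.partition(" => ")
    return int(key), json.loads(payload)


def views_per_user(lines):
    """Keep a running count of page views per record key (the state)."""
    counts = Counter()
    for line in lines:
        key, _value = parse(line)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    print(views_per_user(RAW_STREAM))
```

In a real stream the input never ends, so the counter would be updated and emitted continuously instead of being returned once.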
Typical Use Cases
Message Broker (in place of ActiveMQ or RabbitMQ)
Website Activity Tracking
Metrics - monitoring
Log Aggregation
Stream Processing
Website Activity Tracking
Tracking website activity:
Ad clicks,
page views,
searches,
or other actions users may take.
A record of each activity is published to central topics, with one topic per activity type.
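The "one topic per activity type" layout can be sketched as a small router. The `activity.` topic prefix, the event field names, and the in-memory dictionary standing in for the cluster are all hypothetical.

```python
from collections import defaultdict


def topic_for(activity_type):
    """Map an activity type to its topic name (hypothetical naming scheme)."""
    return f"activity.{activity_type}"


def publish_activity(topics, event):
    """Append the event to the topic that matches its activity type."""
    topics[topic_for(event["type"])].append(event)


if __name__ == "__main__":
    topics = defaultdict(list)  # stand-in for the Kafka cluster
    publish_activity(topics, {"type": "page_view", "user_id": 1, "url": "/home"})
    publish_activity(topics, {"type": "search", "user_id": 1, "query": "kafka"})
    publish_activity(topics, {"type": "page_view", "user_id": 2, "url": "/profile"})
    print({name: len(events) for name, events in topics.items()})
```

Keeping each activity type in its own topic lets a consumer that only cares about searches subscribe to that topic alone.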
[Diagram: Kafka Cluster at the center. Producers (write): User Action, Applications. Consumers (read): Applications, Real Time Processors. Connectors link the cluster to Data Stores. Stream Processors.]
Platforms: Spark runs on Hadoop YARN, Apache Mesos, in Standalone cluster
mode, or on EC2.
Languages: Can be used from Scala, Python, and R shells.
Processing: Runs jobs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk.
R limitations
R is a popular statistical programming language used for data processing and
machine learning tasks.
Data Analysis is usually limited to a single thread, and the memory available on a
single computer.
SparkR: R package for Apache Spark
Developed at the AMPLab, SparkR was accepted and merged into Spark in version 1.4.
Provides an R frontend to Apache Spark.
Uses Spark's data sources API to read from a variety of sources:
Hive (Hadoop), JSON files, Parquet files.
Uses Spark’s distributed computation engine to run large-scale data analysis from
the R shell on a cluster: many cores, many machines.
SparkDataFrame (a distributed collection of data organized in named columns)
inherits optimizations from the computation engine.
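The idea of moving from one thread on one machine to many workers can be illustrated with a split, partial-aggregate, combine pattern. This sketch uses Python threads on a single machine and is not the SparkR or Spark API; the function names and the partition count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done independently on one partition: a (sum, count) pair."""
    return sum(chunk), len(chunk)


def distributed_mean(values, num_partitions=4):
    """Split the data, aggregate each partition in parallel, then combine."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = [values[i::num_partitions] for i in range(num_partitions)]
    chunks = [c for c in chunks if c]  # drop empty partitions
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(distributed_mean(list(range(1, 101))))
```

Only the small (sum, count) pairs are combined at the end, which is why this kind of aggregation scales when the partitions live on different machines.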
MLlib and SparkR
Machine Learning algorithms currently supported:
Generalized Linear Model
Accelerated Failure Time (AFT) Survival Regression Model
Naive Bayes Model
KMeans Model
Real Time Record Processing
Example Real Time Scenario: Serve up related ads to user that are more likely to
be clicked
[Diagram columns: Website, Application, Kafka Data Stream, Spark Streaming]
1. Website: User clicks ad.
2. Application: Log click record.
3. Kafka Data Stream: Record added to AdClick topic.
4. Spark Streaming: Run the ad click through the model to update the predictive score.
5. Use AdClick to find related ads to serve to the user, using predictive scoring.
6. Display new ads to user.
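The click-to-new-ads loop in this scenario can be sketched as a small scoring class. Everything here is hypothetical: the slides do not specify the model, so an exponential moving average stands in for it, and the `related` lookup table, the `ALPHA` factor, and the ad ids are invented for illustration.

```python
from collections import defaultdict

ALPHA = 0.3  # hypothetical smoothing factor for the stand-in model


class AdScorer:
    """Toy predictive scoring: clicked ads drift toward a score of 1.0."""

    def __init__(self, related):
        self.related = related            # ad id -> candidate related ad ids
        self.scores = defaultdict(float)  # ad id -> current predictive score

    def on_click(self, ad_id):
        """Update the clicked ad's score, then rank its related ads."""
        self.scores[ad_id] = (1 - ALPHA) * self.scores[ad_id] + ALPHA * 1.0
        candidates = self.related.get(ad_id, [])
        # Highest score first; ties keep their original order.
        return sorted(candidates, key=lambda a: self.scores[a], reverse=True)


if __name__ == "__main__":
    scorer = AdScorer({"shoes": ["socks", "boots"], "boots": ["shoes", "socks"]})
    scorer.on_click("boots")
    print(scorer.on_click("shoes"))
```

In the pipeline described above, `on_click` would be driven by records arriving on the AdClick topic, and the returned list would be sent back to the website for display.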
Real-time process user data using an R model in a Spark job.
Batch process data from Kafka, Hadoop HDFS, SQL, Cassandra, HBase
Model Training multiple times with SparkR from multiple data sources
Historical Record Batch Processing
[Diagram: historical batch processing]
Kafka Data Streams: AdClick, HomePageView, and AdView topics.
Spark job: AdClick topic: run recent records through model.
SparkR & Spark: AdClick model training on historical data.
Data stores (Hadoop Hive, Cassandra, SQL): Pull topics to create stores of data for many related features.
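One way to picture "pull topics to create stores of data for related features" is a batch job that joins the AdView and AdClick histories into a per-ad click-through rate. The record layout (`ad_id` field) and the choice of click-through rate as the feature are assumptions for illustration; the slides do not say which features are built.

```python
from collections import Counter


def click_through_rates(ad_views, ad_clicks):
    """Batch feature: clicks divided by views for every ad that was viewed."""
    views = Counter(r["ad_id"] for r in ad_views)
    clicks = Counter(r["ad_id"] for r in ad_clicks)
    return {ad: clicks[ad] / n for ad, n in views.items()}


if __name__ == "__main__":
    # Hypothetical historical records pulled from the two topics.
    ad_views = [{"ad_id": "a"}] * 4 + [{"ad_id": "b"}] * 2
    ad_clicks = [{"ad_id": "a"}]
    print(click_through_rates(ad_views, ad_clicks))
```

A table like this could be stored in one of the data stores and used as a training input alongside the raw click records.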
Language
Kafka is written in Java
In Kafka the communication between the clients and the servers is done with a simple, high-performance,
language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with
older versions. We provide a Java client for Kafka, but clients are available in many languages:
Java
C/C++
Python
Go (AKA golang)
Erlang
.NET
Clojure
Ruby
Node.js
Proxy (HTTP REST, etc)
Perl
stdin/stdout
PHP
Rust
Alternative Java
Storm
Scala DSL
Clojure
Kafka http://kafka.apache.org/
Free and Open Source Software under the Apache License
Github code repo: https://github.com/apache/kafka
Confluent http://www.confluent.io/
Open Source offering Consulting, Training, Support, Monitoring Tools
Confluent Docs: http://docs.confluent.io/3.0.0/streams/developer-guide.html
Examples: https://github.com/confluentinc/examples/tree/kafka-0.10.0.0-cp-3.0.0/kafka-streams/src/main/java/io/confluent/examples/streams
Resources
LinkedIn Story:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Benchmarking:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
SparkR
https://spark.apache.org/docs/latest/sparkr.html
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
https://cs.stanford.edu/~matei/papers/2016/sigmod_sparkr.pdf


Editor's Notes

  • #10 Topics are partitioned and replicated across brokers for fault tolerance.
  • #24 Founded September 23, 2014