Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

Spark Summit
Spark SummitSpark Summit
Streaming Analytics with Spark,
Kafka, Cassandra, and Akka
Helena Edelson
VP of Product Engineering @Tuplejump
• Committer / Contributor: Akka, FiloDB, Spark Cassandra
Connector, Spring Integration
• VP of Product Engineering @Tuplejump
• Previously: Sr Cloud Engineer / Architect at VMware,
CrowdStrike, DataStax and SpringSource
Who
@helenaedelson
github.com/helena
Tuplejump Open Source
github.com/tuplejump
• FiloDB - distributed, versioned, columnar analytical db for modern
streaming workloads
• Calliope - the first Spark-Cassandra integration
• Stargate - Lucene indexer for Cassandra
• SnackFS - HDFS-compatible file system for Cassandra
3
What Will We Talk About
• The Problem Domain
• Example Use Case
• Rethinking Architecture
– We don't have to look far to look back
– Streaming
– Revisiting the goal and the stack
– Simplification
THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data
5
The Problem Domain
Need to build scalable, fault tolerant, distributed data
processing systems that can handle massive amounts of
data from disparate sources, with different data structures.
6
Translation
How to build adaptable, elegant systems
for complex analytics and learning tasks
to run as large-scale clustered dataflows
7
How Much Data
Yottabyte = quadrillion gigabytes or septillion
bytes
8
We all have a lot of data
• Terabytes
• Petabytes...
https://en.wikipedia.org/wiki/Yottabyte
Delivering Meaning
• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical from the stream
While We Monitor, Predict & Proactively Handle
• Massive event spikes
• Bursty traffic
• Fast producers / slow consumers
• Network partitioning & Out of sync systems
• DC down
• Wait, we've DDOS'd ourselves from fast streams?
• Autoscale issues
– When we scale down VMs how do we not loose data?
And stay within our
AWS / Rackspace budget
EXAMPLE CASE:
CYBER SECURITY
Hunting The Hunter
12
13
• Track activities of international threat actor groups,
nation-state, criminal or hactivist
• Intrusion attempts
• Actual breaches
• Profile adversary activity
• Analysis to understand their motives, anticipate actions
and prevent damage
Adversary Profiling & Hunting
14
• Machine events
• Endpoint intrusion detection
• Anomalies/indicators of attack or compromise
• Machine learning
• Training models based on patterns from historical data
• Predict potential threats
• profiling for adversary Identification
•
Stream Processing
Data Requirements & Description
• Streaming event data
• Log messages
• User activity records
• System ops & metrics data
• Disparate data sources
• Wildly differing data structures
15
Massive Amounts Of Data
16
• One machine can generate 2+ TB per day
• Tracking millions of devices
• 1 million writes per second - bursty
• High % writes, lower % reads
• TTL
RETHINKING
ARCHITECTURE
17
WE DON'T HAVE TO LOOK
FAR TO LOOK BACK
18
Rethinking Architecture
19
Most batch analytics flow from
several years ago looked like...
STREAMING & DATA SCIENCE
20
Rethinking Architecture
Streaming
I need fast access to historical data on the fly for
predictive modeling with real time data from the stream.
21
Not A Stream, A Flood
• Data emitters
• Netflix: 1 - 2 million events per second at peak
• 750 billion events per day
• LinkedIn: > 500 billion events per day
• Data ingesters
• Netflix: 50 - 100 billion events per day
• LinkedIn: 2.5 trillion events per day
• 1 Petabyte of streaming data
22
Which Translates To
• Do it fast
• Do it cheap
• Do it at scale
23
Challenges
• Code changes at runtime
• Distributed Data Consistency
• Ordering guarantees
• Complex compute algorithms
24
Oh, and don't lose data
25
Strategies
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
26
• Data lineage and reprocessing in
runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
AND THEN WE GREEKED OUT
27
Rethinking Architecture
Lambda Architecture
A data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch and
stream processing methods.
28
Lambda Architecture
A data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch and
stream processing methods.
• An approach
• Coined by Nathan Marz
• This was a huge stride forward
29
30
https://www.mapr.com/developercentral/lambda-architecture
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Implementing Is Hard
32
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
Performance Tuning & Monitoring:
on so many systems
33
Also hard
Lambda Architecture
An immutable sequence of records is captured and fed
into a batch system and a stream processing
system in parallel.
34
WAIT, DUAL SYSTEMS?
35
Challenge Assumptions
Which Translates To
• Performing analytical computations & queries in dual
systems
• Implementing transformation logic twice
• Duplicate Code
• Spaghetti Architecture for Data Flows
• One Busy Network
36
Why Dual Systems?
• Why is a separate batch system needed?
• Why support code, machines and running services of
two analytics systems?
37
Counter productive on some level?
YES
38
• A unified system for streaming and batch
• Real-time processing and reprocessing
• Code changes
• Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-
architecture.html - Jay Kreps
ANOTHER ASSUMPTION:
ETL
39
Challenge Assumptions
Extract, Transform, Load (ETL)
40
"Designing and maintaining the ETL process is often
considered one of the most difficult and resource-
intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
Extract, Transform, Load (ETL)
41
ETL involves
• Extraction of data from one system into another
• Transforming it
• Loading it into another system
Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often
considered one of the most difficult and resource-
intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
42
Also unnecessarily redundant and often typeless
ETL
43
• Each ETL step can introduce errors and risk
• Can duplicate data after failover
• Tools can cost millions of dollars
• Decreases throughput
• Increased complexity
ETL
• Writing intermediary files
• Parsing and re-parsing plain text
44
And let's duplicate the pattern
over all our DataCenters
45
46
These are not the solutions you're looking for
REVISITING THE GOAL
& THE STACK
47
Removing The 'E' in ETL
Thanks to technologies like Avro and Protobuf we don’t need the
“E” in ETL. Instead of text dumps that you need to parse over
multiple systems:
Scala & Avro (e.g.)

• Can work with binary data that remains strongly typed

• A return to strong typing in the big data ecosystem
48
Removing The 'L' in ETL
If data collection is backed by a distributed messaging
system (e.g. Kafka) you can do real-time fanout of the
ingested data to all consumers. No need to batch "load".
• From there each consumer can do their own transformations
49
#NoMoreGreekLetterArchitectures
50
NoETL
51
Strategy Technologies
Scalable Infrastructure / Elastic Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency Spark,Cassandra, Akka Cluster all hash the node ring
Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style
Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka
Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection Cassandra, Spark, Akka, Kafka
Consensus & Gossip Cassandra & Akka Cluster
Parallelism Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing Kafka, Akka, Spark
Fast, Low Latency, Data Locality Cassandra, Spark, Kafka
Location Transparency Akka, Spark, Cassandra, Kafka
My Nerdy Chart
52
SMACK
• Scala/Spark
• Mesos
• Akka
• Cassandra
• Kafka
53
Spark Streaming
54
Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to
disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
55
How do I merge historical data with data
in the stream?
56
Join Streams With Static Data
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.checkpoint("checkpoint")
val staticData: RDD[(Int,String)] =
ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)
val stream: DStream[(Int,String)] =
KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))
.transform { events => events.join(staticData))
.saveToCassandra(keyspace,table)
ssc.start()
57
Training
Data
Feature
Extraction
Model
Training
Model
Testing
Test Data
Your Data Extract Data To Analyze
Train your model to predict
Spark MLLib
58
Spark Streaming & ML
59
val context = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline
val stream = KafkaUtils
.createStream(ssc, zkQuorum, group,..)
.map(event => model.predict(event.feature))
Apache Mesos
Open-source cluster manager developed at UC Berkeley.
Abstracts CPU, memory, storage, and other compute resources
away from machines (physical or virtual), enabling fault-tolerant
and elastic distributed systems to easily be built and run
effectively.
60
Akka
High performance concurrency framework for Scala and
Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
61
Akka Actors
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
62
63
Akka Actor Hierarchy
http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala
import akka.actor._
class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {
val temperature = context.actorOf(
Props(new TemperatureActor(args)), "temperature")
val precipitation = context.actorOf(
Props(new PrecipitationActor(args)), "precipitation")
override def preStart(): Unit = { /* lifecycle hook: init */ }
def receive : Actor.Receive = {
case Initialized => context become initialized
}
def initialized : Actor.Receive = {
case e: SomeEvent => someFunc(e)
case e: OtherEvent => otherFunc(e)
}
}
64
Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication 65
Apache Cassandra
• Very flexible data modeling (collections, user defined
types) and changeable over time
• Perfect for ingestion of real time / machine data
• Huge community
66
Spark Cassandra Connector
• NOSQL JOINS!
• Write & Read data between Spark and Cassandra
• Compatible with Spark 1.4
• Handles Data Locality for Speed
• Implicit type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration
67
http://github.com/datastax/spark-cassandra-connector
KillrWeather
68
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and
batch data processing with Apache Spark Streaming, Apache
Cassandra, Apache Kafka and Akka for fast, streaming computations
on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/
databricks/apps/weather
69
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Support Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
Spark Streaming & Kafka
val context = new StreamingContext(conf, Seconds(1))
val wordCount = KafkaUtils.createStream(context, ...)
.flatMap(_.split(" "))
.map(x => (x, 1))
.reduceByKey(_ + _)
wordCount.saveToCassandra(ks,table)
context.start() // start receiving and computing
70
71
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext)
extends AggregationActor(settings: Settings) {

import settings._


val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](

ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)

.map(_._2.split(","))

.map(RawWeatherData(_))



kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

/** RawWeatherData: wsid, year, month, day, oneHourPrecip */

kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))

.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)



/** Now the [[StreamingContext]] can be started. */

context.parent ! OutputStreamInitialized



def receive : Actor.Receive = {…}
}
Gets the partition key: Data Locality
Spark C* Connector feeds this to Spark
Cassandra Counter column in our schema,
no expensive `reduceByKey` needed. Simply
let C* do it: not expensive and fast.
72
/** For a given weather station, calculates annual cumulative precip - or year to date. */

class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {



def receive : Actor.Receive = {

case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)

case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)

}



/** Computes annual aggregation.Precipitation values are 1 hour deltas from the previous. */

def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =

ssc.cassandraTable[Double](keyspace, dailytable)

.select("precipitation")

.where("wsid = ? AND year = ?", wsid, year)

.collectAsync()

.map(AnnualPrecipitation(_, wsid, year)) pipeTo requester



/** Returns the 10 highest temps for any station in the `year`. */

def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {

val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,

ssc.sparkContext.parallelize(aggregate).top(k).toSeq)



ssc.cassandraTable[Double](keyspace, dailytable)

.select("precipitation")

.where("wsid = ? AND year = ?", wsid, year)

.collectAsync().map(toTopK) pipeTo requester

}

}
A New Approach
• One Runtime: streaming, scheduled
• Simplified architecture
• Allows us to
• Write different types of applications
• Write more type safe code
• Write more reusable code
73
Need daily analytics aggregate reports? Do it in the stream, save
results in Cassandra for easy reporting as needed - with data
locality not offered by S3.
FiloDB
Distributed, columnar database designed to run very fast
analytical queries
• Ingest streaming data from many streaming sources
• Row-level, column-level operations and built in versioning
offer greater flexibility than file-based technologies
• Currently based on Apache Cassandra & Spark
• github.com/tuplejump/FiloDB
75
FiloDB
• Breakthrough performance levels for analytical queries
• Performance comparable to Parquet
• One to two orders of magnitude faster than Spark on
Cassandra 2.x
• Versioned - critical for reprocessing logic/code changes
• Can simplify your infrastructure dramatically
• Queries run in parallel in Spark for scale-out ad-hoc analysis
• Space-saving techniques
76
WRAPPING UP
77
Architectyr?
78
"This is a giant mess"
- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
79
Simplified
80
81
www.tuplejump.com
info@tuplejump.com@tuplejump
82
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
THANK YOU!
I'm speaking at QCon SF on the broader
topic of Streaming at Scale
http://qconsf.com/sf2015/track/streaming-data-scale
83
1 of 83

Recommended

A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... by
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
6.2K views67 slides
Tame the small files problem and optimize data layout for streaming ingestion... by
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
808 views68 slides
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S... by
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...HostedbyConfluent
4.6K views16 slides
Autoscaling Flink with Reactive Mode by
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
925 views17 slides
How to Automate Performance Tuning for Apache Spark by
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkDatabricks
1.8K views36 slides
Flexible and Real-Time Stream Processing with Apache Flink by
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
2.2K views38 slides

More Related Content

What's hot

Parquet performance tuning: the missing guide by
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
40.5K views44 slides
Evening out the uneven: dealing with skew in Flink by
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
2.5K views35 slides
Spark SQL Deep Dive @ Melbourne Spark Meetup by
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
9K views57 slides
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect... by
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...HostedbyConfluent
664 views18 slides
Flink vs. Spark by
Flink vs. SparkFlink vs. Spark
Flink vs. SparkSlim Baltagi
69.5K views67 slides
How to build a streaming Lakehouse with Flink, Kafka, and Hudi by
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
488 views16 slides

What's hot(20)

Parquet performance tuning: the missing guide by Ryan Blue
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue40.5K views
Evening out the uneven: dealing with skew in Flink by Flink Forward
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward2.5K views
Spark SQL Deep Dive @ Melbourne Spark Meetup by Databricks
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks9K views
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect... by HostedbyConfluent
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
HostedbyConfluent664 views
Flink vs. Spark by Slim Baltagi
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
Slim Baltagi69.5K views
How to build a streaming Lakehouse with Flink, Kafka, and Hudi by Flink Forward
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward488 views
Kafka for Real-Time Replication between Edge and Hybrid Cloud by Kai Wähner
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner1.4K views
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... by Databricks
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks8.4K views
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... by HostedbyConfluent
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent992 views
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink by Databricks
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks677 views
Performance Monitoring: Understanding Your Scylla Cluster by ScyllaDB
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
ScyllaDB5.1K views
Using all of the high availability options in MariaDB by MariaDB plc
Using all of the high availability options in MariaDBUsing all of the high availability options in MariaDB
Using all of the high availability options in MariaDB
MariaDB plc2.9K views
Spark with Delta Lake by Knoldus Inc.
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.295 views
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli... by Flink Forward
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Flink Forward699 views
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303) by Amazon Web Services
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
Amazon Web Services37.4K views
Iceberg + Alluxio for Fast Data Analytics by Alluxio, Inc.
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.530 views
Introduction to Apache Flink by datamantra
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra5.2K views
ksqlDB: Building Consciousness on Real Time Events by confluent
ksqlDB: Building Consciousness on Real Time EventsksqlDB: Building Consciousness on Real Time Events
ksqlDB: Building Consciousness on Real Time Events
confluent779 views
Making Apache Spark Better with Delta Lake by Databricks
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks5.4K views
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... by Flink Forward
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward3.2K views

Viewers also liked

Architecture Big Data open source S.M.A.C.K by
Architecture Big Data open source S.M.A.C.KArchitecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.KJulien Anguenot
3.3K views40 slides
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
83.3K views77 slides
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... by
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
71.3K views30 slides
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization by
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
72.3K views28 slides
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... by
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
86.2K views91 slides
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala by
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
75K views100 slides

Viewers also liked(20)

Architecture Big Data open source S.M.A.C.K by Julien Anguenot
Architecture Big Data open source S.M.A.C.KArchitecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.K
Julien Anguenot3.3K views
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by Helena Edelson
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson83.3K views
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... by Anton Kirillov
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov71.3K views
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization by Patrick Di Loreto
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Patrick Di Loreto72.3K views
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... by Helena Edelson
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson86.2K views
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala by Helena Edelson
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson75K views
Reactive dashboard’s using apache spark by Rahul Kumar
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
Rahul Kumar37.5K views
Spark with Cassandra by Christopher Batey by Spark Summit
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
Spark Summit3.5K views
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark by Todd Fritz
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz2.9K views
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra by Natalino Busa
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa64.9K views
IBM Transforming Customer Relationships Through Predictive Analytics by SFIMA
IBM Transforming Customer Relationships Through Predictive AnalyticsIBM Transforming Customer Relationships Through Predictive Analytics
IBM Transforming Customer Relationships Through Predictive Analytics
SFIMA861 views
[若渴計畫2015.8.18] SMACK by Aj MaChInE
[若渴計畫2015.8.18] SMACK[若渴計畫2015.8.18] SMACK
[若渴計畫2015.8.18] SMACK
Aj MaChInE1.2K views
Jaws - Data Warehouse with Spark SQL by Ema Orhian by Spark Summit
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Spark Summit2.7K views
Highlights and Challenges from Running Spark on Mesos in Production by Morri ... by Spark Summit
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit1.6K views
Production Readiness Testing At Salesforce Using Spark MLlib by Spark Summit
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlib
Spark Summit949 views
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar by Spark Summit
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Spark Summit3.7K views
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an... by DataStax
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
DataStax999 views
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai... by Spark Summit
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit3.3K views
Spark Summit EU 2015: SparkUI visualization: a lens into your application by Databricks
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
Databricks6.5K views
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &... by DataStax
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
DataStax1.7K views

Similar to Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

Streaming Analytics with Spark, Kafka, Cassandra and Akka by
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
52.1K views84 slides
Streaming Big Data & Analytics For Scale by
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
5.2K views77 slides
Sa introduction to big data pipelining with cassandra & spark west mins... by
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
34.4K views44 slides
Using the SDACK Architecture to Build a Big Data Product by
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
2K views67 slides
BBL KAPPA Lesfurets.com by
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comCedric Vidal
9 views22 slides
Spark and Couchbase: Augmenting the Operational Database with Spark by
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
781 views37 slides

Similar to Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson(20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson52.1K views
Streaming Big Data & Analytics For Scale by Helena Edelson
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
Helena Edelson5.2K views
Sa introduction to big data pipelining with cassandra & spark west mins... by Simon Ambridge
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge34.4K views
Using the SDACK Architecture to Build a Big Data Product by Evans Ye
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
Evans Ye2K views
Spark and Couchbase: Augmenting the Operational Database with Spark by Spark Summit
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark Summit781 views
20160331 sa introduction to big data pipelining berlin meetup 0.3 by Simon Ambridge
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
Simon Ambridge389 views
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ... by Fwdays
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays515 views
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K... by Lviv Startup Club
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Lviv Startup Club1.4K views
Customer Education Webcast: New Features in Data Integration and Streaming CDC by Precisely
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Precisely111 views
Data Pipelines with Spark & DataStax Enterprise by DataStax
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
DataStax2.2K views
Building Event Streaming Architectures on Scylla and Kafka by ScyllaDB
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
ScyllaDB866 views
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat... by DataStax Academy
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
DataStax Academy3.3K views
Cloud Lambda Architecture Patterns by Asis Mohanty
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
Asis Mohanty121 views
Big Data_Architecture.pptx by betalab
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab9 views

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
6.8K views32 slides
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
4.3K views34 slides
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
3.1K views21 slides
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra by
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
3.1K views21 slides
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... by
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
3.4K views56 slides
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
1.7K views90 slides

More from Spark Summit(20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by Spark Summit
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit6.8K views
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit4.3K views
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by Spark Summit
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit3.1K views
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit3.1K views
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... by Spark Summit
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit3.4K views
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by Spark Summit
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit1.7K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit2.1K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit1.1K views
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... by Spark Summit
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit2.8K views
Next CERN Accelerator Logging Service with Jakub Wozniak by Spark Summit
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit2.3K views
Powering a Startup with Apache Spark with Kevin Kim by Spark Summit
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit787 views
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit730 views
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... by Spark Summit
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit1.1K views
How Nielsen Utilized Databricks for Large-Scale Research and Development with... by Spark Summit
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit999 views
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... by Spark Summit
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit2.3K views
Goal Based Data Production with Sim Simeonov by Spark Summit
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit746 views
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... by Spark Summit
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit1.5K views
Getting Ready to Use Redis with Apache Spark with Dvir Volk by Spark Summit
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit3K views
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... by Spark Summit
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit915 views
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... by Spark Summit
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit1.7K views

Recently uploaded

CRIJ4385_Death Penalty_F23.pptx by
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
7 views24 slides
AvizoImageSegmentation.pptx by
AvizoImageSegmentation.pptxAvizoImageSegmentation.pptx
AvizoImageSegmentation.pptxnathanielbutterworth1
6 views14 slides
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f... by
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...DataScienceConferenc1
5 views18 slides
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... by
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...DataScienceConferenc1
5 views18 slides
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx by
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptxDataScienceConferenc1
5 views21 slides
CRM stick or twist workshop by
CRM stick or twist workshopCRM stick or twist workshop
CRM stick or twist workshopinfo828217
12 views16 slides

Recently uploaded(20)

CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f... by DataScienceConferenc1
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... by DataScienceConferenc1
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx by DataScienceConferenc1
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
CRM stick or twist workshop by info828217
CRM stick or twist workshopCRM stick or twist workshop
CRM stick or twist workshop
info82821712 views
SUPER STORE SQL PROJECT.pptx by khan888620
SUPER STORE SQL PROJECT.pptxSUPER STORE SQL PROJECT.pptx
SUPER STORE SQL PROJECT.pptx
khan88862013 views
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
Product Research sample.pdf by AllenSingson
Product Research sample.pdfProduct Research sample.pdf
Product Research sample.pdf
AllenSingson29 views
DGST Methodology Presentation.pdf by maddierlegum
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdf
maddierlegum5 views
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf by DataScienceConferenc1
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
Data Journeys Hard Talk workshop final.pptx by info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 views
Best Home Security Systems.pptx by mogalang
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptx
mogalang7 views
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... by DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
Listed Instruments Survey 2022.pptx by secretariat4
Listed Instruments Survey  2022.pptxListed Instruments Survey  2022.pptx
Listed Instruments Survey 2022.pptx
secretariat431 views
Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0119 views
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines by DataScienceConferenc1
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... by DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

  • 1. Streaming Analytics with Spark, Kafka, Cassandra, and Akka Helena Edelson VP of Product Engineering @Tuplejump
  • 2. • Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration • VP of Product Engineering @Tuplejump • Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource Who @helenaedelson github.com/helena
  • 3. Tuplejump Open Source github.com/tuplejump • FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads • Calliope - the first Spark-Cassandra integration • Stargate - Lucene indexer for Cassandra • SnackFS - HDFS-compatible file system for Cassandra 3
  • 4. What Will We Talk About • The Problem Domain • Example Use Case • Rethinking Architecture – We don't have to look far to look back – Streaming – Revisiting the goal and the stack – Simplification
  • 5. THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data 5
  • 6. The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures. 6
  • 7. Translation How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows 7
  • 8. How Much Data Yottabyte = quadrillion gigabytes or septillion bytes 8 We all have a lot of data • Terabytes • Petabytes... https://en.wikipedia.org/wiki/Yottabyte
  • 9. Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream
  • 10. While We Monitor, Predict & Proactively Handle • Massive event spikes • Bursty traffic • Fast producers / slow consumers • Network partitioning & Out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not loose data?
  • 11. And stay within our AWS / Rackspace budget
  • 13. 13 • Track activities of international threat actor groups, nation-state, criminal or hactivist • Intrusion attempts • Actual breaches • Profile adversary activity • Analysis to understand their motives, anticipate actions and prevent damage Adversary Profiling & Hunting
  • 14. 14 • Machine events • Endpoint intrusion detection • Anomalies/indicators of attack or compromise • Machine learning • Training models based on patterns from historical data • Predict potential threats • profiling for adversary Identification • Stream Processing
  • 15. Data Requirements & Description • Streaming event data • Log messages • User activity records • System ops & metrics data • Disparate data sources • Wildly differing data structures 15
  • 16. Massive Amounts Of Data 16 • One machine can generate 2+ TB per day • Tracking millions of devices • 1 million writes per second - bursty • High % writes, lower % reads • TTL
  • 18. WE DON'T HAVE TO LOOK FAR TO LOOK BACK 18 Rethinking Architecture
  • 19. 19 Most batch analytics flow from several years ago looked like...
  • 20. STREAMING & DATA SCIENCE 20 Rethinking Architecture
  • 21. Streaming I need fast access to historical data on the fly for predictive modeling with real time data from the stream. 21
  • 22. Not A Stream, A Flood • Data emitters • Netflix: 1 - 2 million events per second at peak • 750 billion events per day • LinkedIn: > 500 billion events per day • Data ingesters • Netflix: 50 - 100 billion events per day • LinkedIn: 2.5 trillion events per day • 1 Petabyte of streaming data 22
  • 23. Which Translates To • Do it fast • Do it cheap • Do it at scale 23
  • 24. Challenges • Code changes at runtime • Distributed Data Consistency • Ordering guarantees • Complex compute algorithms 24
  • 25. Oh, and don't lose data 25
  • 26. Strategies • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management 26 • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
  • 27. AND THEN WE GREEKED OUT 27 Rethinking Architecture
  • 28. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. 28
  • 29. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • An approach • Coined by Nathan Marz • This was a huge stride forward 29
  • 32. Implementing Is Hard 32 • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places
  • 33. Performance Tuning & Monitoring: on so many systems 33 Also hard
  • 34. Lambda Architecture An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. 34
  • 36. Which Translates To • Performing analytical computations & queries in dual systems • Implementing transformation logic twice • Duplicate Code • Spaghetti Architecture for Data Flows • One Busy Network 36
  • 37. Why Dual Systems? • Why is a separate batch system needed? • Why support code, machines and running services of two analytics systems? 37 Counter productive on some level?
  • 38. YES 38 • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda- architecture.html - Jay Kreps
  • 40. Extract, Transform, Load (ETL) 40 "Designing and maintaining the ETL process is often considered one of the most difficult and resource- intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
  • 41. Extract, Transform, Load (ETL) 41 ETL involves • Extraction of data from one system into another • Transforming it • Loading it into another system
  • 42. Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource- intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm 42 Also unnecessarily redundant and often typeless
  • 43. ETL 43 • Each ETL step can introduce errors and risk • Can duplicate data after failover • Tools can cost millions of dollars • Decreases throughput • Increased complexity
  • 44. ETL • Writing intermediary files • Parsing and re-parsing plain text 44
  • 45. And let's duplicate the pattern over all our DataCenters 45
  • 46. 46 These are not the solutions you're looking for
  • 47. REVISITING THE GOAL & THE STACK 47
  • 48. Removing The 'E' in ETL Thanks to technologies like Avro and Protobuf we don’t need the “E” in ETL. Instead of text dumps that you need to parse over multiple systems: Scala & Avro (e.g.) • Can work with binary data that remains strongly typed • A return to strong typing in the big data ecosystem 48
  • 49. Removing The 'L' in ETL If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load". • From there each consumer can do their own transformations 49
  • 52. Strategy Technologies Scalable Infrastructure / Elastic Spark, Cassandra, Kafka Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster Replicate For Resiliency Spark,Cassandra, Akka Cluster all hash the node ring Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence Failure Detection Cassandra, Spark, Akka, Kafka Consensus & Gossip Cassandra & Akka Cluster Parallelism Spark, Cassandra, Kafka, Akka Asynchronous Data Passing Kafka, Akka, Spark Fast, Low Latency, Data Locality Cassandra, Spark, Kafka Location Transparency Akka, Spark, Cassandra, Kafka My Nerdy Chart 52
  • 53. SMACK • Scala/Spark • Mesos • Akka • Cassandra • Kafka 53
  • 55. Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage 55
  • 56. How do I merge historical data with data in the stream? 56
  • 57. Join Streams With Static Data val ssc = new StreamingContext(conf, Milliseconds(500)) ssc.checkpoint("checkpoint") val staticData: RDD[(Int,String)] = ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func) val stream: DStream[(Int,String)] = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n)) .transform { events => events.join(staticData)) .saveToCassandra(keyspace,table) ssc.start() 57
  • 58. Training Data Feature Extraction Model Training Model Testing Test Data Your Data Extract Data To Analyze Train your model to predict Spark MLLib 58
  • 59. Spark Streaming & ML 59 val context = new StreamingContext(conf, Milliseconds(500)) val model = KMeans.train(dataset, ...) // learn offline val stream = KafkaUtils .createStream(ssc, zkQuorum, group,..) .map(event => model.predict(event.feature))
  • 60. Apache Mesos Open-source cluster manager developed at UC Berkeley. Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. 60
  • 61. Akka High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams 61
  • 62. Akka Actors A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance 62
  • 64. import akka.actor._ class NodeGuardianActor(args...) extends Actor with SupervisorStrategy { val temperature = context.actorOf( Props(new TemperatureActor(args)), "temperature") val precipitation = context.actorOf( Props(new PrecipitationActor(args)), "precipitation") override def preStart(): Unit = { /* lifecycle hook: init */ } def receive : Actor.Receive = { case Initialized => context become initialized } def initialized : Actor.Receive = { case e: SomeEvent => someFunc(e) case e: OtherEvent => otherFunc(e) } } 64
  • 65. Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication 65
  • 66. Apache Cassandra • Very flexible data modeling (collections, user defined types) and changeable over time • Perfect for ingestion of real time / machine data • Huge community 66
  • 67. Spark Cassandra Connector • NOSQL JOINS! • Write & Read data between Spark and Cassandra • Compatible with Spark 1.4 • Handles Data Locality for Speed • Implicit type conversions • Server-Side Filtering - SELECT, WHERE, etc. • Natural Timeseries Integration 67 http://github.com/datastax/spark-cassandra-connector
  • 68. KillrWeather 68 http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/ databricks/apps/weather
  • 69. 69 • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Support Massive Number of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
  • 70. Spark Streaming & Kafka val context = new StreamingContext(conf, Seconds(1)) val wordCount = KafkaUtils.createStream(context, ...) .flatMap(_.split(" ")) .map(x => (x, 1)) .reduceByKey(_ + _) wordCount.saveToCassandra(ks,table) context.start() // start receiving and computing 70
  • 71. 71 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext) extends AggregationActor(settings: Settings) {
 import settings._ 
 val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
 .map(_._2.split(","))
 .map(RawWeatherData(_))
 
 kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)
 /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
 kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
 .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 
 /** Now the [[StreamingContext]] can be started. */
 context.parent ! OutputStreamInitialized
 
 def receive : Actor.Receive = {…} } Gets the partition key: Data Locality Spark C* Connector feeds this to Spark Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
  • 72. 72 /** For a given weather station, calculates annual cumulative precip - or year to date. */
 class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
 
 def receive : Actor.Receive = {
 case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
 case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
 }
 
 /** Computes annual aggregation.Precipitation values are 1 hour deltas from the previous. */
 def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync()
 .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
 
 /** Returns the 10 highest temps for any station in the `year`. */
 def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
 val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
 ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
 
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync().map(toTopK) pipeTo requester
 }
 }
  • 73. A New Approach • One Runtime: streaming, scheduled • Simplified architecture • Allows us to • Write different types of applications • Write more type safe code • Write more reusable code 73
  • 74. Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
  • 75. FiloDB Distributed, columnar database designed to run very fast analytical queries • Ingest streaming data from many streaming sources • Row-level, column-level operations and built in versioning offer greater flexibility than file-based technologies • Currently based on Apache Cassandra & Spark • github.com/tuplejump/FiloDB 75
  • 76. FiloDB • Breakthrough performance levels for analytical queries • Performance comparable to Parquet • One to two orders of magnitude faster than Spark on Cassandra 2.x • Versioned - critical for reprocessing logic/code changes • Can simplify your infrastructure dramatically • Queries run in parallel in Spark for scale-out ad-hoc analysis • Space-saving techniques 76
  • 78. Architectyr? 78 "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
  • 80. 80
  • 83. I'm speaking at QCon SF on the broader topic of Streaming at Scale http://qconsf.com/sf2015/track/streaming-data-scale 83