Streaming Analytics with Spark,
Kafka, Cassandra, and Akka
Helena Edelson
VP of Product Engineering @Tuplejump
Who
• Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration
• VP of Product Engineering @Tuplejump
• Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource
@helenaedelson
github.com/helena
Tuplejump
Tuplejump Data Blender combines sophisticated data collection
with machine learning and analytics, to understand the intention of
the analyst, without disrupting workflow.

• Ingest streaming and static data from disparate data sources

• Combine them into a unified, holistic view 

• Easily enable fast, flexible and advanced data analysis
Tuplejump Open Source
github.com/tuplejump
• FiloDB - distributed, versioned, columnar analytical db for modern
streaming workloads
• Calliope - the first Spark-Cassandra integration
• Stargate - Lucene indexer for Cassandra
• SnackFS - HDFS-compatible file system for Cassandra
What Will We Talk About
• The Problem Domain
• Example Use Case
• Rethinking Architecture
– We don't have to look far to look back
– Streaming
– Revisiting the goal and the stack
– Simplification
THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data
The Problem Domain
Need to build scalable, fault tolerant, distributed data
processing systems that can handle massive amounts of
data from disparate sources, with different data structures.
Translation
How to build adaptable, elegant systems
for complex analytics and learning tasks
to run as large-scale clustered dataflows
How Much Data
Yottabyte = a quadrillion gigabytes, or a septillion (10^24) bytes
We all have a lot of data
• Terabytes
• Petabytes...
https://en.wikipedia.org/wiki/Yottabyte
Delivering Meaning
• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical from the stream
While We Monitor, Predict & Proactively Handle
• Massive event spikes
• Bursty traffic
• Fast producers / slow consumers
• Network partitioning & Out of sync systems
• DC down
• Wait, we've DDOS'd ourselves from fast streams?
• Autoscale issues
– When we scale down VMs how do we not lose data?
And stay within our AWS / Rackspace budget
EXAMPLE CASE:
CYBER SECURITY
Hunting The Hunter
Adversary Profiling & Hunting
• Track activities of international threat actor groups, nation-state, criminal or hacktivist
• Intrusion attempts
• Actual breaches
• Profile adversary activity
• Analysis to understand their motives, anticipate actions and prevent damage
Stream Processing
• Machine events
• Endpoint intrusion detection
• Anomalies/indicators of attack or compromise
• Machine learning
• Training models based on patterns from historical data
• Predict potential threats
• Profiling for adversary identification
Data Requirements & Description
• Streaming event data
• Log messages
• User activity records
• System ops & metrics data
• Disparate data sources
• Wildly differing data structures
Massive Amounts Of Data
• One machine can generate 2+ TB per day
• Tracking millions of devices
• 1 million writes per second - bursty
• High % writes, lower % reads
• TTL
RETHINKING
ARCHITECTURE
WE DON'T HAVE TO LOOK
FAR TO LOOK BACK
Rethinking Architecture
Most batch analytics flows from several years ago looked like...
STREAMING & DATA SCIENCE
Rethinking Architecture
Streaming
I need fast access to historical data on the fly for
predictive modeling with real time data from the stream.
Not A Stream, A Flood
• Data emitters
• Netflix: 1 - 2 million events per second at peak
• 750 billion events per day
• LinkedIn: > 500 billion events per day
• Data ingesters
• Netflix: 50 - 100 billion events per day
• LinkedIn: 2.5 trillion events per day
• 1 Petabyte of streaming data
Which Translates To
• Do it fast
• Do it cheap
• Do it at scale
Challenges
• Code changes at runtime
• Distributed Data Consistency
• Ordering guarantees
• Complex compute algorithms
Oh, and don't lose data
Strategies
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing at runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
AND THEN WE GREEKED OUT
Rethinking Architecture
Lambda Architecture
A data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch and
stream processing methods.
• An approach
• Coined by Nathan Marz
• This was a huge stride forward
https://www.mapr.com/developercentral/lambda-architecture
Implementing Is Hard
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
Also hard: Performance Tuning & Monitoring on so many systems
Lambda Architecture
An immutable sequence of records is captured and fed
into a batch system and a stream processing
system in parallel.
WAIT, DUAL SYSTEMS?
Challenge Assumptions
Which Translates To
• Performing analytical computations & queries in dual
systems
• Implementing transformation logic twice
• Duplicate Code
• Spaghetti Architecture for Data Flows
• One Busy Network
Why Dual Systems?
• Why is a separate batch system needed?
• Why support code, machines and running services of
two analytics systems?
Counterproductive on some level?
YES
• A unified system for streaming and batch
• Real-time processing and reprocessing
• Code changes
• Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
ANOTHER ASSUMPTION:
ETL
Challenge Assumptions
Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often
considered one of the most difficult and resource-
intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
Extract, Transform, Load (ETL)
ETL involves
• Extraction of data from one system into another
• Transforming it
• Loading it into another system
Also unnecessarily redundant and often typeless
ETL
• Each ETL step can introduce errors and risk
• Can duplicate data after failover
• Tools can cost millions of dollars
• Decreases throughput
• Increased complexity
ETL
• Writing intermediary files
• Parsing and re-parsing plain text
And let's duplicate the pattern
over all our DataCenters
These are not the solutions you're looking for
REVISITING THE GOAL
& THE STACK
Removing The 'E' in ETL
Thanks to technologies like Avro and Protobuf we don’t need the
“E” in ETL. Instead of text dumps that you need to parse over
multiple systems:
Scala & Avro (e.g.)

• Can work with binary data that remains strongly typed

• A return to strong typing in the big data ecosystem
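To make the point about strongly typed binary data concrete, here is a minimal sketch. It does not use Avro or Protobuf: the JDK's `DataOutputStream` and `DataInputStream` stand in for them, and the `Reading` record and `TypedBinary` object are hypothetical names. The idea it illustrates is the same, though: fields travel with their types, so no consumer has to parse text.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical record type, used only for illustration.
final case class Reading(sensorId: Int, timestamp: Long, value: Double)

object TypedBinary {
  // Write each field in its binary form, in a fixed order.
  def encode(r: Reading): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(r.sensorId)
    out.writeLong(r.timestamp)
    out.writeDouble(r.value)
    out.flush()
    bytes.toByteArray
  }

  // Read the fields back in the same order; no string parsing involved.
  def decode(payload: Array[Byte]): Reading = {
    val in = new DataInputStream(new ByteArrayInputStream(payload))
    Reading(in.readInt(), in.readLong(), in.readDouble())
  }
}
```

Unlike Avro or Protobuf, this sketch carries no schema and has no story for schema evolution; it only shows the typed round trip.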
Removing The 'L' in ETL
If data collection is backed by a distributed messaging
system (e.g. Kafka) you can do real-time fanout of the
ingested data to all consumers. No need to batch "load".
• From there each consumer can do their own transformations
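The fan-out idea can be sketched with a small in-memory model. This is not Kafka's API: `EventLog` is a hypothetical stand-in for a single partition of a log, and consumer names are plain strings. It only illustrates why no batch "load" step is needed: events are appended once, and each consumer reads from its own offset.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for one partition of a log such as Kafka.
final class EventLog[A] {
  private val entries = mutable.ArrayBuffer.empty[A]
  private val offsets = mutable.Map.empty[String, Int]

  // Producers append once; nothing is copied per consumer.
  def append(event: A): Unit = { entries += event; () }

  // Each consumer reads from its own offset, so consumers never affect each other.
  def poll(consumer: String): Vector[A] = {
    val from = offsets.getOrElse(consumer, 0)
    val batch = entries.slice(from, entries.length).toVector
    offsets(consumer) = entries.length
    batch
  }
}
```

A consumer that starts late, such as an archiver, still sees every event from the beginning, while a faster consumer only sees what is new since its last poll.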
#NoMoreGreekLetterArchitectures
NoETL
My Nerdy Chart
Strategy                                      | Technologies
Scalable Infrastructure / Elastic             | Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware   | Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency                      | Spark, Cassandra, Akka Cluster (all hash the node ring)
Share Nothing, Masterless                     | Cassandra, Akka Cluster (both Dynamo style)
Fault Tolerance / No Single Point of Failure  | Spark, Cassandra, Kafka
Replay From Any Point Of Failure              | Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection                             | Cassandra, Spark, Akka, Kafka
Consensus & Gossip                            | Cassandra & Akka Cluster
Parallelism                                   | Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing                     | Kafka, Akka, Spark
Fast, Low Latency, Data Locality              | Cassandra, Spark, Kafka
Location Transparency                         | Akka, Spark, Cassandra, Kafka
SMACK
• Scala/Spark
• Mesos
• Akka
• Cassandra
• Kafka
Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to
disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
How do I merge historical data with data
in the stream?
Join Streams With Static Data
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.checkpoint("checkpoint")

// Static reference data as (key, value) pairs; func parses each line.
val staticData: RDD[(Int, String)] =
  ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)

// Join each micro-batch with the static RDD by key, then persist the result.
// The stream must be keyed like staticData for the join to type-check.
KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))
  .transform { events => events.join(staticData) }
  .saveToCassandra(keyspace, table)

ssc.start()
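What the per-batch join computes can be shown without a cluster. The sketch below uses plain Scala collections rather than Spark, and `BatchJoin` is a hypothetical name. It mirrors the result shape of a keyed join, `(K, V)` joined with `(K, W)` giving `(K, (V, W))`, with the simplification that the reference data holds one value per key.

```scala
object BatchJoin {
  // Inner join of one micro-batch with static reference data, by key.
  // Events whose key has no reference entry are dropped, as in an inner join.
  def join[K, V, W](batch: Seq[(K, V)], reference: Map[K, W]): Seq[(K, (V, W))] =
    for {
      (key, value) <- batch
      ref <- reference.get(key).toList
    } yield (key, (value, ref))
}
```

For example, joining the batch `Seq(1 -> "rain", 3 -> "snow", 2 -> "fog")` against reference data for keys 1 and 2 keeps the two matching events and drops the event with key 3.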
Spark MLlib
Your Data: Extract Data To Analyze, Train your model to predict
[Diagram: Training Data → Feature Extraction → Model Training → Model Testing (with Test Data)]
Spark Streaming & ML
val ssc = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline
val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group, ...)
  .map(event => model.predict(event.feature)) // score each event online
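For readers unfamiliar with what the online `predict` step does for a k-means model, here is a minimal sketch in plain Scala. It is not MLlib code: `NearestCentroid` and `Point` are hypothetical names, and the centroids are assumed to have been learned offline already. Prediction is simply the index of the closest centroid.

```scala
object NearestCentroid {
  type Point = Vector[Double]

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid; assumes at least one centroid is given.
  def predict(centroids: Vector[Point], point: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), point))
}
```

Because prediction is a cheap, stateless function of the event, it fits naturally inside a `map` over the stream, while the expensive training stays offline.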
Apache Mesos
Open-source cluster manager developed at UC Berkeley.
Abstracts CPU, memory, storage, and other compute resources
away from machines (physical or virtual), enabling fault-tolerant
and elastic distributed systems to easily be built and run
effectively.
Akka
High performance concurrency framework for Scala and
Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
Akka Actors
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
Akka Actor Hierarchy
http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala
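import akka.actor._

// SupervisorStrategy is a class, not a trait, so it cannot be mixed in;
// the default strategy applies unless `supervisorStrategy` is overridden.
class NodeGuardianActor(args...) extends Actor {
  // Children created here are supervised by this actor.
  val temperature = context.actorOf(
    Props(new TemperatureActor(args)), "temperature")
  val precipitation = context.actorOf(
    Props(new PrecipitationActor(args)), "precipitation")

  override def preStart(): Unit = { /* lifecycle hook: init */ }

  // Initial behavior: wait for initialization, then switch.
  def receive : Actor.Receive = {
    case Initialized => context become initialized
  }

  // Behavior after initialization.
  def initialized : Actor.Receive = {
    case e: SomeEvent => someFunc(e)
    case e: OtherEvent => otherFunc(e)
  }
}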
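The behavior switch done by `context become initialized` can be modeled in plain Scala. The sketch below does not use Akka: `Guardian`, `Message` and `Event` are hypothetical names, and messages the current behavior does not handle are simply ignored. It shows the core idea, that an actor's message handler is a partial function that can be swapped at runtime.

```scala
import scala.collection.mutable

// Hypothetical message protocol, for illustration only.
sealed trait Message
case object Initialized extends Message
final case class Event(payload: String) extends Message

// Plain-Scala model of behavior switching; no Akka involved.
final class Guardian {
  type Receive = PartialFunction[Message, Unit]

  val handled = mutable.ArrayBuffer.empty[String]

  // Before initialization only `Initialized` is understood.
  private def uninitialized: Receive = {
    case Initialized => behavior = initialized
  }

  // After initialization events are processed; `Initialized` is no longer handled.
  private def initialized: Receive = {
    case Event(payload) => handled += payload; ()
  }

  private var behavior: Receive = uninitialized

  // Returns true if the current behavior handled the message.
  def receive(msg: Message): Boolean =
    if (behavior.isDefinedAt(msg)) { behavior(msg); true } else false
}
```

An event sent before `Initialized` is not handled, while the same event sent afterwards is, which is the guarantee the two-stage `receive` / `initialized` pattern provides.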
Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication
Apache Cassandra
• Very flexible data modeling (collections, user defined
types) and changeable over time
• Perfect for ingestion of real time / machine data
• Huge community
Spark Cassandra Connector
• NOSQL JOINS!
• Write & Read data between Spark and Cassandra
• Compatible with Spark 1.4
• Handles Data Locality for Speed
• Implicit type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration
http://github.com/datastax/spark-cassandra-connector
KillrWeather
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and
batch data processing with Apache Spark Streaming, Apache
Cassandra, Apache Kafka and Akka for fast, streaming computations
on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Support Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
Spark Streaming & Kafka
val context = new StreamingContext(conf, Seconds(1))
val wordCount = KafkaUtils.createStream(context, ...)
  .map(_._2)               // keep the message, drop the Kafka key
  .flatMap(_.split(" "))
  .map(x => (x, 1))
  .reduceByKey(_ + _)      // counts per batch interval
wordCount.saveToCassandra(ks, table)
context.start() // start receiving and computing
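The same pipeline shape can be tried on a single batch with plain Scala collections. This sketch is not Spark code: `WordCount` is a hypothetical name, and `groupBy` plus a local sum stands in for the shuffle that `reduceByKey` performs across the cluster.

```scala
object WordCount {
  // Counts words in one batch of messages.
  def count(messages: Seq[String]): Map[String, Int] =
    messages
      .flatMap(_.split(" "))        // message => words
      .filter(_.nonEmpty)           // ignore empty tokens from blank input
      .map(word => (word, 1))       // word => (word, 1)
      .groupBy(_._1)                // local stand-in for the reduceByKey shuffle
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

In the streaming job each batch interval produces one such result, which is then written to Cassandra.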
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  // Receive raw comma-separated lines from Kafka and parse each into RawWeatherData.
  val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  // Persist the raw records.
  kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
  kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

  /** Now the [[StreamingContext]] can be started. */
  context.parent ! OutputStreamInitialized

  def receive : Actor.Receive = {…}
}
Gets the partition key: data locality. The Spark C* Connector feeds this to Spark.
Cassandra counter column in our schema, so no expensive `reduceByKey` is needed. Simply let C* do it: not expensive and fast.
import akka.pattern.pipe // provides pipeTo

/** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {

  def receive : Actor.Receive = {
    case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
  def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync()
      .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester

  /** Returns the `k` highest precipitation values for the station in the `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
      ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}
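The two aggregations the actor answers with can be sketched on plain collections. This is not the application's code: `PrecipitationMath` is a hypothetical name, and it assumes, as the comment above suggests, that the annual figure is the sum of the stored deltas.

```scala
object PrecipitationMath {
  // Year-to-date precipitation: the sum of the stored per-period values.
  def cumulative(values: Seq[Double]): Double = values.sum

  // The k largest values, highest first.
  def topK(values: Seq[Double], k: Int): Seq[Double] =
    values.sortWith(_ > _).take(k)
}
```

In the actor the same logic runs over rows that the server-side `where` clause has already narrowed to one station and year, so only a small result set reaches the driver.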
A New Approach
• One Runtime: streaming, scheduled
• Simplified architecture
• Allows us to
• Write different types of applications
• Write more type safe code
• Write more reusable code
Need daily analytics aggregate reports? Do it in the stream, save
results in Cassandra for easy reporting as needed - with data
locality not offered by S3.
FiloDB
Distributed, columnar database designed to run very fast
analytical queries
• Ingest streaming data from many streaming sources
• Row-level, column-level operations and built in versioning
offer greater flexibility than file-based technologies
• Currently based on Apache Cassandra & Spark
• github.com/tuplejump/FiloDB
FiloDB
• Breakthrough performance levels for analytical queries
• Performance comparable to Parquet
• One to two orders of magnitude faster than Spark on
Cassandra 2.x
• Versioned - critical for reprocessing logic/code changes
• Can simplify your infrastructure dramatically
• Queries run in parallel in Spark for scale-out ad-hoc analysis
• Space-saving techniques
WRAPPING UP
Architectyr?
"This is a giant mess"
- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
Simplified
www.tuplejump.com
info@tuplejump.com
@tuplejump
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
THANK YOU!
I'm speaking at QCon SF on the broader
topic of Streaming at Scale
http://qconsf.com/sf2015/track/streaming-data-scale
84

More Related Content

What's hot

What's hot (20)

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 

Viewers also liked

Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 

Viewers also liked (10)

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration Wars
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 

Similar to Streaming Analytics with Spark, Kafka, Cassandra and Akka

Similar to Streaming Analytics with Spark, Kafka, Cassandra and Akka (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
Apache Cassandra in the Real World
Apache Cassandra in the Real WorldApache Cassandra in the Real World
Apache Cassandra in the Real World
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success

Streaming Analytics with Spark, Kafka, Cassandra and Akka

  • 1. Streaming Analytics with Spark, Kafka, Cassandra, and Akka Helena Edelson VP of Product Engineering @Tuplejump
  • 2. Who • Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration • VP of Product Engineering @Tuplejump • Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource @helenaedelson github.com/helena
  • 3. Tuplejump Tuplejump Data Blender combines sophisticated data collection with machine learning and analytics, to understand the intention of the analyst, without disrupting workflow. • Ingest streaming and static data from disparate data sources • Combine them into a unified, holistic view • Easily enable fast, flexible and advanced data analysis
  • 4. Tuplejump Open Source github.com/tuplejump • FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads • Calliope - the first Spark-Cassandra integration • Stargate - Lucene indexer for Cassandra • SnackFS - HDFS-compatible file system for Cassandra
  • 5. What Will We Talk About • The Problem Domain • Example Use Case • Rethinking Architecture – We don't have to look far to look back – Streaming – Revisiting the goal and the stack – Simplification
  • 6. THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data
  • 7. The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures.
  • 8. Translation How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows
  • 9. How Much Data Yottabyte = quadrillion gigabytes or septillion bytes We all have a lot of data • Terabytes • Petabytes... https://en.wikipedia.org/wiki/Yottabyte
  • 10. Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream
  • 11. While We Monitor, Predict & Proactively Handle • Massive event spikes • Bursty traffic • Fast producers / slow consumers • Network partitioning & Out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not lose data?
  • 12. And stay within our AWS / Rackspace budget
  • 14. Adversary Profiling & Hunting • Track activities of international threat actor groups: nation-state, criminal or hacktivist • Intrusion attempts • Actual breaches • Profile adversary activity • Analysis to understand their motives, anticipate actions and prevent damage
  • 15. • Machine events • Endpoint intrusion detection • Anomalies/indicators of attack or compromise • Machine learning • Training models based on patterns from historical data • Predict potential threats • Profiling for adversary identification • Stream Processing
  • 16. Data Requirements & Description • Streaming event data • Log messages • User activity records • System ops & metrics data • Disparate data sources • Wildly differing data structures
  • 17. Massive Amounts Of Data • One machine can generate 2+ TB per day • Tracking millions of devices • 1 million writes per second - bursty • High % writes, lower % reads • TTL
  • 19. Rethinking Architecture: WE DON'T HAVE TO LOOK FAR TO LOOK BACK
  • 20. Most batch analytics flows from several years ago looked like...
  • 21. Rethinking Architecture: STREAMING & DATA SCIENCE
  • 22. Streaming I need fast access to historical data on the fly for predictive modeling with real time data from the stream.
  • 23. Not A Stream, A Flood • Data emitters • Netflix: 1 - 2 million events per second at peak • 750 billion events per day • LinkedIn: > 500 billion events per day • Data ingesters • Netflix: 50 - 100 billion events per day • LinkedIn: 2.5 trillion events per day • 1 Petabyte of streaming data
  • 24. Which Translates To • Do it fast • Do it cheap • Do it at scale
  • 25. Challenges • Code changes at runtime • Distributed Data Consistency • Ordering guarantees • Complex compute algorithms
  • 26. Oh, and don't lose data
  • 27. Strategies • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
  • 28. Rethinking Architecture: AND THEN WE GREEKED OUT
  • 29. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
  • 30. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • An approach • Coined by Nathan Marz • This was a huge stride forward
  • 33. Implementing Is Hard • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places
  • 34. Performance Tuning & Monitoring: on so many systems. Also hard
  • 35. Lambda Architecture An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel.
  • 37. Which Translates To • Performing analytical computations & queries in dual systems • Implementing transformation logic twice • Duplicate Code • Spaghetti Architecture for Data Flows • One Busy Network
  • 38. Why Dual Systems? • Why is a separate batch system needed? • Why support code, machines and running services of two analytics systems? Counterproductive on some level?
  • 39. YES • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
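One way to picture "a unified system for streaming and batch" is a single aggregation function that both the live path and the reprocessing path call. The following is a minimal sketch in plain Scala collections rather than Spark; `Event`, `UnifiedLogic` and the method names are hypothetical and exist only for illustration.

```scala
// Hypothetical event type, for illustration only.
final case class Event(host: String, bytes: Long)

object UnifiedLogic {
  // The one place the aggregation logic lives.
  def totalBytesByHost(events: Iterable[Event]): Map[String, Long] =
    events.groupBy(_.host).map { case (host, es) => host -> es.map(_.bytes).sum }

  // Merge one partial result into the running totals.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  // "Streaming": apply the same function to each micro-batch and fold the results.
  def streaming(batches: Seq[Seq[Event]]): Map[String, Long] =
    batches.map(totalBytesByHost).foldLeft(Map.empty[String, Long])(merge)

  // "Reprocessing": replay the full retained log through the same function.
  def reprocess(log: Seq[Event]): Map[String, Long] = totalBytesByHost(log)
}
```

Because both paths share `totalBytesByHost`, a logic change is made once and a replay of the log produces the same totals as the incremental path.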
  • 41. Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
  • 42. Extract, Transform, Load (ETL) ETL involves • Extraction of data from one system into another • Transforming it • Loading it into another system
  • 43. Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm Also unnecessarily redundant and often typeless
  • 44. ETL • Each ETL step can introduce errors and risk • Can duplicate data after failover • Tools can cost millions of dollars • Decreases throughput • Increased complexity
  • 45. ETL • Writing intermediary files • Parsing and re-parsing plain text
  • 46. And let's duplicate the pattern over all our DataCenters
  • 47. These are not the solutions you're looking for
  • 48. REVISITING THE GOAL & THE STACK
  • 49. Removing The 'E' in ETL Thanks to technologies like Avro and Protobuf we don’t need the “E” in ETL. Instead of text dumps that you need to parse over multiple systems: Scala & Avro (e.g.) • Can work with binary data that remains strongly typed • A return to strong typing in the big data ecosystem
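To make "binary data that remains strongly typed" concrete, here is a minimal sketch that round-trips a case class through a compact binary form. It uses `java.io` data streams as a stand-in and is not Avro or Protobuf; the `Reading` type and its fields are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical record type; field names are illustrative.
final case class Reading(sensorId: Int, celsius: Double, label: String)

object TypedBinary {
  // Encode fields in a fixed order with their native types (no text formatting).
  def encode(r: Reading): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(r.sensorId)
    out.writeDouble(r.celsius)
    out.writeUTF(r.label)
    out.flush()
    bytes.toByteArray
  }

  // Decode in the same order; the result is a typed value, not a string to re-parse.
  def decode(data: Array[Byte]): Reading = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    Reading(in.readInt(), in.readDouble(), in.readUTF())
  }
}
```

A real schema format adds what this sketch lacks, such as schema evolution and cross-language support, but the consumer-side benefit is the same: no text parsing between systems.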
  • 50. Removing The 'L' in ETL If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load". • From there each consumer can do their own transformations
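The fanout idea can be sketched with a toy in-memory log in which every consumer tracks its own read offset. This is an illustration of the concept only, not Kafka's API; `ToyLog` and the consumer names are hypothetical.

```scala
import scala.collection.mutable

// A toy in-memory log: records are appended once and never removed on read.
final class ToyLog[A] {
  private val records = mutable.ArrayBuffer.empty[A]
  private val offsets = mutable.Map.empty[String, Int] // consumer name -> next offset

  def append(record: A): Unit = records += record

  // Each consumer reads from its own offset, so every consumer sees every record.
  def poll(consumer: String): Seq[A] = {
    val from = offsets.getOrElse(consumer, 0)
    val batch = records.slice(from, records.length).toList
    offsets(consumer) = records.length
    batch
  }
}
```

A consumer that joins late, or one that falls behind, still receives the full sequence from its own offset, which is why no separate batch "load" step is needed.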
  • 53. My Nerdy Chart (Strategy: Technologies)
 Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
 Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
 Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
 Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo style)
 Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
 Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
 Failure Detection: Cassandra, Spark, Akka, Kafka
 Consensus & Gossip: Cassandra & Akka Cluster
 Parallelism: Spark, Cassandra, Kafka, Akka
 Asynchronous Data Passing: Kafka, Akka, Spark
 Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
 Location Transparency: Akka, Spark, Cassandra, Kafka
  • 54. SMACK • Scala/Spark • Mesos • Akka • Cassandra • Kafka
  • 56. Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage
  • 57. How do I merge historical data with data in the stream?
  • 58. Join Streams With Static Data
 val ssc = new StreamingContext(conf, Milliseconds(500))
 ssc.checkpoint("checkpoint")
 val staticData: RDD[(Int,String)] =
   ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)
 // Slide pseudocode: converting Kafka's (String, String) pairs to (Int, String) is elided
 val stream: DStream[(Int,String)] =
   KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))
 // Join each micro-batch with the static RDD, then write the joined rows to Cassandra
 stream.transform { events => events.join(staticData) }
   .saveToCassandra(keyspace,table)
 ssc.start()
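The join semantics can be shown without a cluster. The sketch below joins one micro-batch of keyed events against static reference data using plain Scala collections; it mirrors the shape of a pair-RDD inner join but is not Spark code, and `JoinSketch` and its sample data are hypothetical.

```scala
object JoinSketch {
  // Static reference data keyed by id (stands in for the static RDD).
  val staticData: Map[Int, String] = Map(1 -> "station-A", 2 -> "station-B")

  // Inner-join one micro-batch of (key, value) events against the static data,
  // producing (key, (eventValue, staticValue)) like a pair-RDD join.
  def joinBatch(events: Seq[(Int, String)]): Seq[(Int, (String, String))] =
    for {
      (key, value) <- events
      ref <- staticData.get(key).toSeq
    } yield (key, (value, ref))
}
```

Events whose key has no match in the static data are dropped, which is the inner-join behaviour; an outer join would keep them with an empty reference value.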
  • 59. Spark MLlib [diagram labels: Your Data, Extract Data To Analyze, Training Data, Test Data, Feature Extraction, Model Training, Model Testing, Train your model to predict]
  • 60. Spark Streaming & ML
 val context = new StreamingContext(conf, Milliseconds(500))
 val model = KMeans.train(dataset, ...) // learn offline
 val stream = KafkaUtils
   .createStream(context, zkQuorum, group,..)
   .map(event => model.predict(event.feature)) // score each event as it arrives
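The "learn offline, predict in the stream" split works because prediction with a trained k-means model is cheap: it is a nearest-centroid lookup. A minimal sketch in plain Scala, assuming the centroids have already been learned elsewhere; `NearestCentroid` is a hypothetical name and this is not the MLlib implementation.

```scala
object NearestCentroid {
  type Vec = Vector[Double]

  private def squaredDistance(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid: the prediction step of a trained k-means model.
  def predict(centroids: Seq[Vec], point: Vec): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), point))
}
```

Mapping `predict` over incoming events needs only the centroids in memory, so the expensive training step can run on historical data on its own schedule.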
  • 61. Apache Mesos Open-source cluster manager developed at UC Berkeley. Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
  • 62. Akka High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams
  • 63. Akka Actors A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance
  • 65. import akka.actor._
 class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {
   val temperature = context.actorOf(
     Props(new TemperatureActor(args)), "temperature")
   val precipitation = context.actorOf(
     Props(new PrecipitationActor(args)), "precipitation")
   override def preStart(): Unit = { /* lifecycle hook: init */ }
   def receive : Actor.Receive = {
     case Initialized => context become initialized
   }
   def initialized : Actor.Receive = {
     case e: SomeEvent => someFunc(e)
     case e: OtherEvent => otherFunc(e)
   }
 }
  • 66. Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication
  • 67. Apache Cassandra • Very flexible data modeling (collections, user defined types) and changeable over time • Perfect for ingestion of real time / machine data • Huge community
  • 68. Spark Cassandra Connector • NOSQL JOINS! • Write & Read data between Spark and Cassandra • Compatible with Spark 1.4 • Handles Data Locality for Speed • Implicit type conversions • Server-Side Filtering - SELECT, WHERE, etc. • Natural Timeseries Integration http://github.com/datastax/spark-cassandra-connector
  • 69. KillrWeather http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
  • 70. Apache Kafka • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Support Massive Number of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
  • 71. Spark Streaming & Kafka
 val context = new StreamingContext(conf, Seconds(1))
 val wordCount = KafkaUtils.createStream(context, ...)
   .flatMap(_.split(" "))
   .map(x => (x, 1))
   .reduceByKey(_ + _)
 wordCount.saveToCassandra(ks,table)
 context.start() // start receiving and computing
  • 72. class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext) extends AggregationActor(settings: Settings) {
 import settings._ 
 val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
 .map(_._2.split(","))
 .map(RawWeatherData(_))
 
 kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)
 /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
 kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
 .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 
 /** Now the [[StreamingContext]] can be started. */
 context.parent ! OutputStreamInitialized
 
 def receive : Actor.Receive = {…}
 }
 Gets the partition key: Data Locality.
 Spark C* Connector feeds this to Spark.
 Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
  • 73. /** For a given weather station, calculates annual cumulative precip - or year to date. */
 class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
 
 def receive : Actor.Receive = {
 case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
 case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
 }
 
 /** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
 def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync()
 .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
 
 /** Returns the `k` highest precipitation values for the station in the `year`. */
 def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
 val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
 ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
 
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync().map(toTopK) pipeTo requester
 }
 }
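The two aggregations can be checked in isolation from Spark and Cassandra. A minimal sketch over an in-memory sequence, assuming the rows for one station and year have already been collected; `PrecipSketch` is a hypothetical name.

```scala
object PrecipSketch {
  // Year-to-date total: the values are hourly deltas, so the annual figure is their sum.
  def cumulative(hourlyDeltas: Seq[Double]): Double = hourlyDeltas.sum

  // The k largest values in descending order, the same ordering RDD.top(k) returns.
  def topK(values: Seq[Double], k: Int): Seq[Double] =
    values.sortWith(_ > _).take(k)
}
```

In the actor above the same work is split differently: Cassandra filters by `wsid` and `year` server-side, and only the selected column is brought back for the final reduction.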
  • 74. A New Approach • One Runtime: streaming, scheduled • Simplified architecture • Allows us to • Write different types of applications • Write more type safe code • Write more reusable code
  • 75. Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
  • 76. FiloDB Distributed, columnar database designed to run very fast analytical queries • Ingest streaming data from many streaming sources • Row-level, column-level operations and built in versioning offer greater flexibility than file-based technologies • Currently based on Apache Cassandra & Spark • github.com/tuplejump/FiloDB
  • 77. FiloDB • Breakthrough performance levels for analytical queries • Performance comparable to Parquet • One to two orders of magnitude faster than Spark on Cassandra 2.x • Versioned - critical for reprocessing logic/code changes • Can simplify your infrastructure dramatically • Queries run in parallel in Spark for scale-out ad-hoc analysis • Space-saving techniques
  • 79. Architectyr? "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
  • 84. I'm speaking at QCon SF on the broader topic of Streaming at Scale http://qconsf.com/sf2015/track/streaming-data-scale