Laying down the SMACK on your data pipelines

@PatrickMcFadin
Patrick McFadin 
Chief Evangelist for Apache Cassandra

Spark
Mesos
Akka
Cassandra
Kafka

[Diagram: Kafka, Spark, Akka, and Cassandra clusters organize, process, and store data, all running on Mesos]

[Diagram: Cassandra, Akka, Spark, Kafka]

Managing Weather Data
Windsor California
67.3 F
Rainfall total: 1.2cm
Today:
High: 73.4F
Low : 51.4F
Yesterday:
High: 75.2F
Low : 52.3F
Our Magical App
Reactive and immediate
Batch

KillrWeather

https://github.com/killrweather/killrweather

Kafka decouples data pipelines

The problem
Customers to Kitchen: "Hamburger, please." "Meat disk on bread, please."

The problem
Customer: "Hamburger, please." -> Order -> Order Queue -> Kitchen

The problem
Customer: "Meat disk on bread, please." "You mean a hamburger?" "Uh, yeah. That." -> Order -> Order Queue -> Kitchen

Order from chaos
Producer -> Topic = Food -> Consumer
[Animation: the producer appends Order 1, Order 2, Order 3, Order 4, and Order 5 to the topic, one at a time and in sequence]

Scale
Producer -> Topic = Food -> Consumer
Topic = Hamburgers: Order 1, Order 2, Order 3, Order 4, Order 5
Topic = Pizza: Order 1, Order 2, Order 3, Order 4, Order 5

Kafka
Producer: Collection API
Broker
  Topic = Temperature: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
  Topic = Precipitation: Precip 1, Precip 2, Precip 3, Precip 4, Precip 5
Consumers: Temperature Processor, Precipitation Processor

Kafka
Producer: Collection API
Broker
  Topic = Temperature
    Partition 0: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
  Topic = Precipitation
    Partition 0: Precip 1, Precip 2, Precip 3, Precip 4, Precip 5
Consumers: Temperature Processor, Precipitation Processor

Kafka
Producer: Collection API
Broker
  Topic = Temperature
    Partition 0: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
    Partition 1: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
  Topic = Precipitation
    Partition 0: Precip 1, Precip 2, Precip 3, Precip 4, Precip 5
Consumers: Temperature Processor (x2), Precipitation Processor

Kafka
Producer: Collection API
Two Brokers, each showing:
  Topic = Temperature
    Partition 0: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
    Partition 1: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
  Topic = Precipitation
    Partition 0: Precip 1, Precip 2, Precip 3, Precip 4, Precip 5
Consumers: Temperature Processor (x2), Precipitation Processor
Labels: Topic Temperature, Topic Precipitation, Replication Factor = 2

Kafka
Producer: Collection API
Two Brokers, each showing:
  Topic = Temperature
    Partition 0: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
    Partition 1: Temp 1, Temp 2, Temp 3, Temp 4, Temp 5
  Topic = Precipitation
    Partition 0: Precip 1, Precip 2, Precip 3, Precip 4, Precip 5
Consumers: Temperature Processor (x4), Precipitation Processor (x2)
Labels: Topic Temperature, Topic Precipitation

Guarantees
Order
•Messages sent by a producer to a given topic partition are appended in the order they are sent
•Consumers see messages in the order they are stored in the partition's log
Durability
•Messages are delivered at least once
•With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
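
The ordering and durability guarantees above can be illustrated with a toy in-memory model. This is only a sketch, not the Kafka client API: `ToyTopic`, `Message`, and `ToyKafka` are hypothetical names, and routing by `hashCode` is an assumed stand-in for a real partitioner.

```scala
// Toy in-memory model of a partitioned topic; NOT the Kafka client API.
final case class Message(key: String, value: String)

final class ToyTopic(numPartitions: Int) {
  require(numPartitions > 0, "a topic needs at least one partition")

  // One append-only log per partition.
  private val logs =
    Vector.fill(numPartitions)(scala.collection.mutable.ArrayBuffer.empty[Message])

  // Assumption: route by a hash of the key, so one key always maps to one partition.
  def partitionFor(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Append to the end of the chosen partition and return (partition, offset).
  def send(msg: Message): (Int, Int) = {
    val p = partitionFor(msg.key)
    logs(p) += msg
    (p, logs(p).size - 1)
  }

  // Consumers read a partition front to back, i.e. in append order.
  def read(partition: Int): List[Message] = logs(partition).toList
}

object ToyKafka {
  // With replication factor n, up to n - 1 server failures can be tolerated.
  def toleratedFailures(replicationFactor: Int): Int = {
    require(replicationFactor >= 1, "replication factor must be at least 1")
    replicationFactor - 1
  }
}
```

Sending two readings with the same key and then calling `read(topic.partitionFor(key))` returns them in the order they were sent. Order across different partitions is not modeled, which matches the per-partition scope of the guarantee.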

Akka in a nutshell
• Highly concurrent
• Reactive
• Fully distributed
• Completely elastic and resilient
[Diagram: four actors, each with its own mailbox]

KafkaStreamingActor
• Pulls from Kafka Queue
• Immediately saves to Cassandra Counter
kafkaStream.map { weather =>
  (weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

Temperature High/Low Stream
Weather Stations -> Receive API -> Producer -> Consumer -> TemperatureActor (x3)

TemperatureActor
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends WeatherActor with ActorLogging {

  def receive: Actor.Receive = {
    case e: GetDailyTemperature        => daily(e.day, sender)
    case e: DailyTemperature           => store(e)
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }
  // ... the class body continues on the following slides

TemperatureActor
/** Computes and sends the daily aggregation to the `requester` actor.
  * We aggregate this data on demand rather than in the stream.
  *
  * For the given day of the year, aggregates 0-23 temp values to statistics
  * (high, low, mean, std, etc.) and persists them to the Cassandra daily temperature
  * table by weather station. Rows are automatically sorted most recent first because
  * of our Cassandra schema, so you don't need to do a sort in Spark.
  *
  * Because the gov. data is not by interval (window/slide) but by specific date/time,
  * we look for historic data for hours 0-23 that may or may not exist yet
  * and create stats on whatever does exist at the time of the request.
  */
def daily(day: Day, requester: ActorRef): Unit =
  (for {
    aggregate <- sc.cassandraTable[Double](keyspace, rawtable)
      .select("temperature")
      .where("wsid = ? AND year = ? AND month = ? AND day = ?",
        day.wsid, day.year, day.month, day.day)
      .collectAsync()
  } yield forDay(day, aggregate)) pipeTo requester

TemperatureActor
/**
  * Only ever handles a small number of items (0-23).
  */
private def forDay(key: Day, temps: Seq[Double]): WeatherAggregate =
  if (temps.nonEmpty) {
    val stats = StatCounter(temps)
    val data = DailyTemperature(
      key.wsid, key.year, key.month, key.day,
      high = stats.max, low = stats.min,
      mean = stats.mean, variance = stats.variance, stdev = stats.stdev)
    self ! data
    data
  } else NoDataAvailable(key.wsid, key.year, classOf[DailyTemperature])
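
To see what `forDay` computes without a Spark dependency, here is a minimal plain-Scala sketch of the same statistics. `DailyStats` is a hypothetical name, not part of KillrWeather, and returning `None` is an assumed stand-in for the `NoDataAvailable` branch.

```scala
// Plain-Scala stand-in for the statistics used in forDay; DailyStats is a hypothetical name.
final case class DailyStats(high: Double, low: Double, mean: Double, variance: Double, stdev: Double)

object DailyStats {
  // Returns None for an empty day, mirroring the NoDataAvailable branch.
  def from(temps: Seq[Double]): Option[DailyStats] =
    if (temps.isEmpty) None
    else {
      val n = temps.size
      val mean = temps.sum / n
      // Population variance (divide by n), which is what StatCounter.variance reports.
      val variance = temps.map(t => (t - mean) * (t - mean)).sum / n
      Some(DailyStats(temps.max, temps.min, mean, variance, math.sqrt(variance)))
    }
}
```

With at most 24 hourly readings per station per day, a single in-memory pass like this is cheap, which is why the aggregation can run on demand.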

TemperatureActor
/** Asynchronously stores the daily temperature aggregates in the daily
  * temperature aggregation table. The store is triggered by on-demand
  * requests, via the `self ! data` in the `forDay` function.
  */
private def store(e: DailyTemperature): Unit =
  sc.parallelize(Seq(e)).saveToCassandra(keyspace, dailytable)

Token
•Consistent hash between -2^63 and 2^63 - 1
•Each node owns a range of those values
•The token is the beginning of that range, which runs up to the next node's token value
•Virtual Nodes break these ranges down further
[Diagram: data is hashed to a token, which falls inside one server's token range]
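
The ownership rule in the bullets can be sketched as a lookup on a sorted map. This toy follows the slide's simplified model, where a node's token marks the start of its range, and uses the 0-100 range from the next slides. `TokenRing` and `toyToken` are hypothetical names, and `toyToken` is an assumed stand-in for a real partitioner hash.

```scala
import scala.collection.immutable.TreeMap

// Toy token ring following the slide's model: a node's token marks the START of its range.
final class TokenRing(nodesByToken: TreeMap[Long, String]) {
  require(nodesByToken.nonEmpty, "the ring needs at least one node")

  // Owner = node with the largest token <= the data's token.
  // Tokens below the smallest node token wrap around to the last node.
  def ownerOf(token: Long): String =
    nodesByToken.rangeTo(token).lastOption.getOrElse(nodesByToken.last)._2
}

object TokenRing {
  // Hypothetical stand-in for the partitioner: maps a key onto the slide's 0-100 range.
  def toyToken(partitionKey: String): Long =
    Math.floorMod(partitionKey.hashCode, 101).toLong
}
```

For a ring built from `TreeMap(0L -> "A", 26L -> "B", 51L -> "C", 76L -> "D")`, token 15 belongs to node A and token 100 to node D. Adding a node only changes the owner for tokens in the range that node takes over.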

Cluster
Token   Range
0       0-100
[Diagram: one server owning 0-100]

Cluster
Token   Range
0       0-50
51      51-100
[Diagram: two servers owning 0-50 and 51-100]

Cluster
Token   Range
0       0-25
26      26-50
51      51-75
76      76-100
[Diagram: four servers owning 0-25, 26-50, 51-75, and 76-100]

Table
CREATE TABLE weather_station (
    id text,
    name text,
    country_code text,
    state_code text,
    call_sign text,
    lat double,
    long double,
    elevation double,
    PRIMARY KEY(id)
);
Callouts: Table Name, Column Name, Column CQL Type, Primary Key Designation, Partition Key

Queries supported
CREATE TABLE raw_weather_data (
    wsid text,
    year int,
    month int,
    day int,
    hour int,
    temperature double,
    dewpoint double,
    pressure double,
    wind_direction int,
    wind_speed double,
    sky_condition int,
    sky_condition_text text,
    one_hour_precip double,
    six_hour_precip double,
    PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Get weather data given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time

Replication
DC1: RF=1
Node       Primary
10.0.0.1   00-25
10.0.0.2   26-50
10.0.0.3   51-75
10.0.0.4   76-100
[Diagram: DC1 ring of four nodes, each holding only its primary range]

Replication
DC1: RF=2
Node       Primary   Replica
10.0.0.1   00-25     76-100
10.0.0.2   26-50     00-25
10.0.0.3   51-75     26-50
10.0.0.4   76-100    51-75
[Diagram: DC1 ring of four nodes, each holding its primary range plus one replica range]

Replication
DC1: RF=3
Node       Primary   Replica   Replica
10.0.0.1   00-25     76-100    51-75
10.0.0.2   26-50     00-25     76-100
10.0.0.3   51-75     26-50     00-25
10.0.0.4   76-100    51-75     26-50
[Diagram: DC1 ring of four nodes, each holding its primary range plus two replica ranges]
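
The replica tables on these slides follow one pattern: each range lives on its primary node plus the next RF - 1 nodes around the ring. A minimal sketch of that rule, assuming a single datacenter and nodes listed in ring order (`ReplicaPlacement` and `replicasFor` are hypothetical names):

```scala
object ReplicaPlacement {
  // `ring` lists (node address, primary range label) in ring order.
  // A range is stored on its primary node plus the next rf - 1 nodes clockwise,
  // which reproduces the RF tables on the slides for a single datacenter.
  def replicasFor(ring: Vector[(String, String)], range: String, rf: Int): Vector[String] = {
    require(rf >= 1 && rf <= ring.size, "rf must be between 1 and the number of nodes")
    val start = ring.indexWhere(_._2 == range)
    require(start >= 0, s"unknown range: $range")
    Vector.tabulate(rf)(i => ring((start + i) % ring.size)._1)
  }
}
```

For the four-node ring above with RF=3, range 00-25 lands on 10.0.0.1, 10.0.0.2, and 10.0.0.3, which matches the table: 10.0.0.2 and 10.0.0.3 both list 00-25 as a replica.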

Consistency
DC1: RF=3
Node       Primary   Replica   Replica
10.0.0.1   00-25     76-100    51-75
10.0.0.2   26-50     00-25     76-100
10.0.0.3   51-75     26-50     00-25
10.0.0.4   76-100    51-75     26-50
Client: Write to partition 15

Consistency level
Consistency Level   Number of Nodes Acknowledged
One                 One (read repair triggered)
Local One           One (read repair in local DC)
Quorum              Majority of replicas (more than half)
Local Quorum        Majority of replicas in the local DC
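
A quorum is a majority of the replicas, which works out to floor(RF / 2) + 1 acknowledgements. The sketch below shows that arithmetic, plus the commonly cited overlap rule that a read sees the latest acknowledged write when read acks + write acks > RF. `Consistency`, `quorum`, and `overlaps` are hypothetical helper names, not a driver API.

```scala
object Consistency {
  // Quorum is a majority of replicas: floor(rf / 2) + 1.
  def quorum(replicationFactor: Int): Int = {
    require(replicationFactor >= 1, "replication factor must be at least 1")
    replicationFactor / 2 + 1
  }

  // A read overlaps the latest acknowledged write when R + W > RF.
  def overlaps(readAcks: Int, writeAcks: Int, replicationFactor: Int): Boolean =
    readAcks + writeAcks > replicationFactor
}
```

With RF=3, `quorum(3)` is 2, so quorum writes and quorum reads overlap (2 + 2 > 3), while reading and writing at One does not (1 + 1 is not greater than 3).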

Consistency
DC1: RF=3
Node       Primary   Replica   Replica
10.0.0.1   00-25     76-100    51-75
10.0.0.2   26-50     00-25     76-100
10.0.0.3   51-75     26-50     00-25
10.0.0.4   76-100    51-75     26-50
Client: Write to partition 15, CL = One

Consistency
DC1: RF=3
Node       Primary   Replica   Replica
10.0.0.1   00-25     76-100    51-75
10.0.0.2   26-50     00-25     76-100
10.0.0.3   51-75     26-50     00-25
10.0.0.4   76-100    51-75     26-50
Client: Write to partition 15, CL = Quorum

Multi-datacenter
DC1: RF=3
Node       Primary   Replica   Replica
10.0.0.1   00-25     76-100    51-75
10.0.0.2   26-50     00-25     76-100
10.0.0.3   51-75     26-50     00-25
10.0.0.4   76-100    51-75     26-50
DC2: RF=3
Node       Primary   Replica   Replica
10.1.0.1   00-25     76-100    51-75
10.1.0.2   26-50     00-25     76-100
10.1.0.3   51-75     26-50     00-25
10.1.0.4   76-100    51-75     26-50
Client: Write to partition 15

Great combo
Store a ton of data
Analyze a ton of data

Great combo
Spark Streaming: Near Real-time
SparkSQL: Structured Data
MLlib: Machine Learning
GraphX: Graph Analysis

Great combo
Spark Streaming: Near Real-time
SparkSQL: Structured Data
MLlib: Machine Learning
GraphX: Graph Analysis
CREATE TABLE raw_weather_data (
    wsid text,
    year int,
    month int,
    day int,
    hour int,
    temperature double,
    dewpoint double,
    pressure double,
    wind_direction int,
    wind_speed double,
    sky_condition int,
    sky_condition_text text,
    one_hour_precip double,
    six_hour_precip double,
    PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Spark Connector

[Diagram: one Server running a Spark Master and a Worker with three Executors]

[Diagram: a Spark Master and four Workers. The full token range 0-100 is divided into Token Ranges 0-24, 25-49, 50-74, and 75-99, one per Worker]
A Worker: "I will only analyze 25% of the data."

[Diagram: a Spark Master and four Workers. The token ranges 0-24, 25-49, 50-74, and 75-99 appear on both the Analytics side and the Transactional side]

[Diagram: a Master and a Worker with three Executors. The Worker's token range 75-99 is read through the Spark Connector]
SELECT *
FROM keyspace.table
WHERE token(pk) > 75
AND token(pk) <= 99
Spark RDD: three Spark Partitions
Spark Connector
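
The slide shows one token range feeding three Spark partitions. One plausible way to picture this is to split the range into contiguous sub-ranges and issue one token-bounded query per sub-range. The sketch below is an illustration only, not the Spark Cassandra Connector's actual splitting logic; `TokenRangeScan`, `split`, and `cql` are hypothetical names, and the arithmetic assumes small toy ranges like the slide's.

```scala
object TokenRangeScan {
  // Splits the range (start, end] into `splits` contiguous sub-ranges.
  // Toy arithmetic: fine for small ranges like the slide's, not overflow-safe for full 64-bit tokens.
  def split(start: Long, end: Long, splits: Int): Vector[(Long, Long)] = {
    require(splits >= 1 && end > start, "need at least one split and a non-empty range")
    val width = end - start
    Vector.tabulate(splits) { i =>
      (start + width * i / splits, start + width * (i + 1) / splits)
    }
  }

  // Renders the CQL for one sub-range, in the shape shown on the slide.
  def cql(keyspace: String, table: String, pk: String, range: (Long, Long)): String =
    s"SELECT * FROM $keyspace.$table WHERE token($pk) > ${range._1} AND token($pk) <= ${range._2}"
}
```

Calling `split(75, 99, 3)` yields the sub-ranges (75, 83], (83, 91], and (91, 99]. Because each bound is exclusive on the left and inclusive on the right, every token is read exactly once.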

[Diagram: token range 75-99 loaded into a Spark RDD with three Spark Partitions, spread over the Worker's three Executors]

Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()

println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()

[Diagram: the Executor, through the Spark Connector, runs the query below and loads the result into a Spark Partition of a Spark RDD]
SELECT *
FROM isd_weather_data.raw_weather_data

Saving back the weather data
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6
    AND temperature != 0.0
    GROUP BY wsid, year, month, day;
  """)
  .map { row =>
    (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
      row.getDouble(4), row.getDouble(5))
  }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")

DStream

Sliding Windows

[Diagram: Kafka, Spark, Akka, and Cassandra clusters]
"I need CPU!!"
"I need memory!!"
"Got you covered"

[Diagram: Kafka, Akka, and Spark instances placed across the cluster, shown in two arrangements]

Kafka on Mesos example
Scheduler
• Provides the operational automation for a Kafka Cluster
• Manages the changes to the broker's configuration
• Exposes a REST API for the CLI to use or any other client
• Runs on Marathon for high availability
Executor
• The executor interacts with the Kafka broker as an intermediary to the scheduler

[Diagram: Cassandra, Akka, Spark, and Kafka running on Mesos]

Go get your SMACK on
Thank you!
Follow me on Twitter: @PatrickMcFadin
