Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9
Co-creator and maintainer of Spark Job Server
Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new "hot" data in stream processor
Combine with recent and historical top-K feature vectors in database
Update recent feature vectors in database
Serve to users
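A rough Spark Streaming sketch of this loop (the stream processor is covered later in the deck). The loadTopK and saveTopK helpers are hypothetical stand-ins for whatever database layer holds the feature vectors, and the socket source and event layout are made up for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical DB helpers - stand-ins for the store holding recent top-K feature vectors
def loadTopK(): Seq[(String, Int)] = Seq.empty
def saveTopK(topK: Seq[(String, Int)]): Unit = ()

val conf = new SparkConf().setAppName("trend-detection")
val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches

// Stand-in source; assume one CSV event per line: time,os,location,productId
val events = ssc.socketTextStream("localhost", 9999)
val counts = events.map(line => (line.split(",")(3), 1)).reduceByKey(_ + _)

counts.foreachRDD { rdd =>
  val hot = rdd.map(_.swap).top(10).map(_.swap)            // top K products in this batch
  // Merge the batch's hot items with the stored top-K and write it back
  val merged = (loadTopK() ++ hot).groupBy(_._1).mapValues(_.map(_._2).sum)
                 .toSeq.sortBy(-_._2).take(10)
  saveTopK(merged)
}

ssc.start()
ssc.awaitTermination()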
Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: "NoETL"
Community: best of breed OSS projects with huge adoption and commercial support
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in case of failure
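To make "add another consumer" and replay concrete, a minimal sketch using the Kafka consumer API (Kafka is introduced on the next slides); the broker address, group id, and topic name are placeholders. A new group.id gets its own full copy of the stream, and auto.offset.reset is what lets a fresh consumer replay the topic from the beginning.

import java.util.{Arrays, Properties}
import scala.collection.JavaConversions._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")                 // placeholder broker
props.put("group.id", "trend-detector")                          // a new group gets its own copy of the stream
props.put("key.deserializer",   "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")                       // replay the topic from the start

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("city-events"))                 // placeholder topic
while (true) {
  for (record <- consumer.poll(1000))
    println(s"${record.key} -> ${record.value}")
}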
Message queues help distribute data
[Diagram: Inputs 1-4 publish events into partitions keyed A-F, G-M, N-S, and T-Z, each consumed by its own processing node]
Intro to Apache Kafka
Kafka is a distributed publish-subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
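As a concrete illustration of publishing to the commit log, a minimal producer sketch; the broker address, topic, key, and payload are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Each event is appended to the topic's commit log and fanned out to all subscribers
producer.send(new ProducerRecord("city-events", "bus-42", "37.77,-122.42,12.5"))
producer.close()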
On being HARD
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product
The same codebase has been in use for years at LinkedIn, which answers the questions:
Does it scale?
Is it robust?
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings or byte arrays
Avro is a serialization format used extensively with Kafka and Big Data
Kafka uses a Schema Registry to keep track of Avro schemas
Verifies that the correct schemas are being used
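For illustration, a small sketch of defining an Avro schema and building a record with the Avro Java API; the BusEvent schema here is invented for this example. The Schema Registry then assigns each registered schema an id, so producers and consumers can check they agree on the schema a message was written with.

import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

// Hypothetical schema for bus telemetry events
val schemaJson = """
  {"type": "record", "name": "BusEvent", "fields": [
     {"name": "busId",     "type": "string"},
     {"name": "timestamp", "type": "long"},
     {"name": "speed",     "type": "double"}
  ]}"""

val schema = new Schema.Parser().parse(schemaJson)

val event = new GenericData.Record(schema)
event.put("busId", "bus-42")
event.put("timestamp", 1450000000000L)
event.put("speed", 12.5)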
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
The Design section is a really good read
http://www.confluent.io/product
Includes schema registry
Akka and Gearpump
Actor-to-actor messaging. Local state.
Used for extremely low latency (ad networks, etc.)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don't always need distribution!
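A minimal sketch of actor-to-actor messaging with local state using the Akka actor API; the actor and message types are invented for illustration.

import akka.actor.{Actor, ActorSystem, Props}

case class BusPosition(busId: String, lat: Double, lon: Double)

// Keeps the latest position per bus as local actor state
class PositionTracker extends Actor {
  private var latest = Map.empty[String, BusPosition]
  def receive = {
    case p: BusPosition =>
      latest += (p.busId -> p)
      println(s"tracking ${latest.size} buses")
  }
}

val system  = ActorSystem("smart-city")
val tracker = system.actorOf(Props[PositionTracker], "tracker")
tracker ! BusPosition("bus-42", 37.77, -122.42)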
Spark Streaming
Data processed as a stream of micro-batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
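To make "same programming model as batch" concrete, a minimal streaming word count sketch; the socket source and 1-second batch interval are placeholders. Compare it with the batch word count that follows.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(new SparkConf().setAppName("streaming-wordcount"), Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)   // stand-in source

lines.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()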
Why Spark?
val file = sc.textFile("hdfs://...")           // sc: SparkContext
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
vs. the equivalent WordCount in Hadoop MapReduce:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
Benefits of Unified Libraries
Optimizations can be shared between libraries
[Diagram: Spark Core (Project Tungsten - GC and memory management) shared by MLlib (shared statistics libraries) and Spark Streaming]
Mix and match modules
Easily go from DataFrames (SQL) to MLlib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
Spark SQL Data Sources API
Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst
Production implementations:
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)
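An application picks one of these implementations through the DataFrame reader's format name; a minimal sketch using spark-csv, where the file path and table name are placeholders.

// sqlContext: org.apache.spark.sql.SQLContext (as in the Spark 1.x shell)
val df = sqlContext.read
  .format("com.databricks.spark.csv")      // data source implementation selected by name
  .option("header", "true")
  .load("hdfs:///data/bus_events.csv")     // placeholder path

df.registerTempTable("bus_events")
sqlContext.sql("SELECT COUNT(*) FROM bus_events").show()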
Streaming Fault Tolerance
Incoming data is replicated to 1 other node
Write Ahead Log for sources that support ACKs
Checkpointing for recovery if the Driver fails
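A sketch of enabling these mechanisms in a Spark 1.x streaming job; the application name and checkpoint directory are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("fault-tolerant-stream")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL for receiver-based sources
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint("hdfs:///checkpoints/stream")   // placeholder checkpoint directory
  // ... define the DStream transformations here ...
  ssc
}

// Rebuilds the context from the checkpoint if the driver died, otherwise creates it fresh
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/stream", createContext _)
ssc.start()
ssc.awaitTermination()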
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets
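A sketch of the direct (receiver-less) approach using the Spark 1.x Kafka integration; the broker address and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")   // placeholder broker
val topics      = Set("city-events")                                // placeholder topic

// One KafkaRDD partition per Kafka topic partition; offsets live in the RDD, not a WAL
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()   // records arrive as (key, value) pairs
ssc.start()
ssc.awaitTermination()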
What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: updateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
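A minimal sketch of updateStateByKey keeping running per-key counts across micro-batches; the source and checkpoint path are placeholders (stateful operations require checkpointing).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("running-counts"), Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/state")   // placeholder; required for stateful DStream ops

// Stand-in source: one product id per line
val pairs = ssc.socketTextStream("localhost", 9999).map(id => (id, 1))

// Fold each batch's new counts into the running per-key state
val running = pairs.updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))
}
running.print()
ssc.start()
ssc.awaitTermination()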
Fault Tolerance & Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
[Diagram: each partition (Partition 1-3) holds rows ordered by clustering key (Clustering 1-3)]
City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of a single bus for a range of time
Can also query most recent location + speed of all buses (slower)
[Diagram: one partition per bus (Bus A, B, C), with columns clustered by timestamp (1020 s, 1010 s, 1000 s), each holding speed and GPS]
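A sketch of writing and querying this model with the spark-cassandra-connector; the keyspace, table, and column names are assumptions chosen to match the primary key above, and sc is an existing SparkContext.

import com.datastax.spark.connector._

case class BusReading(busId: String, ts: Long, speed: Double, lat: Double, lon: Double)

// Assumed table: city.bus_telemetry with PRIMARY KEY (bus_id, ts)
val readings = sc.parallelize(Seq(BusReading("bus-42", 1020L, 12.5, 37.77, -122.42)))
readings.saveToCassandra("city", "bus_telemetry",
  SomeColumns("bus_id", "ts", "speed", "lat", "lon"))

// Easy query: one bus over a time range hits a single partition
val recent = sc.cassandraTable[BusReading]("city", "bus_telemetry")
  .where("bus_id = ? AND ts > ?", "bus-42", 1000L)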
Using Cassandra for Short Term Storage
The idea is to store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store the last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
I need to read lots of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing
Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?
Lambda is Hard and Expensive
Very high TCO - many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places
Can Cassandra do batch and ad-hoc?
Yes, it can be competitive with Hadoop actually…
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL / DataFrames (JDBC etc.)
Uses Cassandra for robust, proven storage
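A sketch of reading a FiloDB dataset through the Spark SQL Data Sources API, assuming the data source is registered as "filodb.spark" with a "dataset" option as described in the project README; the dataset and column names are placeholders.

// sqlContext: org.apache.spark.sql.SQLContext, with the FiloDB Spark package on the classpath
val df = sqlContext.read.format("filodb.spark")
  .option("dataset", "bus_events")          // placeholder dataset name
  .load()

df.registerTempTable("bus_events")
sqlContext.sql("SELECT busId, AVG(speed) FROM bus_events GROUP BY busId").show()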
Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted / scored data
FiloDB + Cassandra
Robust, peer-to-peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models