Kafka, Spark and Cassandra
instaclustr.com
Who am I and what do I do?
• Ben Bromhead
• Co-founder and CTO of Instaclustr -> www.instaclustr.com
• Instaclustr provides Cassandra-as-a-Service in the cloud.
• Currently runs in IBM SoftLayer, AWS and Azure
• 500+ nodes under management
What will this talk cover?
• An introduction to Cassandra
• An introduction to Spark
• An introduction to Kafka
• Building a data pipeline with Cassandra, Spark & Kafka
What happens when you have more data than can fit on a single server?
Throw money at the problem
Introducing Cassandra
• BigTable (2006) - 1 key: lots of values, fast sequential access
• Dynamo (2007) - Reliable, Performant, Always On
• Cassandra (2008) - Dynamo architecture, BigTable data model and storage
One database, many servers
• All servers (nodes) participate in the cluster
• Shared nothing
• Need more capacity? Add more servers
• Multiple servers == built-in redundancy
[Diagram: nodes arranged in a token ring]
How does it work?
[Diagram: a three-node ring with tokens 0, 4 and 28]
Partitioning
Name Age Postcode Gender
Alice 34 2000 F
Bob 26 2000 M
Eve 25 2004 F
Frank 41 2902 M
How does it work?
[Diagram: the client computes consistentHash("Alice") to find the node on the ring that owns that row — a toy sketch of the idea follows]
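To make the routing concrete, here is a toy sketch of the idea in Scala. It is NOT Cassandra's real Murmur3 partitioner — the hash function, the 0..31 ring and the three tokens are illustrative assumptions that simply mirror the diagram above.

// A toy sketch of partition-key routing, not Cassandra's real Murmur3Partitioner
val tokens = Vector(0L, 4L, 28L)                     // one token per node, as in the diagram
def toyHash(key: String): Long =
  math.abs(key.hashCode.toLong) % 32                 // map the partition key onto a 0..31 ring

def ownerToken(key: String): Long = {
  val h = toyHash(key)
  // a node owns the range (previous token, its own token], wrapping around the ring
  tokens.find(_ >= h).getOrElse(tokens.head)
}

ownerToken("Alice")   // the coordinator forwards the write to the node owning this token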
A brief intro to tuneable consistency
• Cassandra is considered to be a database that favours Availability and Partition Tolerance.
• Lets you change those characteristics per query to suit your application requirements.
• Define your replication factor at the schema level
• Define your consistency level at query time (see the sketch below)
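A minimal sketch of both knobs, using the DataStax Java driver from Scala; the contact point and the demo keyspace/table (matching the partitioning example above) are assumptions for illustration.

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Replication factor is declared once, on the keyspace (schema level)
session.execute(
  "CREATE KEYSPACE IF NOT EXISTS demo " +
  "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}")
session.execute(
  "CREATE TABLE IF NOT EXISTS demo.users " +
  "(name text PRIMARY KEY, age int, postcode text, gender text)")

// Consistency level is chosen per query
val read = new SimpleStatement("SELECT * FROM demo.users WHERE name = 'Alice'")
read.setConsistencyLevel(ConsistencyLevel.QUORUM)
val alice = session.execute(read).one()

At CL.QUORUM with a replication factor of 3, the query succeeds as long as 2 of the 3 replicas respond.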
How does it work?
[Diagram: with Replication Factor = 3, consistentHash("Alice") picks the owning node and the row is also replicated to the next two nodes on the ring]
What are the benefits of this approach?
• Linear scalability
• High Availability
• Use commodity hardware
Linear scalability
                             48 Nodes    96 Nodes    144 Nodes   288 Nodes
Writes per second per node   10,900      11,460      11,900      11,456
Mean Latency                 0.0117ms    0.0134ms    0.0148ms    0.0139ms
Cluster Writes per second    174,373     366,828     537,172     1,099,837
Linear scalability
[Chart: cluster writes per second growing linearly with node count]
High Availability
“During Hurricane Sandy, we lost an entire data center. Completely. Lost. It. Our application fail-over resulted in us losing just a few moments of serving requests for a particular region of the country, but our data in Cassandra never went offline.”
— Nathan Milford, Outbrain’s head of U.S. IT operations management
Commodity Hardware
How do we keep data consistent?
[Diagram: the client writes "Alice" at CL.QUORUM (50% + 1); with RF = 3, two replicas acknowledge (Ack) and one is down (X) — the quorum is met, so the write succeeds]
Add capacity
[Diagram: a new node joins the ring, token ranges are rebalanced, and consistentHash("Alice") routes the client to the updated owner]
Analytics & Cassandra
• What about ad-hoc queries?
• What was the minimum, maximum and average latency for a given client?
• Give me all devices that had a temperature > 40 for longer than 20 minutes
• Top 10 locations where vehicles recorded speeds > 60
Introducing Spark
• A distributed computing engine.
• For very large datasets.
• Essentially a way to run any query or algorithm over a very large set of data.
• Works with the existing Hadoop ecosystem.
Spark
• Faster and better than Hadoop
• In-memory (100x faster)
• Intelligent caching on disk (10x faster)
• Fault tolerant, immutable datasets and intermediate steps (DRY)
• Sane API and integrations (never write MapReduce jobs again)
Spark vs Hadoop
[Diagram: Hadoop vs Spark data flow — initial input, intermediate step, final output; Hadoop writes intermediate steps to disk while Spark keeps them in memory]
How fast is Spark?

                          Hadoop MR Record          Spark Record             Spark 1PB
Data Size                 102.5 TB                  100 TB                   1000 TB
Elapsed Time              72 mins                   23 mins                  234 mins
# Nodes                   2100                      206                      190
# Cores                   50400 physical            6592 virtualized         6080 virtualized
Cluster disk throughput   3150 GB/s                 618 GB/s                 570 GB/s
Daytona Rules             Yes                       Yes                      No
Network                   dedicated data center,    virtualized (EC2)        virtualized (EC2)
                          10Gbps                    10Gbps network           10Gbps network
Sort rate                 1.42 TB/min               4.27 TB/min              4.27 TB/min
Sort rate/node            0.67 GB/min               20.7 GB/min              22.5 GB/min
A quick use case
Square Kilometre Array (SKA)
700 TB/second of raw data
Spark is part of the data processing pipeline
See http://www.slideshare.net/SparkSummit/spark-at-nasajplchris-mattmann
Spark is also easy
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark word count
// assumes `spark` is a SparkContext with the Cassandra connector imported,
// and that the table has a text column to tokenize (column name is illustrative)
spark.cassandraTable("Keyspace", "Table")
  .flatMap(row => row.getString("text_column").split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://...")
Spark
Hadoop word count
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Spark
public class WordCount {
  public static void main(String[] args) {
    // spark here is an existing JavaSparkContext
    JavaRDD<String> textFile = spark.textFile("hdfs://...");
    JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });
    JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    });
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });
    counts.saveAsTextFile("hdfs://...");
  }
}
Spark word count
Spark
What happens under the hood?
val textFile = spark.textFile("s3a://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()
[DAG: textFile → flatMap → map → reduceByKey → collect]
Spark
Spark + Cassandra
• Analytics on top of a super-fast operational DB?
• Use https://github.com/datastax/spark-cassandra-connector (a minimal sketch follows)
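A minimal read-side sketch in Scala, assuming the connector is on the classpath; the keyspace, table and column names (sensors.readings, device_id, temperature) are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-analytics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of CassandraRow and answer an ad-hoc question:
// the maximum temperature recorded per device
val maxPerDevice = sc.cassandraTable("sensors", "readings")
  .map(row => (row.getString("device_id"), row.getDouble("temperature")))
  .reduceByKey((a, b) => math.max(a, b))

maxPerDevice.collect().foreach(println)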
Spark + Cassandra
[Diagram: a Spark Worker on each server launches Executors alongside the local Cassandra process]
Spark + Cassandra
[Diagram: a Spark Worker runs on every Cassandra node in the ring, coordinated by a Spark Master — processing runs where the data lives]
Spark + Cassandra
• We now have a great platform to work with data already in Cassandra
• But what if we have a stream of data (e.g. a lot of devices sending stuff to us all the time)?
• We want answers now, as events happen
Spark + Cassandra
[Diagram: batch processing vs streaming]
Spark + Cassandra
• After you have performed calculations on the DStream object provided by Spark Streaming, persist the results to Cassandra
• Simply call .saveToCassandra(keyspace, table) — a minimal sketch follows
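A minimal sketch, assuming a DStream of (word, count) pairs has already been computed and that a demo.word_counts(word text PRIMARY KEY, total int) table exists — both names are illustrative.

import com.datastax.spark.connector._                // SomeColumns and column-name helpers
import com.datastax.spark.connector.streaming._      // adds saveToCassandra to DStreams
import org.apache.spark.streaming.dstream.DStream

// Persist every micro-batch of (word, count) pairs to the assumed demo.word_counts table
def persist(wordCounts: DStream[(String, Int)]): Unit =
  wordCounts.saveToCassandra("demo", "word_counts", SomeColumns("word", "total"))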
Spark + Cassandra
• What about ingest?
• Enter Kafka
What is Kafka
• Unified platform for handling message feeds (aka message bus)
• High Volume
• Derived Feeds
• Supports large feeds from offline ingest
• Low-latency messaging
• Fault tolerance during machine failure
What is Kafka
• Publish / Subscribe message architecture
• Consumers “receive” messages
• Publishers “send” messages
• Message routing is determined by the message’s “Topic”
• Topics are split into partitions, which are replicated (a minimal producer sketch follows)
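A minimal publish-side sketch with the Kafka Java producer client, from Scala; the broker address and the sensor-readings topic are assumptions for illustration.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Publish one reading; the key picks the partition of the topic the message lands on
producer.send(new ProducerRecord[String, String]("sensor-readings", "device-42", "21.5"))
producer.close()

The record key determines the partition, which in turn fixes the ordering guarantee: messages are ordered within a partition, not across the whole topic.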
What is Kafka
[Diagrams: Kafka topics split into partitions and spread across brokers — source: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/]
A quick case study
• Kafka at LinkedIn
• 15 brokers
• 15,500 partitions (replication factor 2)
• 400,000 msg/s
• event processing
Spark + Cassandra + Kafka
• Why Kafka?
• Pluggable receivers for MQTT and HTTP ingest
• Spark Streaming can consume directly from a Kafka topic and write to Cassandra — a minimal end-to-end sketch follows
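A minimal end-to-end sketch (Spark 1.x direct Kafka stream); the broker address, topic name and the demo.word_counts table are assumptions carried over from the earlier sketches.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(5))

// Consume the topic directly from the Kafka brokers
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("sensor-readings"))

// Count words in the message values and persist every micro-batch to Cassandra
stream.map { case (_, value) => value }
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("demo", "word_counts", SomeColumns("word", "total"))

ssc.start()
ssc.awaitTermination()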
Spark + Cassandra + Kafka
Putting it all together:
[Diagram: devices publish via an MQTT bridge into a Kafka cluster; a Spark Master and Workers co-located with the Cassandra ring consume, process and persist the stream]
Spark + Cassandra + Kafka
Putting it all together:
[Diagram: the same pipeline — MQTT bridge → Kafka cluster → Spark (Master + Workers) → Cassandra ring — labelled as a Lambda Architecture!]
• A highly distributed, resilient and highly available ingest, compute and storage engine.
• Leverage additional Spark libraries to add capabilities to your project.
• Bonus: SparkML (MLlib) brings machine learning and artificial intelligence to your data pipeline — see the sketch below.
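A minimal sketch of that bonus using Spark’s RDD-based MLlib; the sensors.device_stats table and its columns are made-up names, and sc is a SparkContext configured for the Cassandra connector as in the earlier sketch.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import com.datastax.spark.connector._

// Cluster devices into groups of similar behaviour, straight off a Cassandra table
def clusterDevices(sc: SparkContext): KMeansModel = {
  val features = sc.cassandraTable("sensors", "device_stats")
    .map(row => Vectors.dense(row.getDouble("avg_temperature"), row.getDouble("avg_speed")))
    .cache()
  KMeans.train(features, 3, 20)      // k = 3 clusters, 20 iterations
}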
Spark + Cassandra + Kafka
Questions