#bigdatabe @maasg
Data Analytics with Apache Spark and Cassandra
BigData.be Meetup 8/Sep/2015
Gerard Maas @maasg
Data Processing Team Lead
Tweet a few keywords about your interests and experience.
Use the hashtag “#bigdatabe”
Agenda
Motivation
Sparkling Refreshment
Quick Cassandra Overview
Connecting the Dots . . .
Examples
Resources
Scalability
Availability
Resilience
Memory, CPUs, Network
What is Apache Spark?
Spark is a fast and general engine for large-scale distributed data processing.
Fast Functional
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Growing Ecosystem
The Big Idea...
Express computations in terms of transformations and actions
on a distributed data set.
Spark Core Concept: RDD => Resilient Distributed Dataset
Think of an RDD as an immutable, distributed collection of objects
• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)
RDDs are memory-intensive. Caching behavior is controllable.
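The split between transformations and actions can be tried out without a cluster. In Spark, transformations only describe work and an action triggers it; a plain Scala Iterator behaves in a similar way, so the sketch below uses one as a stand-in. The object name LazyPipelineSketch and the run helper are made up for this illustration and are not part of Spark.

```scala
object LazyPipelineSketch {
  // Returns (result, evaluations before the action, evaluations after it).
  def run(data: Seq[Int]): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformations": describe the work; nothing is computed yet.
    val doubled = data.iterator.map { x => evaluated += 1; x * 2 }
    val kept = doubled.filter(_ % 4 == 0)
    val before = evaluated
    // "Action": forces the whole pipeline and produces a value.
    val result = kept.sum
    (result, before, evaluated)
  }
}
```

Calling LazyPipelineSketch.run(Seq(1, 2, 3, 4)) shows that no element is touched until the sum is requested.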
RDDs
[Figure: an RDD split into partitions]
RDDs
.textFile("...") .flatMap(l => l.split(" ")) .map(w => (w,1)) .reduceByKey(_ + _)
[Figure: each partition holds (w, 1) pairs; reduceByKey sums them per partition, then combines the partial sums across partitions]
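The partition-by-partition behaviour of this pipeline can be imitated with plain Scala collections. This is only a sketch under the assumption that each inner Seq plays the role of one partition; ReduceByKeySketch and its helpers are hypothetical names, and real Spark distributes this work across executors.

```scala
object ReduceByKeySketch {
  // Hypothetical input model: each inner Seq stands for one partition of lines.
  type Partition = Seq[String]

  // flatMap + map, followed by a local sum inside a single partition.
  def countPartition(lines: Partition): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Combine two sets of partial counts, as the final reduce step would.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Count per partition first, then merge the partial results.
  def wordCount(partitions: Seq[Partition]): Map[String, Int] =
    partitions.map(countPartition).foldLeft(Map.empty[String, Int])(merge)
}
```

Summing inside each partition before merging keeps the amount of data that has to move between partitions small, which is the point of the partial sums in the picture.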
RDD Lineage
Each RDD keeps track of its parent.
This is the basis for DAG scheduling and fault recovery.
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }
Lineage: HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD (wordsRDD) -> MappedRDD (scoreRDD)
rdd.toDebugString is your friend
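The idea that every dataset remembers its parent can be modelled in a few lines. The sketch below is a toy model, not Spark's implementation: Node, derive, and lineage are hypothetical names, and a real RDD also records how to recompute its partitions from the parent.

```scala
object LineageSketch {
  // Hypothetical stand-in for an RDD: a label plus a reference to its parent.
  final case class Node(label: String, parent: Option[Node])

  // Deriving a new node records where it came from.
  def derive(parent: Node, label: String): Node = Node(label, Some(parent))

  // Walk back to the source and return the chain, oldest ancestor first.
  def lineage(node: Node): List[String] = {
    @annotation.tailrec
    def loop(current: Option[Node], acc: List[String]): List[String] =
      current match {
        case Some(n) => loop(n.parent, n.label :: acc)
        case None    => acc
      }
    loop(Some(node), Nil)
  }
}
```

Because the chain back to the source is always available, a lost result can be rebuilt by replaying the steps from the nearest surviving ancestor.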
What is Apache Cassandra?
Cassandra is a distributed, high-performance, scalable, and fault-tolerant column-oriented “NoSQL” database.
Bigtable: Data Model
- wide rows, sparse arrays
- high write throughput
Dynamo: Infrastructure
- P2P gossip
- “kv” store
- Tunable consistency
Cassandra Architecture
- Nodes use gossip to communicate ring state
- Data is distributed over the cluster
- Each node is responsible for a segment of tokens
- Data is replicated to n (configurable) nodes
CREATE TABLE meetup.tweets(
  handle TEXT,
  ts TIMESTAMP,
  txt TEXT,
  PRIMARY KEY (handle, ts)
);
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441709070, 'working on my presentation');
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('peter_v', 1441721070, 'meetup tonight!!!');
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441719110, 'almost ready');
maasg    1441709070  working on my presentation
         1441719110  almost ready
peter_v  1441721070  meetup tonight!!!
maasg    1441709070  working on my presentation
         1441719110  almost ready
peter_v  1441721070  meetup tonight!!!
...
Partition Key: handle
Clustering Key: ts
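One way to picture the roles of the two keys is as a map of sorted maps: the partition key selects a partition, and the clustering key orders the rows inside it. The sketch below is an in-memory analogy only; TweetsTableSketch and its helpers are hypothetical and say nothing about how Cassandra stores data on disk.

```scala
import scala.collection.immutable.SortedMap

object TweetsTableSketch {
  // partition key (handle) -> rows ordered by clustering key (ts) -> txt
  type Table = Map[String, SortedMap[Long, String]]

  val empty: Table = Map.empty

  // Inserting with an existing (handle, ts) pair overwrites the row (an upsert).
  def insert(table: Table, handle: String, ts: Long, txt: String): Table = {
    val partition = table.getOrElse(handle, SortedMap.empty[Long, String])
    table.updated(handle, partition.updated(ts, txt))
  }

  // All rows of one partition, in clustering-key order.
  def rows(table: Table, handle: String): List[(Long, String)] =
    table.get(handle).map(_.toList).getOrElse(Nil)
}
```

Reading all tweets of one handle touches a single partition and returns them already sorted by ts, which is why this table layout suits per-user timelines.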
Cassandra Architecture
[Figure: token ring labelled 000, 200, 400, 600, 800, 1000]
Murmur3Hash(“maasg”) = 451: the “maasg” partition (1441709070 working on my presentation; 1441719110 almost ready) is placed on the ring by its token
Murmur3Hash(“peter_v”) = 42: the “peter_v” partition (1441721070 meetup tonight!!!) is placed on the ring by its token
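The placement rule can be sketched with the simplified token values used on these slides. The code assumes, purely for illustration, that five nodes sit at tokens 200, 400, 600, 800, and 1000, that each node owns the range (previous token, own token], and that replicas are the next nodes clockwise on the ring. TokenRingSketch is a hypothetical name, and the hash values are taken from the slides rather than computed.

```scala
object TokenRingSketch {
  // Assumed node tokens; each node owns the range (previous token, own token].
  val nodeTokens: Vector[Int] = Vector(200, 400, 600, 800, 1000)

  // Index of the node that owns a token; tokens past the last node wrap around.
  def primaryIndex(token: Int): Int = {
    val i = nodeTokens.indexWhere(token <= _)
    if (i >= 0) i else 0
  }

  // The owner plus the following nodes clockwise, n replicas in total.
  def replicas(token: Int, n: Int): Vector[Int] = {
    val start = primaryIndex(token)
    Vector.tabulate(math.min(n, nodeTokens.size)) { k =>
      nodeTokens((start + k) % nodeTokens.size)
    }
  }
}
```

Under these assumptions, token 451 (“maasg”) belongs to the node at 600 and token 42 (“peter_v”) to the node at 200, so both tweets by maasg end up on the same node.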
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
“This library lets you expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark
applications.”
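As a rough sketch of what that looks like in practice, the program below reads the meetup.tweets table as an RDD, counts tweets per handle, and writes the result back. It assumes Spark and the connector are on the classpath and that a Cassandra node is reachable at 127.0.0.1; the target table meetup.tweet_counts (handle text PRIMARY KEY, total int) is hypothetical and would have to be created first.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object TweetsPerHandle {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tweets-per-handle")
      .setMaster("local[*]")
      // Assumption: a Cassandra node is listening on localhost.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    try {
      // Expose the Cassandra table as an RDD of rows.
      val tweets = sc.cassandraTable("meetup", "tweets")

      // Ordinary Spark transformations on top of it.
      val counts = tweets
        .map(row => (row.getString("handle"), 1))
        .reduceByKey(_ + _)

      // Write the RDD to the hypothetical tweet_counts table.
      counts.saveToCassandra("meetup", "tweet_counts", SomeColumns("handle", "total"))
    } finally {
      sc.stop()
    }
  }
}
```

The same import also makes joinWithCassandraTable and repartitionByCassandraReplica available on RDDs, which matter once data locality becomes a concern.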
Developer earnings vs tech skills
[Figure: RDD partitions mapped onto the Cassandra token ring (000, 200, 400, 600, 800, 1000)]
cassandraTable, joinWithCassandraTable, repartitionByCassandraReplica
Examples
Spark Notebook Software: https://github.com/andypetrella/spark-notebook
Meetup Notebooks: https://github.com/maasg/spark-notebooks
Resources
Project website: http://spark.apache.org/
Spark presentations: http://spark-summit.org/2015
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark
More Advanced Questions: user@spark.apache.org
Source Code: https://github.com/apache/spark
Getting involved: http://spark.apache.org/community.html
Resources
Project website: http://cassandra.apache.org/
Community Site: www.planetcassandra.org
Questions: http://stackoverflow.com/questions/tagged/cassandra
Training: https://academy.datastax.com/
Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
Excellent deep-dive in data locality implementation:
http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
Resources
Spark-Notebook: https://github.com/andypetrella/spark-notebook
Meetup code: https://github.com/maasg/spark-notebooks
Slides (soon): http://www.virdata.com/category/tech/
Acknowledgments
Want to work with
this exciting tech?
We are hiring!

Data Analytics with Apache Spark and Cassandra