Data Analytics with Apache Spark and Cassandra

#bigdatabe @maasg
BigData.be Meetup 8/Sep/2015
Gerard Maas @maasg
Data Processing Team Lead

Tweet a few keywords about your interests and experience.
Use hashtag “#bigdatabe”

Agenda
Motivation
Sparkling Refreshment
Quick Cassandra Overview
Connecting the Dots . . .
Examples
Resources

Memory, CPUs, Network

What is Apache Spark?
Spark is a fast and general engine for large-scale distributed data processing.
Fast
Functional
// `spark` is the SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Growing Ecosystem

The Big Idea...
Express computations in terms of transformations and actions
on a distributed data set.
Spark Core Concept: RDD => Resilient Distributed Dataset
Think of an RDD as an immutable, distributed collection of objects
• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)
RDDs are memory-intensive. Caching behavior is controllable.
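A minimal sketch of the idea above, assuming Spark's Scala API is on the classpath and Spark runs in local mode. The object name, app name and sample lines are illustrative and not from the talk. Transformations (flatMap, filter) only describe new RDDs; actions (count, collect) trigger the computation; cache() is the opt-in caching control.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-alone example; runs Spark locally with 2 threads.
object RddBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Distribute a small in-memory collection over 2 partitions.
      val lines = sc.parallelize(Seq("to be or not to be", "that is the question"), 2)

      // Transformations: lazily describe new RDDs, nothing is computed yet.
      val words = lines.flatMap(_.split(" "))
      val longWords = words.filter(_.length > 2)

      // Caching is opt-in: keep this RDD in memory once it has been computed.
      longWords.cache()

      // Actions: trigger the computation and return results to the driver.
      println(longWords.count())
      println(longWords.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The second action reuses the cached partitions instead of recomputing the filter.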

RDDs
[Figure: an RDD made up of several partitions]

RDDs
[Figure: .textFile("...") loads the partitions, then .flatMap(l => l.split(" ")) splits each line into words]

RDDs
[Figure: .textFile("..."), then .flatMap(l => l.split(" ")), then .map(w => (w,1)); every word in every partition is paired with the count 1]

RDDs
[Figure: .reduceByKey(_ + _) combines the (word, 1) pairs of each partition into partial counts per word]

RDDs
[Figure: .reduceByKey(_ + _) continued; the partial counts are merged into one total per word (7, 5, 7, 3)]

RDDs
[Figure: .reduceByKey(_ + _) completed; the resulting RDD holds the totals per word]
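The two steps of reduceByKey can be modelled with plain Scala collections. This is a sketch only, not Spark code: each inner Seq stands in for one partition, and the names (ReduceByKeySketch, toPairs, combineLocally, mergePartials) are made up for illustration.

```scala
object ReduceByKeySketch {
  // Split lines into words and pair every word with the count 1.
  def toPairs(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).map(word => (word, 1))

  // Sum the values per key.
  private def sumByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // Step 1: partial sums inside a single partition.
  def combineLocally(partition: Seq[(String, Int)]): Map[String, Int] =
    sumByKey(partition)

  // Step 2: merge the partial sums of all partitions into one total per key.
  def mergePartials(partials: Seq[Map[String, Int]]): Map[String, Int] =
    sumByKey(partials.flatMap(_.toSeq))

  def main(args: Array[String]): Unit = {
    // Two "partitions" of sample lines.
    val partitions = Seq(Seq("a b a", "c a"), Seq("b c", "a b b")).map(toPairs)
    val partials = partitions.map(combineLocally)
    println(mergePartials(partials).toSeq.sorted.mkString(", "))
  }
}
```

In Spark the merge step happens after a shuffle that brings all partial sums for a key to the same partition; the model skips that data movement.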

RDD Lineage
Each RDD keeps track of its parent.
This is the basis for DAG scheduling and fault recovery.
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }
[Figure: lineage chain HadoopRDD -> MappedRDD -> FlatMappedRDD -> MappedRDD -> MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD (wordsRDD) -> MappedRDD (scoreRDD)]
rdd.toDebugString is your friend

What is Apache Cassandra?
Cassandra is a distributed, high-performance, scalable and fault-tolerant column-oriented “NoSQL” database.
Data Model (from Bigtable)
- wide rows, sparse arrays
- high write throughput
Infrastructure (from Dynamo)
- P2P gossip
- “kv” store
- Tunable consistency

Cassandra Architecture
- Nodes use gossip to communicate ring state
- Data is distributed over the cluster
- Each node is responsible for a segment of tokens
- Data is replicated to n (configurable) nodes

CREATE TABLE meetup.tweets(
  handle TEXT,
  ts TIMESTAMP,
  txt TEXT,
  PRIMARY KEY (handle, ts)
);
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441709070, 'working on my presentation');
handle  | ts         | txt
maasg   | 1441709070 | working on my presentation

INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('peter_v', 1441719070, 'meetup tonight!!!');
handle  | ts         | txt
maasg   | 1441709070 | working on my presentation
peter_v | 1441719070 | meetup tonight!!!

INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441719110, 'almost ready');
handle  | ts         | txt
maasg   | 1441709070 | working on my presentation
maasg   | 1441719110 | almost ready
peter_v | 1441719070 | meetup tonight!!!

[Figure: the tweets laid out as wide rows, one row per handle, with further (ts, txt) entries appended as tweets arrive]
Partition Key: handle
Clustering Key: ts
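The partition key / clustering key layout can be modelled in plain Scala as a map of sorted maps. This is only a sketch of the storage idea, not how Cassandra is implemented; WideRowSketch, Table and insert are illustrative names, and the sample rows are the tweets from the slides.

```scala
import scala.collection.immutable.SortedMap

object WideRowSketch {
  // partition key (handle) -> entries ordered by clustering key (ts) -> txt
  type Table = Map[String, SortedMap[Long, String]]

  // Writing an existing (handle, ts) pair overwrites the old txt (an upsert).
  def insert(table: Table, handle: String, ts: Long, txt: String): Table = {
    val partition = table.getOrElse(handle, SortedMap.empty[Long, String])
    table.updated(handle, partition.updated(ts, txt))
  }

  val tweets: Table = Seq(
    ("maasg", 1441709070L, "working on my presentation"),
    ("peter_v", 1441719070L, "meetup tonight!!!"),
    ("maasg", 1441719110L, "almost ready")
  ).foldLeft(Map.empty[String, SortedMap[Long, String]]) {
    case (table, (handle, ts, txt)) => insert(table, handle, ts, txt)
  }

  def main(args: Array[String]): Unit =
    tweets.foreach { case (handle, rows) =>
      println(s"$handle: " + rows.map { case (ts, txt) => s"($ts, $txt)" }.mkString(" "))
    }
}
```

All tweets of one handle live together and stay ordered by ts, which is why a time-range query within a single handle is cheap.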

Cassandra Architecture
[Figure: token ring with positions 0/1000, 200, 400, 600 and 800; the maasg row is hashed to find its place on the ring]
Murmur3Hash("maasg") = 451

[Figure: token ring; the maasg row is stored on the node whose token segment contains 451]

[Figure: token ring holding the maasg row; the peter_v row (meetup tonight!!!) is hashed next]
Murmur3Hash("peter_v") = 42

[Figure: token ring; the maasg row (1441709070, working on my presentation) and the peter_v row sit on their nodes]

[Figure: token ring; the tweet (1441719110, almost ready) joins the existing maasg row]
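Token ownership and replication on the simplified 0 to 1000 ring can be sketched in plain Scala. Assumptions: five nodes sit at tokens 200, 400, 600, 800 and 1000, each node owns the tokens up to and including its own, and replicas go to the next nodes clockwise. The hash values 451 and 42 are the ones shown on the slides; real Murmur3 tokens span the full 64-bit range. TokenRingSketch and its members are illustrative names.

```scala
object TokenRingSketch {
  // Assumed node positions on the simplified ring from the slides.
  val nodeTokens: Vector[Int] = Vector(200, 400, 600, 800, 1000)

  // Index of the owning node: the first node token >= the row token,
  // wrapping around to the first node if none is found.
  def ownerIndex(token: Int): Int = {
    val i = nodeTokens.indexWhere(_ >= token)
    if (i >= 0) i else 0
  }

  // The owner plus the next (n - 1) nodes clockwise.
  def replicas(token: Int, n: Int): Seq[Int] = {
    val start = ownerIndex(token)
    (0 until math.min(n, nodeTokens.size))
      .map(k => nodeTokens((start + k) % nodeTokens.size))
  }

  def main(args: Array[String]): Unit = {
    println("maasg   -> nodes " + replicas(451, 2).mkString(", "))
    println("peter_v -> nodes " + replicas(42, 2).mkString(", "))
  }
}
```

With a replication factor of 2, the maasg row (token 451) lands on the nodes at 600 and 800, and the peter_v row (token 42) on the nodes at 200 and 400.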

Apache Spark + Apache Cassandra
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector

“This library lets you expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark
applications.”

Developer earnings vs tech skills

[Figure: an RDD shown next to the Cassandra token ring (0/1000, 200, 400, 600, 800)]

[Figure: RDD partitions mapped onto the Cassandra token ring]
cassandraTable, joinWithCassandraTable, repartitionByCassandraReplica
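A hedged sketch of how these three connector calls fit together, assuming the spark-cassandra-connector is on the classpath, a Cassandra node is reachable at 127.0.0.1, and the meetup.tweets table from the earlier slides exists. The object name, app name, host and the Handle case class are assumptions made for illustration.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object ConnectorSketch {
  // Field name matches the partition key column of meetup.tweets.
  case class Handle(handle: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("connector-sketch")
      .setMaster("local[2]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed address
    val sc = new SparkContext(conf)
    try {
      // cassandraTable: expose the whole table as an RDD of CassandraRow.
      val tweets = sc.cassandraTable("meetup", "tweets")
      val tweetsPerHandle = tweets
        .map(row => (row.getString("handle"), 1))
        .reduceByKey(_ + _)
      tweetsPerHandle.collect().foreach(println)

      // Fetch selected partitions only: group the keys by the Cassandra
      // replica that holds them, then join against the table.
      val handles = sc.parallelize(Seq(Handle("maasg"), Handle("peter_v")))
      val joined = handles
        .repartitionByCassandraReplica("meetup", "tweets")
        .joinWithCassandraTable("meetup", "tweets")
      joined.collect().foreach { case (Handle(handle), row) =>
        println(s"$handle: ${row.getString("txt")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The join reads only the requested partitions instead of scanning the table. The locality benefit of repartitionByCassandraReplica shows up when Spark executors run on the same machines as the Cassandra nodes.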

Spark Notebook Software: https://github.com/andypetrella/spark-notebook
Meetup Notebooks: https://github.com/maasg/spark-notebooks

Resources
Project website: http://spark.apache.org/
Spark presentations: http://spark-summit.org/2015
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark
More Advanced Questions: user@spark.apache.org
Source Code: https://github.com/apache/spark
Getting involved: http://spark.apache.org/community.html

Resources
Project website: http://cassandra.apache.org/
Community Site: www.planetcassandra.org
Questions: http://stackoverflow.com/questions/tagged/cassandra
Training: https://academy.datastax.com/
Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
Excellent deep dive into the data locality implementation:
http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1

Resources
Spark-Notebook: https://github.com/andypetrella/spark-notebook
Meetup code: https://github.com/maasg/spark-notebooks
Slides (soon): http://www.virdata.com/category/tech/

Acknowledgments

Want to work with this exciting tech?
We are hiring!
