An efficient data mining solution by integrating Spark and Cassandra

Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell

#StratioB

Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?

Cassandra
• Based on Dynamo…
• Replication, Key/Value, P2P
• And based on Bigtable…
• Column oriented

NO BOTTLENECK

DECENTRALIZED

REPLICATED

Case A

One user – Lots of data

Case B

Many users – Little data

Case C

Many users – Lots of data

Crawler app
100M indexed pages
3k reads

Cassandra, I choose you

Query time: < 1s

New query
“I need to find all the references to the domain ACME. I need the answer by Friday.”

Problem
Cassandra is not well suited to resolving this type of query.
You need to design the schema with the query in mind.
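
The mismatch between the schema and the new query can be illustrated with a toy model. This is a minimal sketch in plain Python, not Cassandra code: the `pages` dictionary, the URLs, and both function names are hypothetical. It assumes the crawler table is keyed by page URL, so reading one page is a single key lookup, while "find every page that references a domain" has no key to use and must inspect every row.

```python
from urllib.parse import urlparse

# Toy stand-in for a table keyed by page URL (hypothetical data).
pages = {
    "http://a.example/1": {"links": ["http://acme.example/home", "http://b.example/x"]},
    "http://b.example/x": {"links": ["http://a.example/1"]},
    "http://c.example/z": {"links": ["http://acme.example/about"]},
}

def read_page(url):
    """The query the schema was designed for: one lookup by primary key."""
    return pages[url]

def pages_referencing(domain):
    """The new query: no key helps, so every row has to be inspected."""
    return sorted(
        url
        for url, row in pages.items()
        if any(urlparse(link).hostname == domain for link in row["links"])
    )
```

With 100M rows the first function stays cheap, while the second touches the whole table, which is the kind of work a batch or cluster computing engine is better placed to do.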

What options do we have?
• Run Hive Query on top of C*
• Write an ETL script and load data into another DB
• Clone the cluster

And now… what can we do?

“We can't solve problems by using the same kind of thinking
we used when we created them”

Albert Einstein

Spark
• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created by UC Berkeley AMP Lab in 2010
• May be 100 times faster than MapReduce for:
  • Iterative algorithms
  • Interactive data mining

Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/
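
Logistic regression is a typical iterative workload: training makes many full passes over the same dataset. The sketch below is plain Python, not Spark code, and the dataset, function names, and hyperparameters are all hypothetical. It only illustrates the access pattern: the same `data` is read on every iteration, so a system that keeps it in memory between passes avoids reloading it each time.

```python
import math

# Hypothetical, tiny, linearly separable dataset: (feature, label).
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent for one-dimensional logistic regression.

    Every iteration makes a full pass over `data`. Here `data` is an
    in-memory list; a system that reloads its input from disk on each
    pass pays that cost `iterations` times.
    """
    w, b = 0.0, 0.0
    for _ in range(iterations):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data)
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```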

Spark and Cassandra
Integration points

Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.

Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
  Uses Cassandra’s new CqlPagingInputFormat
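
The general idea behind a paging reader can be sketched as follows. This is plain Python under stated assumptions, not the implementation of CqlPagingInputFormat: `fetch_page`, `scan`, and the in-memory `table` are hypothetical stand-ins. The point is that a large table is consumed as a sequence of bounded pages, each one resuming after the last key seen, so the reader never holds the whole table at once.

```python
def fetch_page(table, after_key, page_size):
    """Return up to `page_size` (key, value) rows with key > after_key."""
    keys = sorted(k for k in table if after_key is None or k > after_key)
    return [(k, table[k]) for k in keys[:page_size]]

def scan(table, page_size=2):
    """Yield every row, fetching one bounded page at a time."""
    last_key = None
    while True:
        page = fetch_page(table, last_key, page_size)
        if not page:
            return
        yield from page
        # Resume the next page right after the last key we returned.
        last_key = page[-1][0]

# Hypothetical table contents.
table = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
all_rows = list(scan(table, page_size=2))
```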

INTEGRATION POINTS: HDFS OVER CASSANDRA

CQL3 Integration
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity
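
"Respects data locality" means a piece of work is scheduled, where possible, on a machine that already stores the data it reads. A minimal sketch of that scheduling decision, in plain Python with a hypothetical cluster layout (the node names, range names, and `assign_splits` are illustrative, not part of any Stratio or Cassandra API):

```python
# Hypothetical layout: which nodes hold a replica of each token range.
replicas = {
    "range-0": ["node-a", "node-b"],
    "range-1": ["node-b", "node-c"],
    "range-2": ["node-c", "node-a"],
}
workers = ["node-a", "node-b", "node-c"]

def assign_splits(replicas, workers):
    """Assign each split to the least-loaded worker that holds a replica.

    Falls back to the least-loaded worker overall when no replica holder
    is among the workers, which means the data must cross the network.
    """
    load = {w: 0 for w in workers}
    assignment = {}
    for split in sorted(replicas):
        local = [w for w in replicas[split] if w in load]
        candidates = local or workers
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[split] = chosen
        load[chosen] += 1
    return assignment

assignment = assign_splits(replicas, workers)
```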

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
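
The entity-mapping idea can be illustrated without the real API. The sketch below is plain Python, not StratioDeep code: the `Page` class stands in for a user-defined POJO, and `to_entity`, the rows, and the per-domain count are all hypothetical. It shows the shape of the approach: raw rows are mapped to typed objects once, and the computation is then written against those objects rather than against raw columns.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass(frozen=True)
class Page:
    """Hypothetical entity standing in for a user-defined POJO."""
    url: str
    title: str

def to_entity(row):
    """Map one raw row (column name -> value) to a typed Page."""
    return Page(url=row["url"], title=row["title"])

# Hypothetical rows, shaped like what a column-family read might return.
rows = [
    {"url": "http://acme.example/home", "title": "Home"},
    {"url": "http://acme.example/about", "title": "About"},
    {"url": "http://other.example/", "title": "Other"},
]

entities = [to_entity(r) for r in rows]

# The computation is written against Page objects, not raw columns.
pages_per_domain = Counter(urlparse(p.url).hostname for p in entities)
```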


CQL3 Integration (III)
Drawbacks:
• Still not performing as well as we’d like
  • Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
  • No SQL-like query features


Future extensions
What are we currently working on?
Bring the integration to another level:
• Dump Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable(s) files.
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
