AN EFFICIENT DATA MINING SOLUTION
Hadoop?
Cassandra?
Spark?
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell

#StratioB
Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?
Cassandra
• Based on Dynamo…
• Replication, key/value, P2P
• And based on Bigtable…
• Column oriented

ROBUST
FAST
EFFICIENT
NO BOTTLENECK
DECENTRALIZED
REPLICATED
Another database?
Why?
Case A
One user – lots of data

Case B
Many users – little data

Case C
Many users – lots of data

Crawler app
100M indexed pages
3k reads

Cassandra, I choose you

Query time < 1s
But…
Marketing walks in
New query
“I need to find all the references to the domain ACME. I need the answer by Friday.”

Problem
Cassandra is not well suited to this type of query.
You need to design the schema with the query in mind.
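
To make the point concrete, here is a minimal in-memory sketch in plain Python (not Cassandra's API). It models a table keyed by page URL; the `pages` dict, its `links` field and both function names are hypothetical illustration data. A lookup by key touches one row, while the marketing query has no key to use and has to scan every row.

```python
from urllib.parse import urlparse

# Toy model of a table partitioned by page URL (hypothetical data, not real Cassandra code).
pages = {
    "http://a.example/1": {"links": ["http://acme.example/home", "http://b.example/x"]},
    "http://b.example/x": {"links": ["http://c.example/"]},
    "http://c.example/": {"links": ["http://acme.example/shop"]},
}

def lookup(url):
    """The query the schema was designed for: one key, one row."""
    return pages[url]

def pages_referencing(domain):
    """The new query: no key matches it, so every row must be scanned."""
    hits = []
    for url, row in pages.items():
        if any(urlparse(link).hostname == domain for link in row["links"]):
            hits.append(url)
    return sorted(hits)
```

With 100M indexed pages, the second function is the one that turns a sub-second lookup into a full pass over the data.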

Challenge
Accepted
What options do we have?

• Run Hive queries on top of C*
• Write an ETL script and load the data into another DB
• Clone the cluster
And now… what can we do?

“We can't solve problems by using the same kind of thinking
we used when we created them”

Albert Einstein

Spark
• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created by the UC Berkeley AMPLab in 2010
• May be up to 100 times faster than MapReduce for:
    • Iterative algorithms
    • Interactive data mining
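
Why iterative algorithms benefit can be sketched in plain Python (this is not Spark's API). The training set below is hypothetical and is loaded once, then reused on every pass. Each pass is a map (per-point gradient) followed by a reduce (sum), which is the shape of job that gains from keeping data in memory instead of re-reading it from disk on every iteration.

```python
import math

# Hypothetical 1-D training set of (feature, label) pairs, loaded once and reused.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, iterations=100, lr=0.5):
    """Logistic regression by gradient descent over an in-memory dataset."""
    w = 0.0
    for _ in range(iterations):
        # Map: gradient contribution of each point. Reduce: sum them.
        grad = sum((sigmoid(w * x) - y) * x for x, y in points)
        w -= lr * grad
    return w

def predict(w, x):
    return 1 if sigmoid(w * x) >= 0.5 else 0

w = train(data)
```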
Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/

WHO USES SPARK?
Spark and Cassandra
Integration points

Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.

Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
    • Uses Cassandra’s new CqlPagingInputFormat

CQL3 Integration
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity
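
What "respects data locality" means can be sketched as follows. This is plain Python with hypothetical names (`split_replicas`, `worker_hosts`, `place`), not the integration's actual scheduling code: each input split is given to a worker running on a host that already stores a replica of that token range, and only falls back to a remote read when no such worker exists.

```python
# Hypothetical splits: token range -> hosts storing a replica of that range.
split_replicas = {
    "range-1": ["node-a", "node-b"],
    "range-2": ["node-b", "node-c"],
    "range-3": ["node-d"],
}

# Hypothetical hosts that also run a compute worker.
worker_hosts = ["node-a", "node-c"]

def place(replica_hosts, workers):
    """Prefer a worker on a replica host; otherwise fall back to any worker."""
    for host in replica_hosts:
        if host in workers:
            return host, True    # local read
    return workers[0], False     # remote read over the network

placement = {name: place(hosts, worker_hosts) for name, hosts in split_replicas.items()}
```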

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map column families to custom serializable POJOs.
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
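
The mapping idea can be illustrated with a Python analogue. The deck's API is Java, and this is not StratioDeep's actual API; `Page`, its fields and `from_row` are hypothetical names. Each row of a column family becomes one typed object, and the computation is then written against those objects rather than against raw columns.

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Hypothetical entity: one instance per row of a 'pages' column family."""
    url: str
    domain: str
    num_links: int

    @classmethod
    def from_row(cls, row):
        # Map raw column name/value pairs onto typed fields.
        return cls(url=row["url"], domain=row["domain"], num_links=int(row["num_links"]))

# Hypothetical raw rows, as a driver might return them.
rows = [
    {"url": "http://acme.example/a", "domain": "acme.example", "num_links": "3"},
    {"url": "http://other.example/", "domain": "other.example", "num_links": "5"},
]

entities = [Page.from_row(r) for r in rows]

# The computation works on typed objects instead of raw columns.
links_per_domain = {}
for page in entities:
    links_per_domain[page.domain] = links_per_domain.get(page.domain, 0) + page.num_links
```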

Demo
CQL3 Integration (III)
Drawbacks:
• Still not performing as well as we’d like: it uses Cassandra’s Hadoop Interface
• No analyst-friendly interface: no SQL-like query features
Future extensions
What are we currently working on?
Bring the integration to another level:
• Drop Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
Conclusion

THANKS

An efficient data mining solution by integrating Spark and Cassandra