Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis

#CassandraSummit 2014
A Journey
● Solving a problem for a specific use case
● Implementation
● Example Code

Person API

Goal: Analytics on Cassandra Data
● How many profile types?
● How many profiles have social data and what type? (facebook, twitter, etc)
● How many total social profiles of each type?
● Whatever!

Key Factors
● Netflix Priam for Backups (Snapshots, Compressed)
● Size-Tiered Compaction (SSTables 200 GB+)
● Compression enabled (SnappyCompressor)
● AWS

Where we started

Limiting Factors
● 3-10 days total processing time
● $2700+ in AWS resources
● Ad-Hoc analytics (not really!)
● Engineering time!

Moving Forward
● Querying Cassandra directly didn’t scale for MapReduce.
● Cassandra SSTables. Could we consume them directly?
● SSTables would need to be directly available (HDFS).
● SSTables would need to be available as MapReduce input.
● Did something already exist to do this?

Netflix Aegisthus
● We already use Netflix Priam for Cassandra backups
● Aegisthus works great for the Netflix use case (C* 1.0, no compression).
● At the time there was an experimental C* 1.2 branch.
● Aegisthus splits only when compression is not enabled.
● Single thread processing 200 GB+ SSTables.

KassandraMRHelper
● Support for C* 1.2!
● We got the job done with KassandraMRHelper
● Copies SSTables to the local file system in order to leverage existing C* I/O libraries.
● InputFormat not splittable.
● Single thread processing 200 GB+ SSTables.
● 60+ hours to process

Existing Solutions

Implementing a Splittable InputFormat
● We needed splittable SSTables to make this work.
● With compression enabled this is more difficult.
● Cassandra I/O code makes the compression seamless but doesn’t support HDFS.
● Need a way to define the splits.

Our Approach
● Leverage the SSTable metadata.
● Adapt Cassandra I/O libraries to HDFS.
● Leverage the SSTable Index to define splits. IndexIndex!
● Implement an InputFormat which leverages the IndexIndex to define splits.
● Similar to Hadoop LZO implementation.
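The idea of defining splits from the SSTable Index can be sketched as follows. This is a minimal illustration, not the project's actual code: the class `IndexIndexSplitSketch`, the `Split` type, and `computeSplits` are hypothetical names, and the sketch assumes the row offsets have already been read from the Index file in ascending order. It groups row offsets into byte ranges of roughly a target size so that every split starts on a row boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexIndexSplitSketch {

    /** Hypothetical value type: a byte range [start, end) of the uncompressed data. */
    public static final class Split {
        public final long start;
        public final long end;

        public Split(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Groups ascending row offsets into splits of at least targetBytes each
     * (the last split may be smaller). Every split begins on a row boundary.
     */
    public static List<Split> computeSplits(long[] rowOffsets, long dataLength, long targetBytes) {
        List<Split> splits = new ArrayList<>();
        if (rowOffsets.length == 0) {
            return splits;
        }
        long start = rowOffsets[0];
        for (int i = 1; i < rowOffsets.length; i++) {
            // Close the current split at the first row boundary past the target size.
            if (rowOffsets[i] - start >= targetBytes) {
                splits.add(new Split(start, rowOffsets[i]));
                start = rowOffsets[i];
            }
        }
        // Whatever remains up to the end of the data forms the final split.
        splits.add(new Split(start, dataLength));
        return splits;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 40, 90, 130, 210, 260};
        for (Split s : computeSplits(offsets, 300, 100)) {
            System.out.println(s);
        }
    }
}
```

Precomputing these boundaries once and storing them next to the SSTable would let an InputFormat read the split list without scanning the full Index file at job submission time, which is presumably the role the IndexIndex plays.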

Cassandra SSTables
● Data file: contains the actual SSTable data, a binary format of key/value row data.
● Index file: contains an index into the data file for each row key.
● CompressionInfo file: contains an index into the data file for each compressed block. This file is available when compression has been enabled for a Cassandra column family.
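To make the role of the CompressionInfo file concrete, here is a small sketch of how a position in the uncompressed data might be mapped to a compressed block. It assumes the data is compressed in fixed-size uncompressed chunks and that the file yields one compressed offset per chunk; the class and method names are hypothetical and are not Cassandra's API.

```java
public class ChunkLookupSketch {

    /** Index of the chunk that holds the given uncompressed position. */
    public static int chunkIndexFor(long uncompressedPosition, int chunkLength) {
        return (int) (uncompressedPosition / chunkLength);
    }

    /**
     * Offset in the compressed data file where the chunk holding the given
     * uncompressed position begins. chunkOffsets is assumed to hold one
     * compressed offset per chunk, in order.
     */
    public static long compressedOffsetFor(long uncompressedPosition,
                                           int chunkLength,
                                           long[] chunkOffsets) {
        int index = chunkIndexFor(uncompressedPosition, chunkLength);
        if (uncompressedPosition < 0 || index >= chunkOffsets.length) {
            throw new IllegalArgumentException(
                    "Position " + uncompressedPosition + " is outside the data file");
        }
        return chunkOffsets[index];
    }

    public static void main(String[] args) {
        long[] chunkOffsets = {0L, 30000L, 61000L}; // made-up example offsets
        System.out.println(compressedOffsetFor(70000L, 65536, chunkOffsets));
    }
}
```

A reader that seeks to a row offset taken from the Index file would use a lookup of this kind to find which compressed block to fetch and decompress first.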

Cassandra I/O for HDFS
● Cassandra’s I/O allows for random access of the SSTable.
● Porting this code to HDFS allowed us to read the SSTable in the same fashion directly within MapReduce.

The IndexIndex

Original Solution

Final Solution

Results
Approach                                 Processing time   AWS cost
Reading via live queries to Cassandra    3-10 days         $2700+
Unsplittable SSTable input format        60 hours          $350+
Splittable SSTable input format          10 hours          $165+

Example

Mapper
AbstractType keyType = CompositeType.getInstance(
        Lists.<AbstractType<?>>newArrayList(UTF8Type.instance, UTF8Type.instance));

protected void map(ByteBuffer key, SSTableIdentityIterator value, Context context)
        throws IOException, InterruptedException {
    final ByteBuffer newBuffer = key.slice();
    final Text mapKey = new Text(keyType.getString(newBuffer));
    Text mapValue = jsonColumnParser.getJson(value, context);
    if (mapValue == null) {
        return;
    }
    context.write(mapKey, mapValue);
}

Reducer
protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Make things super simple and output the first value only.
    // In reality we'd want to figure out which was the most correct value
    // of the ones we have based on our C* cluster configuration.
    context.write(key, new Text(values.iterator().next().toString()));
}

Job Configuration
job.setMapperClass(SimpleExampleMapper.class);
job.setReducerClass(SimpleExampleReducer.class);
...
job.setInputFormatClass(SSTableRowInputFormat.class);
...
SSTableInputFormat.addInputPaths(job, inputPaths);
...
FileOutputFormat.setOutputPath(job, new Path(outputPath));

Running the indexer
hadoop jar hadoop-sstable-0.1.2.jar com.fullcontact.sstable.index.SSTableIndexIndexer [SSTABLE_ROOT]

Running the job
hadoop jar hadoop-sstable-0.1.2.jar com.fullcontact.sstable.example.SimpleExample \
    -D hadoop.sstable.cql="CREATE TABLE ..." \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.job.reuse.jvm.num.tasks=1 \
    -D io.sort.mb=1000 \
    -D io.sort.factor=100 \
    -D mapred.reduce.tasks=512 \
    -D hadoop.sstable.split.mb=1024 \
    -D mapred.child.java.opts="-Xmx2G -XX:MaxPermSize=256m" \
    [SSTABLE_ROOT] [OUTPUT_PATH]

Example Summary
1. Write SSTable reader MapReduce jobs
2. Run the SSTable Indexer
3. Run SSTable reader MapReduce jobs

Goal Accomplished
● 96% decrease in processing times!
● 94% decrease in resource costs!
● Reduced Engineering time!
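The percentages above can be checked against the Results figures. The sketch below assumes the baseline is the upper bound of the original range (10 days, i.e. 240 hours) and the lower-bound costs quoted ($2700 and $165); the slides do not state which bounds were used, so treat these inputs as an assumption. The class and method names are illustrative only.

```java
public class DecreaseSketch {

    /** Percentage decrease going from 'before' to 'after'. */
    public static double percentDecrease(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Processing time: 10 days (240 hours) down to 10 hours.
        System.out.println(Math.round(percentDecrease(240.0, 10.0)) + "% time decrease");
        // AWS cost: $2700 down to $165.
        System.out.println(Math.round(percentDecrease(2700.0, 165.0)) + "% cost decrease");
    }
}
```

Under those assumptions the rounded results are 96 and 94, matching the figures quoted on the slide.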

Open Source Project
Open Source @ https://github.com/fullcontact/hadoop-sstable
Roadmap:
● Cassandra 2.1 support
● Scalding support
