SlideShare a Scribd company logo
#CassandraSummit 2014 
A Journey 
● Solving a problem for a specific use case 
● Implementation 
● Example Code
#CassandraSummit 2014 
Person API
#CassandraSummit 2014 
Goal: Analytics on Cassandra Data 
● How many profile types? 
● How many profiles have social data and what type? (facebook, twitter, etc) 
● How many total social profiles of each type? 
● Whatever!
#CassandraSummit 2014 
Key Factors 
● Netflix Priam for Backups (Snapshots, Compressed) 
● Size-Tiered Compaction (SSTables 200 GB+) 
● Compression enabled (SnappyCompressor) 
● AWS
#CassandraSummit 2014 
Where we started
#CassandraSummit 2014 
Limiting Factors 
● 3-10 days total processing time 
● $2700+ in AWS resources 
● Ad-Hoc analytics (not really!) 
● Engineering time!
#CassandraSummit 2014 
Moving Forward 
● Querying Cassandra directly didn’t scale for MapReduce. 
● Cassandra SSTables. Could we consume them directly? 
● SSTables would need to be directly available (HDFS). 
● SSTables would need to be available as MapReduce input. 
● Did something already exist to do this?
#CassandraSummit 2014 
Netflix Aegisthus 
● We already use Netflix Priam for Cassandra backups 
● Aegisthus works great for the Netflix use case: (C* 1.0, No compression) 
● At the time there was an experimental C* 1.2 branch. 
● Aegisthus splits only when compression is not enabled. 
● Single thread processing 200 GB+ SSTables.
#CassandraSummit 2014 
KassandraMRHelper 
● Support for C* 1.2! 
● We got the job done with KassandraMRHelper 
● Copies SSTables to local file system in order to leverage existing C* I/O 
libraries. 
● InputFormat not splittable. 
● Single thread processing 200 GB+ SSTables. 
● 60+ hours to process
#CassandraSummit 2014 
Existing Solutions
#CassandraSummit 2014 
Implementing a Splittable InputFormat 
● We needed splittable SSTables to make this work. 
● With compression enabled this is more difficult. 
● Cassandra I/O code makes the compression seamless but doesn’t support 
HDFS. 
● Need a way to define the splits.
#CassandraSummit 2014 
Our Approach 
● Leverage the SSTable metadata. 
● Adapt Cassandra I/O libraries to HDFS. 
● Leverage the SSTable Index to define splits. IndexIndex! 
● Implement an InputFormat which leverages the IndexIndex to define splits. 
● Similar to Hadoop LZO implementation.
#CassandraSummit 2014 
Cassandra SSTables 
Data file: This file contains the actual SSTable data. A binary format of key/ 
value row data. 
Index file: This file contains an index into the data file for each row key. 
CompressionInfo file: This file contains an index into the data file for each 
compressed block. This file is available when compression has been enabled 
for a Cassandra column family.
#CassandraSummit 2014 
Cassandra I/O for HDFS 
● Cassandra’s I/O allows for random access of the SSTable. 
● Porting this code to HDFS allowed us to read the SSTable in the same 
fashion directly within MapReduce.
#CassandraSummit 2014 
The IndexIndex
#CassandraSummit 2014 
Original Solution
#CassandraSummit 2014 
Final Solution
#CassandraSummit 2014 
Results 
Reading via live queries to Cassandra 3-10 days $2700+ 
Unsplittable SSTable input format 60 hours $350+ 
Splittable SSTable input format 10 hours $165+
#CassandraSummit 2014 
Example
#CassandraSummit 2014 
Mapper 
AbstractType 
keyType 
= 
CompositeType.getInstance(Lists.<AbstractType<?>>newArrayList(UTF8Type.instance, 
UTF8Type.instance)); 
protected 
void 
map(ByteBuffer 
key, 
SSTableIdentityIterator 
value, 
Context 
context) 
throws 
IOException, 
InterruptedException 
{ 
final 
ByteBuffer 
newBuffer 
= 
key.slice(); 
final 
Text 
mapKey 
= 
new 
Text(keyType.getString(newBuffer)); 
Text 
mapValue 
= 
jsonColumnParser.getJson(value, 
context); 
if 
(mapValue 
== 
null) 
{ 
return; 
} 
context.write(mapKey, 
mapValue); 
}
#CassandraSummit 2014 
Reducer 
protected 
void 
reduce(Text 
key, 
Iterable<Text> 
values, 
Context 
context) 
throws 
IOException, 
InterruptedException 
{ 
// 
Make 
things 
super 
simple 
and 
output 
the 
first 
value 
only. 
// 
In 
reality 
we'd 
want 
to 
figure 
out 
which 
was 
the 
// 
most 
correct 
value 
of 
the 
ones 
we 
have 
based 
on 
our 
C* 
cluster 
configuration. 
context.write(key, 
new 
Text(values.iterator().next().toString())); 
}
#CassandraSummit 2014 
Job Configuration 
job.setMapperClass(SimpleExampleMapper.class); 
job.setReducerClass(SimpleExampleReducer.class); 
... 
job.setInputFormatClass(SSTableRowInputFormat.class); 
... 
SSTableInputFormat.addInputPaths(job, 
inputPaths); 
... 
FileOutputFormat.setOutputPath(job, 
new 
Path(outputPath));
#CassandraSummit 2014 
Running the indexer 
hadoop 
jar 
hadoop-­‐sstable-­‐0.1.2.jar 
com.fullcontact.sstable.index.SSTableIndexIndexer 
[SSTABLE_ROOT]
#CassandraSummit 2014 
Running the job 
hadoop 
jar 
hadoop-­‐sstable-­‐0.1.2.jar 
com.fullcontact.sstable.example.SimpleExample 
 
-­‐D 
hadoop.sstable.cql="CREATE 
TABLE 
..." 
 
-­‐D 
mapred.map.tasks.speculative.execution=false 
 
-­‐D 
mapred.job.reuse.jvm.num.tasks=1 
 
-­‐D 
io.sort.mb=1000 
 
-­‐D 
io.sort.factor=100 
 
-­‐D 
mapred.reduce.tasks=512 
 
-­‐D 
hadoop.sstable.split.mb=1024 
 
-­‐D 
mapred.child.java.opts="-­‐Xmx2G 
-­‐XX:MaxPermSize=256m" 
[SSTABLE_ROOT] 
[OUTPUT_PATH]
#CassandraSummit 2014 
Example Summary 
1. Write SSTable reader MapReduce jobs 
2. Run the SSTable Indexer 
3. Run SSTable reader MapReduce jobs
#CassandraSummit 2014 
Goal Accomplished 
● 96% decrease in processing times! 
● 94% decrease in resource costs! 
● Reduced Engineering time!
#CassandraSummit 2014 
Open Source Project 
Open Source @ https://github.com/fullcontact/hadoop-sstable 
Roadmap: 
● Cassandra 2.1 support 
● Scalding support

More Related Content

What's hot

memcached Distributed Cache
memcached Distributed Cachememcached Distributed Cache
memcached Distributed Cache
Aniruddha Chakrabarti
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning Cassandra
Dave Gardner
 
SSTable Reader Cassandra Day Denver 2014
SSTable Reader Cassandra Day Denver 2014SSTable Reader Cassandra Day Denver 2014
SSTable Reader Cassandra Day Denver 2014
Ben Vanberg
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
Shiao-An Yuan
 
Node.js and Cassandra
Node.js and CassandraNode.js and Cassandra
Node.js and Cassandra
Stratio
 
Cassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE SearchCassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE Search
Caleb Rackliffe
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
DataStax Academy
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
Eric Evans
 
Cassandra Community Webinar: Apache Cassandra Internals
Cassandra Community Webinar: Apache Cassandra InternalsCassandra Community Webinar: Apache Cassandra Internals
Cassandra Community Webinar: Apache Cassandra Internals
DataStax
 
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on DockerRunning High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Sematext Group, Inc.
 
MongoDB's New Aggregation framework
MongoDB's New Aggregation frameworkMongoDB's New Aggregation framework
MongoDB's New Aggregation framework
Chris Westin
 
The Automation Factory
The Automation FactoryThe Automation Factory
The Automation Factory
Nathan Milford
 
Cassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL MeetupCassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL Meetup
Michael Wynholds
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
DataStax
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
Pierre Baillet
 
Background Tasks in Node - Evan Tahler, TaskRabbit
Background Tasks in Node - Evan Tahler, TaskRabbitBackground Tasks in Node - Evan Tahler, TaskRabbit
Background Tasks in Node - Evan Tahler, TaskRabbit
Redis Labs
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
DataStax Academy
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
Rick Branson
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Maarten Smeets
 

What's hot (20)

memcached Distributed Cache
memcached Distributed Cachememcached Distributed Cache
memcached Distributed Cache
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning Cassandra
 
SSTable Reader Cassandra Day Denver 2014
SSTable Reader Cassandra Day Denver 2014SSTable Reader Cassandra Day Denver 2014
SSTable Reader Cassandra Day Denver 2014
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
Node.js and Cassandra
Node.js and CassandraNode.js and Cassandra
Node.js and Cassandra
 
Cassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE SearchCassandra Summit 2015: Intro to DSE Search
Cassandra Summit 2015: Intro to DSE Search
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Cassandra Community Webinar: Apache Cassandra Internals
Cassandra Community Webinar: Apache Cassandra InternalsCassandra Community Webinar: Apache Cassandra Internals
Cassandra Community Webinar: Apache Cassandra Internals
 
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on DockerRunning High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
 
MongoDB's New Aggregation framework
MongoDB's New Aggregation frameworkMongoDB's New Aggregation framework
MongoDB's New Aggregation framework
 
The Automation Factory
The Automation FactoryThe Automation Factory
The Automation Factory
 
Cassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL MeetupCassandra and Rails at LA NoSQL Meetup
Cassandra and Rails at LA NoSQL Meetup
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
 
Background Tasks in Node - Evan Tahler, TaskRabbit
Background Tasks in Node - Evan Tahler, TaskRabbitBackground Tasks in Node - Evan Tahler, TaskRabbit
Background Tasks in Node - Evan Tahler, TaskRabbit
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 

Viewers also liked

Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
DataStax Academy
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
DataStax
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1
jbellis
 
Cassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best FriendsCassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best Friends
DataStax Academy
 
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
DataStax Academy
 
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache CassandraCassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
DataStax Academy
 
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
DataStax Academy
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
DataStax Academy
 
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
DataStax Academy
 
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern JeopardyCassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
DataStax Academy
 
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat VideosCassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
DataStax Academy
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache Cassandra
DataStax Academy
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
DataStax Academy
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
DataStax Academy
 
Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
Brian Hess
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
Jon Haddad
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandra
Jon Haddad
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 Awesomeness
Jon Haddad
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
Jon Haddad
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
Jon Haddad
 

Viewers also liked (20)

Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1
 
Cassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best FriendsCassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best Friends
 
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
 
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache CassandraCassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
 
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
 
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Pr...
 
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern JeopardyCassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
 
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat VideosCassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache Cassandra
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
 
Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandra
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 Awesomeness
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 

Similar to Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis

Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018
Tom Grek
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
Lukas Vlcek
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
javier ramirez
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
Holden Karau
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Uwe Printz
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
prevota
 
Using Spring Data and MongoDB with Cloud Foundry
Using Spring Data and MongoDB with Cloud FoundryUsing Spring Data and MongoDB with Cloud Foundry
Using Spring Data and MongoDB with Cloud Foundry
Chris Harris
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
Joseph Adler
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Chris Baglieri
 

Similar to Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis (20)

Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Using Spring Data and MongoDB with Cloud Foundry
Using Spring Data and MongoDB with Cloud FoundryUsing Spring Data and MongoDB with Cloud Foundry
Using Spring Data and MongoDB with Cloud Foundry
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
DataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
DataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
DataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
DataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
DataStax Academy
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
DataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
DataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
DataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 

Recently uploaded (20)

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 

Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis

  • 1. #CassandraSummit 2014 A Journey ● Solving a problem for a specific use case ● Implementation ● Example Code
  • 3. #CassandraSummit 2014 Goal: Analytics on Cassandra Data ● How many profile types? ● How many profiles have social data and what type? (facebook, twitter, etc) ● How many total social profiles of each type? ● Whatever!
  • 4. #CassandraSummit 2014 Key Factors ● Netflix Priam for Backups (Snapshots, Compressed) ● Size-Tiered Compaction (SSTables 200 GB+) ● Compression enabled (SnappyCompressor) ● AWS
  • 6. #CassandraSummit 2014 Limiting Factors ● 3-10 days total processing time ● $2700+ in AWS resources ● Ad-Hoc analytics (not really!) ● Engineering time!
  • 7. #CassandraSummit 2014 Moving Forward ● Querying Cassandra directly didn’t scale for MapReduce. ● Cassandra SSTables. Could we consume them directly? ● SSTables would need to be directly available (HDFS). ● SSTables would need to be available as MapReduce input. ● Did something already exist to do this?
  • 8. #CassandraSummit 2014 Netflix Aegisthus ● We already use Netflix Priam for Cassandra backups ● Aegisthus works great for the Netflix use case: (C* 1.0, No compression) ● At the time there was an experimental C* 1.2 branch. ● Aegisthus splits only when compression is not enabled. ● Single thread processing 200 GB+ SSTables.
  • 9. #CassandraSummit 2014 KassandraMRHelper ● Support for C* 1.2! ● We got the job done with KassandraMRHelper ● Copies SSTables to local file system in order to leverage existing C* I/O libraries. ● InputFormat not splittable. ● Single thread processing 200 GB+ SSTables. ● 60+ hours to process
  • 11. #CassandraSummit 2014 Implementing a Splittable InputFormat ● We needed splittable SSTables to make this work. ● With compression enabled this is more difficult. ● Cassandra I/O code makes the compression seamless but doesn’t support HDFS. ● Need a way to define the splits.
  • 12. #CassandraSummit 2014 Our Approach ● Leverage the SSTable metadata. ● Adapt Cassandra I/O libraries to HDFS. ● Leverage the SSTable Index to define splits. IndexIndex! ● Implement an InputFormat which leverages the IndexIndex to define splits. ● Similar to Hadoop LZO implementation.
  • 13. #CassandraSummit 2014 Cassandra SSTables Data file: This file contains the actual SSTable data. A binary format of key/ value row data. Index file: This file contains an index into the data file for each row key. CompressionInfo file: This file contains an index into the data file for each compressed block. This file is available when compression has been enabled for a Cassandra column family.
  • 14. #CassandraSummit 2014 Cassandra I/O for HDFS ● Cassandra’s I/O allows for random access of the SSTable. ● Porting this code to HDFS allowed us to read the SSTable in the same fashion directly within MapReduce.
  • 18. #CassandraSummit 2014 Results Reading via live queries to Cassandra 3-10 days $2700+ Unsplittable SSTable input format 60 hours $350+ Splittable SSTable input format 10 hours $165+
  • 20. #CassandraSummit 2014 Mapper AbstractType keyType = CompositeType.getInstance(Lists.<AbstractType<?>>newArrayList(UTF8Type.instance, UTF8Type.instance)); protected void map(ByteBuffer key, SSTableIdentityIterator value, Context context) throws IOException, InterruptedException { final ByteBuffer newBuffer = key.slice(); final Text mapKey = new Text(keyType.getString(newBuffer)); Text mapValue = jsonColumnParser.getJson(value, context); if (mapValue == null) { return; } context.write(mapKey, mapValue); }
  • 21. #CassandraSummit 2014 Reducer protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException { // Make things super simple and output the first value only. // In reality we'd want to figure out which was the // most correct value of the ones we have based on our C* cluster configuration. context.write(key, new Text(values.iterator().next().toString())); }
  • 22. #CassandraSummit 2014 Job Configuration job.setMapperClass(SimpleExampleMapper.class); job.setReducerClass(SimpleExampleReducer.class); ... job.setInputFormatClass(SSTableRowInputFormat.class); ... SSTableInputFormat.addInputPaths(job, inputPaths); ... FileOutputFormat.setOutputPath(job, new Path(outputPath));
  • 23. #CassandraSummit 2014 Running the indexer hadoop jar hadoop-­‐sstable-­‐0.1.2.jar com.fullcontact.sstable.index.SSTableIndexIndexer [SSTABLE_ROOT]
  • 24. #CassandraSummit 2014 Running the job hadoop jar hadoop-­‐sstable-­‐0.1.2.jar com.fullcontact.sstable.example.SimpleExample -­‐D hadoop.sstable.cql="CREATE TABLE ..." -­‐D mapred.map.tasks.speculative.execution=false -­‐D mapred.job.reuse.jvm.num.tasks=1 -­‐D io.sort.mb=1000 -­‐D io.sort.factor=100 -­‐D mapred.reduce.tasks=512 -­‐D hadoop.sstable.split.mb=1024 -­‐D mapred.child.java.opts="-­‐Xmx2G -­‐XX:MaxPermSize=256m" [SSTABLE_ROOT] [OUTPUT_PATH]
  • 25. #CassandraSummit 2014 Example Summary 1. Write SSTable reader MapReduce jobs 2. Run the SSTable Indexer 3. Run SSTable reader MapReduce jobs
  • 26. #CassandraSummit 2014 Goal Accomplished ● 96% decrease in processing times! ● 94% decrease in resource costs! ● Reduced Engineering time!
  • 27. #CassandraSummit 2014 Open Source Project Open Source @ https://github.com/fullcontact/hadoop-sstable Roadmap: ● Cassandra 2.1 support ● Scalding support