Online Analytics with
Hadoop and Cassandra
      Robbie Strickland
who am i?


Robbie Strickland
Managing Architect, E3 Greentech

rostrickland@gmail.com
linkedin.com/in/robbiestrickland
github.com/rstrickland
in scope
โ—   offline vs online analytics
โ—   Cassandra overview
โ—   Hadoop-Cassandra integration
โ—   writing map/reduce against Cassandra
โ—   Scala map/reduce
โ—   Pig with Cassandra
โ—   questions
common hadoop usage pattern
โ— batch analysis of unstructured/semi-structured data
โ— analysis of transactional data requires moving it into
    HDFS or offline HBase cluster
โ—   use Sqoop, Flume, or some other ETL process to
    aggregate and pull in data from external sources
โ—   well-suited for high-latency offline analysis
the online use case
โ— need to analyze live data from a transactional DB
โ— too much data or too little time to efficiently move it
    offline before analysis
โ—   need to immediately feed results back into live system
โ—   reduce complexity/improve maintainability of M/R jobs
โ—   run Pig scripts against live data (i.e. data exploration)

Cassandra + Hadoop makes this possible!
what is cassandra?
โ—   open-source NoSQL column store
โ—   distributed hash table
โ—   tunable consistency
โ—   dynamic linear scalability
โ—   masterless, peer-to-peer architecture
โ—   highly fault tolerant
โ—   extremely low latency reads/writes
โ—   keeps active data set in RAM -- tunable
โ—   all writes are sequential
โ—   automatic replication
cassandra data model
โ—   keyspace roughly == database
โ—   column family roughly == table
โ—   each row has a key
โ—   each row is a collection of columns or supercolumns
โ—   a column is a key/value pair
โ—   a supercolumn is a collection of columns
โ—   column names/quantity do not have to be the same
    across rows in the same column family
โ—   keys & column names are often composite
โ—   supports secondary indexes with low cardinality
cassandra data model
keyspace
 standard_column_family
  key1
    column1 : value1
    column2 : value2
  key2
    column1 : value1
    column3 : value3
    column4 : value4

keyspace
 super_column_family
  key1
    supercol1
     column1 : value1
     column2 : value2
    supercol2
     column3 : value3
  key2
    supercol3
     column4 : value4
    supercol4
     column5 : value5
let's have a look ...
cassandra query model
โ— get by key(s)
โ— range slice - from key A to key B in sort order
    (partitioner choice matters here)
โ—   column slice - from column A to column B ordered
    according to specified comparator type
โ—   indexed slice - all rows that match a predicate
    (requires a secondary index to be in place for the
    column in the predicate)
โ—   "write it like you intend to read it"
โ—   Cassandra Query Language (CQL)
cassandra topology

[Diagram: a three-node ring]
● Node 1 -- start token 0
● Node 2 -- start token 2^127 / 3
● Node 3 -- start token 2 * 2^127 / 3
cassandra client access
โ— uses Thrift as the underlying protocol with a number of
    language bindings
โ—   CLI is useful for simple CRUD and schema changes
โ—   high-level clients available for most common
    languages -- this is the recommended approach
โ—   CQL supported by some clients
cassandra with hadoop
โ— supports input via ColumnFamilyInputFormat
โ— supports output via ColumnFamilyOutputFormat
โ— HDFS still used by Hadoop unless you use DataStax
    Enterprise version
โ—   supports Pig input/output via CassandraStorage
โ—   preserves data locality
โ—   use 2 "data centers" -- one for transactional and the
    other for analytics
โ—   replication between data centers is automatic and
    immediate
cassandra with hadoop - topology

[Diagram: two data centers joined by replication]

Data Center 1 - Transactional:
● Node 1 -- start token 0 (replica)
● Node 2 -- start token 2^127 / 3
● Node 3 -- start token 2 * 2^127 / 3 (replica)

Data Center 2 - Analytics:
● Node 4 -- start token N1 + 1; DataNode & TaskTracker, replica; also runs the Hadoop NameNode & JobTracker
● Node 5 -- start token N2 + 1; DataNode & TaskTracker
● Node 6 -- start token N3 + 1; DataNode & TaskTracker

Replication flows from Data Center 1 to Data Center 2.
configuring transactional nodes
โ— install Cassandra 1.1
โ— config properties are in cassandra.yaml (in
    /etc/cassandra or $CASSANDRA_HOME/conf) and in
    create keyspace options
โ—   placement strategy = NetworkTopologyStrategy
โ—   strategy options = DC1:2, DC2:1
โ—   endpoint snitch = PropertyFileSnitch -- put DC/rack
    specs in cassandra-topology.properties
โ—   designate at least one node as seed(s)
โ—   make sure listen & rpc addresses are correct
โ—   use nodetool -h <host> ring to verify everyone is talking
configuring analytics nodes
โ—   configure Cassandra using transactional instructions
โ—   insure there are no token collisions
โ—   install Hadoop 1.0.1+ TaskTracker & DataNode
โ—   add Cassandra jars & dependencies to Hadoop
    classpath -- either copy them to $HADOOP_HOME/lib
    or add the path in hadoop-env.sh
โ—   Cassandra jars are typically in /usr/share/cassandra,
    and dependencies in /usr/share/cassandra/lib if
    installed using a package manager
โ—   you may need to increase Cassandra RPC timeout
let's have a look ...
writing map/reduce
Assumptions:
โ— Hadoop 1.0.1+
โ— new Hadoop API (i.e. the one in org.apache.hadoop.
    mapreduce, not mapred)
โ—   Cassandra 1.1
job configuration
โ— use Hadoop job configuration API to set general
    Hadoop configuration parameters
โ—   use Cassandra ConfigHelper to set Cassandra-specific
    configuration parameters
job configuration - example
Hadoop config related to Cassandra:

job.setInputFormatClass(ColumnFamilyInputFormat.class)
job.setOutputFormatClass(ColumnFamilyOutputFormat.class)
job.setOutputKeyClass(ByteBuffer.class)
job.setOutputValueClass(List.class)
job configuration - example
Cassandra-specific config:

ConfigHelper.setInputRpcPort(conf, "9160")
ConfigHelper.setInputInitialAddress(conf, cassHost)
ConfigHelper.setInputPartitioner(conf, partitionerClassName)
ConfigHelper.setInputColumnFamily(conf, keyspace, columnFamily)
ConfigHelper.setInputSlicePredicate(conf, someQueryPredicate)
ConfigHelper.setOutputRpcPort(conf, "9160")
ConfigHelper.setOutputInitialAddress(conf, cassHost)
ConfigHelper.setOutputPartitioner(conf, partitionerClassName)
ConfigHelper.setOutputColumnFamily(conf, keyspace, columnFamily)
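
Pieced together, a complete job driver might look like this sketch -- the ConfigHelper calls are the ones shown above, but the class names, keyspace/column family names, address, and partitioner are illustrative assumptions (AvgMapper and AvgReducer are sketched on the next slides):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AverageJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "column-average");
    job.setJarByClass(AverageJob.class);
    job.setMapperClass(AvgMapper.class);    // sketched on the next slides
    job.setReducerClass(AvgReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);

    // Hadoop config related to Cassandra
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    job.setOutputKeyClass(ByteBuffer.class);
    job.setOutputValueClass(List.class);

    // Cassandra-specific config
    Configuration conf = job.getConfiguration();
    ConfigHelper.setInputRpcPort(conf, "9160");
    ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
    ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setInputColumnFamily(conf, "demo", "values");
    SlicePredicate predicate = new SlicePredicate().setSlice_range(
        new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,  // empty start/finish = all columns
                       ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000));
    ConfigHelper.setInputSlicePredicate(conf, predicate);
    ConfigHelper.setOutputRpcPort(conf, "9160");
    ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1");
    ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setOutputColumnFamily(conf, "demo", "averages");

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new AverageJob(), args));
  }
}
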
the mapper
โ— input key is a ByteBuffer
โ— input value is a SortedMap<ByteBuffer, IColumn>,
    where the ByteBuffer is the column name
โ—   use Cassandra ByteBufferUtil to read/write
    ByteBuffers
โ—   map output must be a Writable as in "standard" M/R
example mapper
Group integer values by column name:

for (IColumn col : columns.values())
   context.write(new Text(ByteBufferUtil.string(col.name())),
                 new LongWritable(ByteBufferUtil.toLong(col.value())));
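
In context, the whole mapper might look like this sketch (the class name is an illustrative assumption; the input types are those described on the previous slide):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;
import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgMapper
    extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

  @Override
  public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
      throws IOException, InterruptedException {
    // emit (column name, column value) for each column in this row
    for (IColumn col : columns.values())
      context.write(new Text(ByteBufferUtil.string(col.name())),
                    new LongWritable(ByteBufferUtil.toLong(col.value())));
  }
}
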
the reducer
โ— input is whatever you output from map
โ— output key is ByteBuffer == row key in Cassandra
โ— output value is List<Mutation> == the columns you want
    to change
โ—   uses Thrift API to define Mutations
example reducer
Compute average of each column's values:
long sum = 0;   // long, since the incoming values are LongWritables
int count = 0;
for (LongWritable val : values) {
   sum += val.get();
   count++;
}

Column c = new Column();
c.setName(ByteBufferUtil.bytes("Average"));
c.setValue(ByteBufferUtil.bytes(sum / count)); // integer average; cast to double for precision
c.setTimestamp(System.currentTimeMillis());

Mutation m = new Mutation();
m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
m.column_or_supercolumn.setColumn(c);
context.write(ByteBufferUtil.bytes(key.toString()), Collections.singletonList(m));
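
And the surrounding reducer signature, as a sketch (the class name is again an assumption; the output types match the slide above):

public class AvgReducer
    extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {

  @Override
  public void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    // body as shown above
  }
}
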
let's have a look ...
outputting to multiple CFs
โ— currently not supported out of the box
โ— patch available against 1.1 to allow this using Hadoop
    MultipleOutputs API
โ—   not necessary to patch production Cassandra, only the
    Cassandra jars on Hadoop's classpath
โ—   patch will not break existing reducers
โ—   to use:
    โ—‹ git clone Cassandra 1.1 source
    โ—‹ use git apply to apply patch
    โ—‹ build with ant
    โ—‹ put the new jars on Hadoop classpath
multiple output example
// in job configuration
ConfigHelper.setOutputKeyspace(conf, keyspace);

MultipleOutputs.addNamedOutput(job, columnFamily1,
   ColumnFamilyOutputFormat.class, ByteBuffer.class,
   List.class);

MultipleOutputs.addNamedOutput(job, columnFamily2,
   ColumnFamilyOutputFormat.class, ByteBuffer.class,
   List.class);
multiple output example
// in reducer
private MultipleOutputs _output;

public void setup(Context context) {
   _output = new MultipleOutputs(context);
}

public void cleanup(Context context) {
   _output.close();
}
multiple output example
// in reducer - writing to an output

Mutation m1 = ...
Mutation m2 = ...

_output.write(columnFamily1, ByteBufferUtil.bytes(key1), m1);
_output.write(columnFamily2, ByteBufferUtil.bytes(key2), m2);
let's have a look ...
map/reduce in scala
why? because this:

values.groupBy(_.name)
       .map { case (name, vals) => name -> vals.map(ByteBufferUtil.toInt).sum }
       .filter { case (name, sum) => sum > 100 }
       .foreach { case (name, sum) =>
          context.write(new Text(ByteBufferUtil.string(name)), new IntWritable(sum))
        }


... would require many more lines of code in Java and be less readable!
map/reduce in scala
... and because we can parallelize it by doing this:

values.par // make this collection parallel
      .groupBy(_.name)
       .map { case (name, vals) => name -> vals.map(ByteBufferUtil.toInt).sum }
       .filter { case (name, sum) => sum > 100 }
       .foreach { case (name, sum) =>
          context.write(new Text(ByteBufferUtil.string(name)), new IntWritable(sum))
        }


(!!!)
map/reduce in scala
โ— add Scala library to Hadoop classpath
โ— import scala.collection.JavaConversions._ to get Scala
    coolness with Java collections
โ—   main executable is an empty class with the
    implementation in a companion object
โ—   context parm in map & reduce methods must be:
    โ—‹   context: Mapper[ByteBuffer, SortedMap[ByteBuffer, IColumn],
        <MapKeyClass>, <MapValueClass>]#Context
    โ—‹   context: Reducer[<MapKeyClass>, <MapValueClass>, ByteBuffer,
        java.util.List[Mutation]]#Context
    โ—‹   it will compile but not function correctly without the type information
let's have a look ...
pig
โ—   allows ad hoc queries against Cassandra
โ—   schema applied at query time
โ—   can run in local mode or using Hadoop cluster
โ—   useful for data exploration, mass updates, back-
    populating data, etc.
โ—   cassandra supported via CassandraStorage
โ—   both LoadFunc and StoreFunc
โ—   now comes pre-built in Cassandra distribution
โ—   pygmalion adds some useful Cassandra UDFs --
    github.com/jeromatron/pygmalion
pig - getting set up
โ— download pig - better not to use package manager
โ— get the pig_cassandra script from Cassandra source
    (it's in examples/pig) -- or I put it on github with all the
    samples
โ—   set environment variables as described in README
    โ—‹   JAVA_HOME = location of your Java install
    โ—‹   PIG_HOME = location of your Pig install
    โ—‹   PIG_CONF_DIR = location of your Hadoop config files
    โ—‹   PIG_INITIAL_ADDRESS = address of one Cassandra analytics node
    โ—‹   PIG_RPC_PORT = Thrift port -- default 9160
    โ—‹   PIG_PARTITIONER = your Cassandra partitioner
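
For example, in the shell that launches pig_cassandra (every value here is made up; substitute your own paths and addresses):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export PIG_HOME=/opt/pig-0.9.2
export PIG_CONF_DIR=/etc/hadoop/conf
export PIG_INITIAL_ADDRESS=192.168.2.101
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
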
pig - loading data
Standard CF:
rows = LOAD 'cassandra://<keyspace>/<column_family>'
        USING CassandraStorage()
        AS (key: bytearray, cols: bag {col: tuple(name,value)});

Super CF:
rows = LOAD 'cassandra://<keyspace>/<column_family>'
       USING CassandraStorage()
       AS (key: bytearray, cols: bag {(supercol, subcols: bag {(name, value)})});
pig - writing data
STORE command:
STORE <relation_name>
INTO 'cassandra://<keyspace>/<column_family>'
USING CassandraStorage();

Standard CF schema:
(key, {(col_name, value), (col_name, value)});

Super CF schema:
(key, {supercol: {(col_name, value), (col_name, value)},
       supercol: {(col_name, value), (col_name, value)}});
let's have a look ...
questions?
check out my github for this presentation and code samples!



Robbie Strickland
Managing Architect, E3 Greentech

rostrickland@gmail.com
linkedin.com/in/robbiestrickland
github.com/rstrickland
