Escape From Hadoop: 
Spark One Liners for C* Ops 
Kurt Russell Spitzer 
DataStax
Thanks for listening! 
There is plenty more we can do with Spark but … 
Questions?
Getting started with Cassandra? 
DataStax Academy offers free online Cassandra training! 
Planet Cassandra has resources ...
Thanks for your Time and Come to C* Summit! 
SEPTEMBER 10-11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS...
Apache Cassandra and Spark, when combined, give powerful OLTP and OLAP functionality for your data. We'll walk through the basics of both of these platforms before diving into applications combining the two. Joins, changing a partition key, or importing data are usually difficult in Cassandra, but we'll see how to do these and other operations in a set of simple Spark shell one-liners!

  1. Escape From Hadoop: Spark One Liners for C* Ops. Kurt Russell Spitzer, DataStax
  2. Who am I?
     • Bioinformatics Ph.D. from UCSF
     • Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!
     • Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
       http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
     • Developing new ways to make sure that C* scales
  3. Why escape from Hadoop?
     HADOOP: many moving pieces, Map Reduce, single points of failure, lots of overhead.
     And there is a way out!
  4. Spark Provides a Simple and Efficient framework for Distributed Computations.
     Node Roles: 2 | In-Memory Caching: Yes! | Generic DAG Execution: Yes! | Great Abstraction For Datasets? RDD!
     [Diagram: a Spark Master coordinating Spark Workers, each running a Spark Executor over a Resilient Distributed Dataset]
  5. Spark is Compatible with HDFS, Parquet, CSVs, ….
  6. Spark is Compatible with HDFS, Parquet, CSVs, …. AND APACHE CASSANDRA
  7. Apache Cassandra is a Linearly Scaling and Fault Tolerant NoSQL Database.
     Linearly Scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput.
     http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
     Fault Tolerant: Nodes down != Database down. Datacenter down != Database down.
  8. Apache Cassandra Architecture is Very Simple.
     Node Roles: 1 | Replication: Tunable | Consistency: Tunable
     [Diagram: a Client talking to a ring of C* nodes]
  9. DataStax OSS Connector: Spark to Cassandra
     https://github.com/datastax/spark-cassandra-connector
     Mapping: a Cassandra Keyspace/Table is exposed in Spark as an RDD[CassandraRow], and written back from RDD[Tuples].
     Bundled and Supported with DSE 4.5!
 10. The Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*.
     Each Executor maintains a connection to the C* cluster, and the full token range is divided among them
     (Tokens 1-1000, Tokens 1001-2000, …): RDDs are read into different splits based on sets of tokens.
 11. Co-locate Spark and C* for Best Performance.
     Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
 12. Setting up C* and Spark
     DSE > 4.5.0: just start your nodes with: dse cassandra -k
     Apache Cassandra: follow the excellent guide by Al Tobey
     http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
 13. We need a Distributed System for Analytics and Batch Jobs. But it doesn't have to be complicated!
 14. Even count needs to be distributed.
     Ask me to write a Map Reduce for word count, I dare you.
     You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners in the Spark shell.
 15-17. Basics: Getting a Table and Counting

     CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
     use newyork;
     CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
     INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
     INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
     INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
     INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
     INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
     INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
     INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
     INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
     INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
     INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

     scala> sc.cassandraTable("newyork","presidentlocations").count
     res3: Long = 10
 18-21. Basics: take() and toArray

     scala> sc.cassandraTable("newyork","presidentlocations").take(1)

     res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

     scala> sc.cassandraTable("newyork","presidentlocations").toArray

     res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
         CassandraRow{time: 9, location: NYC},
         CassandraRow{time: 3, location: White House},
         …,
         CassandraRow{time: 6, location: Air Force 1})
 22-24. Basics: Getting Row Values out of a CassandraRow

     scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")

     res5: Int = 9

     Typed getters: get[Int], get[String], …, get[Any]. Got null? Use get[Option[Int]].
     http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
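The Option-returning getter pattern can be sketched outside the connector with a plain Scala Map standing in for a CassandraRow (the Map and the getOptInt helper here are hypothetical illustrations, not connector API):

```scala
// Hypothetical stand-in for a CassandraRow: a column-name -> value map.
val row: Map[String, Any] = Map("time" -> 9, "location" -> "NYC")

// Analogue of get[Option[Int]]: None when the column is missing (or not an Int).
def getOptInt(r: Map[String, Any], col: String): Option[Int] =
  r.get(col).collect { case i: Int => i }

println(getOptInt(row, "time"))    // Some(9)
println(getOptInt(row, "missing")) // None
```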
 25-30. Copy A Table
     Say we want to restructure our table or add a new column.

     CREATE TABLE characterlocations (
         time int,
         character text,
         location text,
         PRIMARY KEY (time,character)
     );

     sc.cassandraTable("newyork","presidentlocations")
         .map( row => (
             row.get[Int]("time"),
             "president",
             row.get[String]("location")
         )).saveToCassandra("newyork","characterlocations")

     cqlsh:newyork> SELECT * FROM characterlocations;

      time | character | location
     ------+-----------+-------------
         5 | president | Air Force 1
        10 | president |         NYC
      …
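The reshaping step itself is ordinary Scala: each row becomes a (time, character, location) tuple with the constant "president" injected. A minimal local sketch, with a hypothetical case class standing in for the C* rows:

```scala
// Hypothetical local stand-in for rows of presidentlocations.
case class PresidentRow(time: Int, location: String)

val rows = List(PresidentRow(1, "White House"), PresidentRow(5, "Air Force 1"))

// Same shape as the connector example: inject the constant "president" column.
val reshaped = rows.map(r => (r.time, "president", r.location))
println(reshaped) // List((1,president,White House), (5,president,Air Force 1))
```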
 31-36. Filter a Table
     What if we want to filter based on a non-clustering key column?

     scala> sc.cassandraTable("newyork","presidentlocations")
         .filter( _.get[Int]("time") > 7 )
         .toArray

     res9: Array[com.datastax.spark.connector.CassandraRow] =
     Array(
         CassandraRow{time: 9, location: NYC},
         CassandraRow{time: 10, location: NYC},
         CassandraRow{time: 8, location: NYC}
     )

     The underscore is an anonymous parameter: each CassandraRow is passed to the predicate, its time value is extracted with get[Int], and only rows with time > 7 are kept.
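The predicate is the same underscore-style filter you would write on any Scala collection; a local sketch with a hypothetical Row class:

```scala
// Hypothetical local stand-in for CassandraRow values.
case class Row(time: Int, location: String)
val rows = List(Row(9, "NYC"), Row(3, "White House"), Row(8, "NYC"))

// _.time > 7 expands to row => row.time > 7
val late = rows.filter(_.time > 7)
println(late.map(_.time)) // List(9, 8)
```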
 37-40. Backfill a Table with a Different Key!
     If we actually want to have quick access to timelines we need a C* table with a different structure.

     CREATE TABLE timelines (
       time int,
       character text,
       location text,
       PRIMARY KEY ((character), time)
     )

     sc.cassandraTable("newyork","characterlocations")
       .saveToCassandra("newyork","timelines")

     cqlsh:newyork> select * from timelines;

      character | time | location
     -----------+------+-------------
      president |    1 | White House
      president |    2 | White House
      president |    3 | White House
      president |    4 | White House
      president |    5 | Air Force 1
      president |    6 | Air Force 1
      president |    7 | Air Force 1
      president |    8 |         NYC
      president |    9 |         NYC
      president |   10 |         NYC
 41-46. Import a CSV
     I have some data in another source which I could really use in my Cassandra table.

     sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
       .map(_.split(","))
       .map( line =>
         (line(0),line(1),line(2)))
       .saveToCassandra("newyork","timelines")

     cqlsh:newyork> select * from timelines where character = 'plissken';

      character | time | location
     -----------+------+-----------------
       plissken |    1 | Federal Reserve
       plissken |    2 | Federal Reserve
       plissken |    3 | Federal Reserve
       plissken |    4 |           Court
       plissken |    5 |           Court
       plissken |    6 |           Court
       plissken |    7 |           Court
       plissken |    8 |  Stealth Glider
       plissken |    9 |             NYC
       plissken |   10 |             NYC
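The per-line parsing is just split plus positional indexing (the values stay Strings at this point; the cqlsh output on the slide shows them landing in the typed columns). A local sketch:

```scala
// One CSV line from the (hypothetical) PlisskenLocations.csv file.
val line = "plissken,1,Federal Reserve"
val cols = line.split(",")

// Same positional tuple as the one-liner builds before saveToCassandra.
val record = (cols(0), cols(1), cols(2))
println(record) // (plissken,1,Federal Reserve)
```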
 47. Perform a Join with MySQL
     Maybe a little more than one line …
     MySQL table "quotes" in "escape_from_ny":

     import java.sql._
     import org.apache.spark.rdd.JdbcRDD
     Class.forName("com.mysql.jdbc.Driver").newInstance(); // Connector/J added to Spark Shell classpath
     val quotes = new JdbcRDD(
       sc,
       () => {
         DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root")},
       "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
       0,
       100,
       5,
       (r: ResultSet) => {
         (r.getInt(2),r.getString(3))
       }
     )

     quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
 48-51. Perform a Join with MySQL
     Maybe a little more than one line …

     quotes.join(
       sc.cassandraTable("newyork","timelines")
       .filter( _.get[String]("character") == "plissken")
       .map( row => (row.get[Int]("time"),row.get[String]("location"))))
       .take(1)
       .foreach(println)

     (5,
       (Bob Hauk: There was an accident.
          About an hour ago, a small jet went down inside New York City.
          The President was on board.
        Snake Plissken: The president of what?,
       Court)
     )

     Both sides of the join need to be in the form of RDD[K,V]: the C* row (plissken, 5, Court) becomes (5, Court), which joins with the JdbcRDD's (5, 'Bob Hauk: …') to give (5, ('Bob Hauk: …', Court)).
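An RDD join pairs elements by key; on plain keyed sequences the same inner join can be sketched with a for-comprehension (the shortened quote text here is a hypothetical stand-in):

```scala
// (time, quote) from the hypothetical MySQL side, (time, location) from C*.
val quotes    = Seq((5, "Bob Hauk: The President was on board."))
val locations = Seq((5, "Court"), (8, "Stealth Glider"))

// Inner join on the Int key, mirroring quotes.join(locationsByTime).
val joined = for ((tq, q) <- quotes; (tl, l) <- locations if tq == tl)
  yield (tq, (q, l))
println(joined) // one (time, (quote, location)) pair
```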
52. Easy Objects with Case Classes
We have the technology to make this even easier!

case class timelineRow(character: String, time: Int, location: String)

sc.cassandraTable[timelineRow]("newyork", "timelines")
  .filter( _.character == "plissken" )
  .filter( _.time == 8 )
  .toArray

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))
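The typed RDD behaves like a collection of ordinary case-class instances, so the same filter chain can be exercised locally. A minimal sketch, with invented sample rows standing in for the contents of the timelines table:

```scala
// Mirrors the case class mapped onto the Cassandra table.
case class TimelineRow(character: String, time: Int, location: String)

// Invented sample rows; in the slide these come from
// sc.cassandraTable[timelineRow]("newyork", "timelines").
val rows = Seq(
  TimelineRow("plissken", 5, "Court"),
  TimelineRow("plissken", 8, "Stealth Glider"),
  TimelineRow("hauk",     8, "Police HQ")
)

// The same filter chain, on typed fields rather than get[String] lookups.
val result = rows
  .filter(_.character == "plissken")
  .filter(_.time == 8)

println(result)  // List(TimelineRow(plissken,8,Stealth Glider))
```

Because the connector maps columns to constructor parameters by name, filters compile against real fields instead of stringly-typed column lookups.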
58. A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork", "presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" ") )
  .map( (_, 1) )
  .reduceByKey( _ + _ )
  .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
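The same split / pair-with-1 / sum-per-key pipeline can be run on a local Scala collection, which makes the reduceByKey step easy to see. A sketch with invented location strings (groupBy plus a per-group sum plays the role of reduceByKey here):

```scala
// Invented values standing in for the "location" column read from Cassandra.
val locations = Seq("White House", "Air Force 1", "White House", "NYC")

val counts = locations
  .flatMap(_.split(" "))                          // one element per word
  .map((_, 1))                                    // pair each word with 1
  .groupBy(_._1)                                  // local stand-in for reduceByKey:
  .map { case (w, ps) => (w, ps.map(_._2).sum) }  // sum the 1s per word

println(counts)  // e.g. Map(White -> 2, House -> 2, Air -> 1, ...)
```

In Spark, `reduceByKey(_ + _)` does the grouping and summing in one distributed step, combining partial sums on each partition before shuffling.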
64. Stand Alone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv

Car,Model,Color
Dodge,Caravan,Red
Ford,F150,Black
Toyota,Prius,Green

CSV → Spark (SCC) → RDD[CassandraRow] → FavoriteCars table in Cassandra (column mapping)
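The CSV-to-case-class step of such an app can be sketched locally. The `FavoriteCar` class and field names below are assumptions for illustration, not the repo's actual code; in the real app the resulting objects would be parallelized into an RDD and written out through the connector's `saveToCassandra`:

```scala
// Hypothetical case class matching the CSV columns (Car, Model, Color).
case class FavoriteCar(car: String, model: String, color: String)

val csvLines = Seq(
  "Dodge,Caravan,Red",
  "Ford,F150,Black",
  "Toyota,Prius,Green"
)

// Parse each line into a typed object via an Array pattern match.
val cars = csvLines.map { line =>
  val Array(car, model, color) = line.split(",")
  FavoriteCar(car, model, color)
}

println(cars.head)  // FavoriteCar(Dodge,Caravan,Red)
// In the app: sc.parallelize(cars).saveToCassandra("keyspace", "favoritecars")
// (keyspace/table names here are placeholders).
```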
  65. 65. Thanks for listening! There is plenty more we can do with Spark but … Questions?
66. Getting started with Cassandra?
DataStax Academy offers free online Cassandra training.
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages.
In production? Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org.
Need help? Get questions answered with Planet Cassandra's free virtual office hours, running weekly.
Email us: Community@DataStax.com — Tweet us: @PlanetCassandra
Thanks for coming to the meetup!
67. Thanks for your Time, and Come to C* Summit!
SEPTEMBER 10-11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL
Cassandra Summit Link