Apache Cassandra and Spark, combined, can provide powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both platforms before diving into applications that combine the two. Joins, changing a partition key, and importing data are usually difficult in Cassandra, but we’ll see how to do these and other operations in a set of simple Spark shell one-liners!

Transcript

  • 1. Escape From Hadoop: Spark One Liners for C* Ops Kurt Russell Spitzer DataStax
  • 2. Who am I? • Bioinformatics Ph.D. from UCSF • Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!! • Spends a lot of time spinning up clusters on EC2, GCE, Azure, … http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time • Developing new ways to make sure that C* scales
  • 3. Why escape from Hadoop? Hadoop means many moving pieces, MapReduce, single points of failure, and lots of overhead. And there is a way out!
  • 4. Spark provides a simple and efficient framework for distributed computations. Node roles: 2. In-memory caching: yes! Generic DAG execution: yes! Great abstraction for datasets? RDD! [Diagram: a Spark Master coordinating Spark Workers, each Worker running a Spark Executor over a Resilient Distributed Dataset]
  • 5. Spark is Compatible with HDFS, Parquet, CSVs, ….
  • 6. Spark is compatible with HDFS, Parquet, CSVs, …. AND APACHE CASSANDRA
  • 7. Apache Cassandra is a linearly scaling and fault-tolerant NoSQL database. Linearly scaling: the power of the database increases linearly with the number of machines; 2x machines = 2x throughput. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html Fault tolerant: nodes down != database down; datacenter down != database down.
  • 8. Apache Cassandra's architecture is very simple. Node roles: 1. Replication: tunable. Consistency: tunable. [Diagram: a client talking to a ring of C* nodes]
  • 9. DataStax OSS Connector: Spark to Cassandra https://github.com/datastax/spark-cassandra-connector Maps a Cassandra keyspace/table to an RDD[CassandraRow] or RDD[Tuples]. Bundled and supported with DSE 4.5!
  • 10. The Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*. Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver, and RDDs are read into different splits based on sets of tokens (tokens 1-1000, tokens 1001-2000, …) covering the full token range.
  • 11. Co-locate Spark and C* for best performance. Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
  • 12. Setting up C* and Spark. DSE 4.5.0+: just start your nodes with dse cassandra -k. Apache Cassandra: follow the excellent guide by Al Tobey http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
  • 13. We need a distributed system for analytics and batch jobs, but it doesn't have to be complicated!
  • 14. Even count needs to be distributed. Ask me to write a MapReduce for word count, I dare you. You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners on the Spark shell.
  • 15–17. Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
USE newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

scala> sc.cassandraTable("newyork", "presidentlocations").count
res3: Long = 10
  • 18–21. Basics: take() and toArray

scala> sc.cassandraTable("newyork", "presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

scala> sc.cassandraTable("newyork", "presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 3, location: White House},
  …,
  CassandraRow{time: 6, location: Air Force 1})
  • 22–24. Basics: Getting Row Values out of a CassandraRow

scala> sc.cassandraTable("newyork", "presidentlocations").take(1)(0).get[Int]("time")
res5: Int = 9

get[Int], get[String], … get[Any]; got null? get[Option[Int]]
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
  • 25–30. Copy a Table

Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time, character)
);

sc.cassandraTable("newyork", "presidentlocations")
  .map( row => (
    row.get[Int]("time"),
    "president",
    row.get[String]("location")
  )).saveToCassandra("newyork", "characterlocations")

cqlsh:newyork> SELECT * FROM characterlocations;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president |         NYC
 …
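Outside the Spark shell, the reshaping above is just a row-to-tuple map. A minimal plain-Scala sketch of the same transform, using stand-in (time, location) pairs instead of real CassandraRows (no Spark or Cassandra required):

```scala
// Stand-ins for a few presidentlocations rows: (time, location)
val presidentLocations = Seq(
  (1, "White House"),
  (5, "Air Force 1"),
  (8, "NYC")
)

// Widen each row with a constant "character" column so it fits
// the (time, character, location) shape of characterlocations.
val characterLocations = presidentLocations.map { case (time, location) =>
  (time, "president", location)
}
// characterLocations: Seq[(Int, String, String)]
```

In the real pipeline the connector maps each tuple field positionally onto the target table's columns when saveToCassandra runs.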
  • 31–36. Filter a Table

What if we want to filter based on a non-clustering key column?

scala> sc.cassandraTable("newyork", "presidentlocations")
  .filter( _.get[Int]("time") > 7 )
  .toArray

res9: Array[com.datastax.spark.connector.CassandraRow] =
Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)
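Note that the predicate here runs in Spark, not in C*: the connector reads the rows and Spark filters them afterwards. The filtering logic itself is ordinary Scala; a sketch over stand-in rows:

```scala
// Stand-ins for CassandraRow values: (time, location)
val rows = Seq((9, "NYC"), (10, "NYC"), (8, "NYC"), (3, "White House"))

// Same predicate as the one-liner: keep rows with time > 7
val recent = rows.filter { case (time, _) => time > 7 }
// recent == Seq((9, "NYC"), (10, "NYC"), (8, "NYC"))
```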
  • 37–40. Backfill a Table with a Different Key!

If we actually want to have quick access to timelines, we need a C* table with a different structure.

CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork", "characterlocations")
  .saveToCassandra("newyork", "timelines")

cqlsh:newyork> SELECT * FROM timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 |         NYC
 president |    9 |         NYC
 president |   10 |         NYC
  • 41–46. Import a CSV

I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line =>
    (line(0), line(1), line(2)) )
  .saveToCassandra("newyork", "timelines")

cqlsh:newyork> SELECT * FROM timelines WHERE character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 |           Court
  plissken |    5 |           Court
  plissken |    6 |           Court
  plissken |    7 |           Court
  plissken |    8 |  Stealth Glider
  plissken |    9 |             NYC
  plissken |   10 |             NYC
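The CSV pipeline is split-then-index on each line; the connector then converts the string fields to the target column types on save. A plain-Scala sketch of the parsing step, using a hypothetical sample line in the same shape as the file above:

```scala
// A stand-in line shaped like a PlisskenLocations.csv record
val lines = Seq("plissken,1,Federal Reserve")

// Same steps as the textFile pipeline: split on commas, then pick
// fields positionally into a (character, time, location) tuple.
val parsed = lines
  .map(_.split(","))
  .map(fields => (fields(0), fields(1), fields(2)))
// parsed == Seq(("plissken", "1", "Federal Reserve"))
```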
  • 47. Perform a Join with MySQL

Maybe a little more than one line … MySQL table "quotes" in "escape_from_ny":

import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark shell classpath
val quotes = new JdbcRDD(
  sc,
  () => {
    DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root") },
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  0,
  100,
  5,
  (r: ResultSet) => {
    (r.getInt(2), r.getString(3))
  }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
  • 48–51. Perform a Join with MySQL (continued)

The JdbcRDD needs to be in the form of RDD[K, V] to join:

quotes.join(
  sc.cassandraTable("newyork", "timelines")
    .filter( _.get[String]("character") == "plissken" )
    .map( row => (row.get[Int]("time"), row.get[String]("location")) ))
  .take(1)
  .foreach(println)

(5,
  (Bob Hauk: There was an accident.
   About an hour ago, a small jet went down inside New York City.
   The President was on board.
   Snake Plissken: The president of what?,
  Court)
)
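RDD.join is an inner join on the keys of two RDD[K, V]s. The same semantics over local collections, with stand-in values in place of the MySQL quotes:

```scala
// Stand-ins: quotes and locations, both keyed by time
val quotes    = Seq((5, "a quote at time five"))
val locations = Seq((5, "Court"), (9, "NYC"))

// Inner join: for each key present on both sides, pair the values
val joined = for {
  (qTime, quote)    <- quotes
  (lTime, location) <- locations
  if qTime == lTime
} yield (qTime, (quote, location))
// joined == Seq((5, ("a quote at time five", "Court")))
```

Only key 5 appears on both sides, so key 9 is dropped, just as unmatched rows are dropped by the RDD join above.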
  • 52–57. Easy Objects with Case Classes

We have the technology to make this even easier!

case class timelineRow(character: String, time: Int, location: String)
sc.cassandraTable[timelineRow]("newyork", "timelines")
  .filter( _.character == "plissken" )
  .filter( _.time == 8 )
  .toArray

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))
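The case-class version trades `get[T]("column")` lookups for ordinary field access, so the filters read like plain Scala. A sketch of the same filters over local instances of the class, no connector involved:

```scala
// The same shape the connector maps the timelines table onto
case class timelineRow(character: String, time: Int, location: String)

val rows = Seq(
  timelineRow("plissken", 8, "Stealth Glider"),
  timelineRow("president", 8, "NYC"),
  timelineRow("plissken", 9, "NYC")
)

// Field access instead of get[String]("character") / get[Int]("time")
val hit = rows.filter(_.character == "plissken").filter(_.time == 8)
// hit == Seq(timelineRow("plissken", 8, "Stealth Glider"))
```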
  • 58–63. A MapReduce for Word Count …

scala> sc.cassandraTable("newyork", "presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" ") )
  .map( (_, 1) )
  .reduceByKey( _ + _ )
  .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
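The pipeline above is the classic word count: explode locations into words, pair each word with 1, and sum per key. reduceByKey has no direct Seq equivalent, but groupBy-and-sum gives the same result on a local collection:

```scala
// Stand-in location strings, as read from the location column
val locations = Seq("White House", "White House", "Air Force 1", "NYC")

val counts = locations
  .flatMap(_.split(" "))   // explode each location into words
  .map((_, 1))             // pair each word with a count of 1
  .groupBy(_._1)           // group pairs by word (local reduceByKey stand-in)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// counts("White") == 2, counts("1") == 1
```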
  • 64. Stand-Alone App Example https://github.com/RussellSpitzer/spark-cassandra-csv Reads a CSV (Car, Model, Color: e.g. Dodge, Caravan, Red; Ford, F150, Black; Toyota, Prius, Green) through Spark and the Spark Cassandra Connector into an RDD[CassandraRow], column-mapped into a FavoriteCars table in Cassandra.
  • 65. Thanks for listening! There is plenty more we can do with Spark but … Questions?
  • 66. Getting started with Cassandra? DataStax Academy offers free online Cassandra training. Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages. Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org. Need help? Get questions answered with Planet Cassandra's free virtual office hours running weekly! Email us: Community@DataStax.com. In production? Tweet us: @PlanetCassandra. Thanks for coming to the meetup!
  • 67. Thanks for your time, and come to C* Summit! SEPTEMBER 10–11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL Cassandra Summit Link