Spark + Cassandra 
Carl Yeksigian 
DataStax
Spark 
-Fast large-scale data processing framework 
-Focused on in-memory workloads 
-Supports Java, Scala, and Python 
-Integrated machine learning support (MLlib) 
-Streaming support 
-Simple developer API
Resilient Distributed Dataset (RDD) 
-Presents a simple Collection API to the developer 
-Breaks full collection into partitions, which can be operated on independently 
-Knows how to recalculate itself if data is lost 
-Abstracts how to complete a job from the tasks
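The Collection API mentioned above can be sketched with plain Scala collections (the data here is made up): an RDD chain has exactly the same shape — map, filter, reduce — but each step runs distributed across partitions.

```scala
// Plain-Scala analogue of the RDD Collection API (hypothetical data).
// With an RDD the chain looks the same: rdd.map(...).filter(...).reduce(...)
val locations = Seq(101L, 205L, 310L, 204L)

val shifted = locations.map(_ + 1)    // transform every element
val small   = shifted.filter(_ < 300) // keep a subset
val total   = small.reduce(_ + _)     // aggregate to one value

println(total)
```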
RDD
RDD API
Partitions 
-Partitions can be created so they are on the 
same machine as the data
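A minimal plain-Scala sketch of the partition idea (hypothetical data): the collection is split into chunks that are processed independently, which is what lets Spark schedule each task on the machine that already holds that chunk's data.

```scala
// Split a collection into independent "partitions" and process each one
// on its own -- the same shape as RDD.mapPartitions followed by a reduce.
val data = (1 to 12).toSeq
val partitions = data.grouped(4).toSeq  // three "partitions" of four elements

// Sum each partition independently, then combine the partial results.
val partialSums = partitions.map(_.sum)
val grandTotal  = partialSums.sum

println(grandTotal)
```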
Uses for Spark with Cassandra 
-Ad-hoc queries 
-Joins, Unions across tables 
-Rewriting tables 
-Machine Learning
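The "joins across tables" use case boils down to pairing rows on a shared key. A plain-Scala sketch of that shape (table contents and keys here are made up); in Spark each side would come from sc.cassandraTable(...).map(row => (key, value)) and use RDD.join.

```scala
// Two "tables" keyed by sampleid (hypothetical data).
val variants = Seq(("s1", "A"), ("s2", "T"), ("s1", "G"))
val samples  = Map("s1" -> "case", "s2" -> "control")

// Inner join: attach each variant's sample label by sampleid.
val joined = variants.flatMap { case (id, allele) =>
  samples.get(id).map(label => (id, allele, label))
}

joined.foreach(println)
```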
spark-cassandra-connector 
DataStax OSS Project 
https://github.com/datastax/spark-cassandra-connector
Spark Cassandra Connector 
-Exposes Cassandra tables as RDDs 
-Read from and write to Cassandra 
-Data type mapping 
-Scala and Java support
Spark + Bioinformatics 
-ADAM is a bioinformatics project out of UC Berkeley AMPLab 
-Combines Spark + Parquet + Avro 
https://github.com/bigdatagenomics/adam 
http://bdgenomics.org/
Simple Variant 
case class Variant(
  sampleid: String,
  referencename: String,
  location: Long,
  allele: String)
create table adam.variants (
  sampleid ascii,
  referencename ascii,
  location bigint,
  allele ascii,
  // primary key not shown on the original slide; one plausible choice:
  primary key ((sampleid, referencename), location, allele));
Connecting to Cassandra 
import com.datastax.spark.connector._ 
// Spark connection options 
val conf = new SparkConf(true)
  .setMaster("spark://192.168.45.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.45.10")
val sc = new SparkContext(conf)
Saving To Cassandra 
val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0)) 
variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)
Querying Cassandra 
val rdd = sc.cassandraTable("adam", "variants")
  .map(r => (r.get[String]("allele"), 1L))
  .reduceByKey(_ + _)
  .map(r => (r._2, r._1))
  .sortByKey(ascending = false)
rdd.collect()
  .foreach(bc => println("%40s\t%d".format(bc._2, bc._1)))
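The allele-count query above is the classic word-count pattern. A plain-Scala analogue of the reduceByKey / sortByKey chain (the alleles here are made up):

```scala
// Count occurrences per allele, then sort descending by count --
// the same shape as the RDD chain on this slide.
val alleles = Seq("A", "T", "A", "G", "A", "T")

val counts = alleles
  .groupBy(identity)                            // reduceByKey(_ + _)
  .map { case (a, xs) => (a, xs.size.toLong) }
  .toSeq
  .sortBy { case (_, n) => -n }                 // sortByKey(ascending = false)

counts.foreach { case (a, n) => println("%40s\t%d".format(a, n)) }
```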
Thanks 
Acknowledgements: 
Timothy Danford (AMPLab) 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Jeff Hammerbacher (Cloudera/Mt Sinai)
