2. Spark
-Fast large-scale data processing framework
-Focused on in-memory workloads
-Supports Java, Scala, and Python
-Integrated machine learning support (MLlib)
-Streaming support
-Simple developer API
3. Resilient Distributed Dataset (RDD)
-Presents a simple Collection API to the developer
-Breaks the full collection into partitions, which can be operated on independently
-Knows how to recalculate itself if data is lost
-Abstracts how a job is completed away from its individual tasks
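The RDD properties above can be tried without a cluster. The following is a minimal sketch, assuming Spark is on the classpath; the object name RddSketch, the local[2] master, and the sample numbers are illustrative choices and not part of the original slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Returns (number of partitions, sum of the even squares in 1..100)
  def run(): (Int, Int) = {
    // local[2] runs Spark inside this JVM with two worker threads
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      // Break a local collection into 4 partitions
      val numbers = sc.parallelize(1 to 100, 4)
      // Collection-style transformations; these are recorded lazily
      val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)
      // The lineage Spark would replay to rebuild a lost partition
      println(evenSquares.toDebugString)
      // reduce is an action: it schedules one task per partition
      (evenSquares.partitions.length, evenSquares.reduce(_ + _))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Nothing is computed until reduce is called; filter and map only extend the lineage, which is what lets an RDD recalculate a partition if its data is lost.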
12. Connecting to Cassandra
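import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.345.10:7077") // Spark master URL
  .setAppName("cassandra-demo")
  // Cassandra contact point; newer connector releases name this key
  // "spark.cassandra.connection.host"
  .set("cassandra.connection.host", "192.168.345.10")
val sc = new SparkContext(conf)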
13. Saving To Cassandra
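import org.apache.spark.rdd.RDD

// Load variants from the VCF file given as the first argument
val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0))
// Flatten each variant into rows and write them to the adam.variants table
variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)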
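The slide does not show getVariant or the table layout, so the following is only a sketch of the same flatMap-then-save pattern under stated assumptions: the VariantRow case class, the toRows helper, the tab-separated input format, and the demo.variants table are all hypothetical stand-ins, and the host key uses the newer "spark.cassandra.connection.host" name. The master URL is expected to come from spark-submit.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical row shape; field names are assumed to match the table's columns
case class VariantRow(chromosome: String, position: Long,
                      reference: String, alternate: String)

object SaveSketch {
  // Stand-in for getVariant: one input line yields zero or more rows,
  // one per comma-separated alternate allele
  def toRows(line: String): Seq[VariantRow] = line.split("\t") match {
    case Array(chrom, pos, ref, alts) =>
      alts.split(",").toSeq.map(alt => VariantRow(chrom, pos.toLong, ref, alt))
    case _ => Seq.empty // skip lines that do not have exactly four fields
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("save-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
    val sc = new SparkContext(conf)
    try {
      sc.textFile(args(0))
        .flatMap(toRows)
        // Assumes keyspace "demo" and table "variants" already exist
        .saveToCassandra("demo", "variants",
          SomeColumns("chromosome", "position", "reference", "alternate"))
    } finally {
      sc.stop()
    }
  }
}
```

Using flatMap rather than map lets a single input record expand into several rows, or into none when the record cannot be parsed.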