Spark + Cassandra

Carl Yeksigian

An introduction to Spark and Cassandra.

Spark + Cassandra
Carl Yeksigian
DataStax

Spark
-Fast large-scale data processing framework
-Focused on in-memory workloads
-Supports Java, Scala, and Python
-Integrated machine learning support (MLlib)
-Streaming support
-Simple developer API

Resilient Distributed Dataset (RDD)
-Presents a simple Collection API to the developer
-Breaks the full collection into partitions, which can be operated on independently
-Knows how to recompute lost data from its lineage
-Hides how a job is split into tasks and completed
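
The points above can be sketched in Spark's local mode. This is a minimal illustration, not code from the talk: the object name `RddSketch`, the `local[2]` master, and the sample numbers are assumptions, and Spark core is assumed to be on the classpath. It uses an RDD like an ordinary collection, reports its partition count, and prints the lineage Spark would use to recompute a lost partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Sum of the squares of the even numbers in 1..n, computed across `parts` partitions.
  def sumEvenSquares(sc: SparkContext, n: Int, parts: Int): Long =
    sc.parallelize(1 to n, parts)
      .filter(_ % 2 == 0)
      .map(x => x.toLong * x)
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM with two worker threads (assumption for the demo)
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-sketch"))
    try {
      val rdd = sc.parallelize(1 to 10, 4).filter(_ % 2 == 0).map(x => x * x)
      println(rdd.partitions.length) // how many partitions the collection was split into
      println(rdd.toDebugString)     // the lineage: the chain of RDDs this one is derived from
      println(sumEvenSquares(sc, 10, 4))
    } finally sc.stop()
  }
}
```

Nothing is computed until `reduce` is called; `filter` and `map` only extend the lineage.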

Partitions
-Partitions can be created so they are on the same machine as the data
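
One way to see this in Spark's own API is the per-partition location preference. The sketch below is an illustration under assumptions: the host names `host1` and `host2` and the object name `LocalitySketch` are made up, and it only inspects the preferences rather than running a job on those hosts. A data source such as a Cassandra connector could supply the hosts that actually hold each partition's data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalitySketch {
  // Returns, for each partition, the hosts Spark would prefer to run it on.
  def preferredHosts(sc: SparkContext, rows: Seq[(String, Seq[String])]): Seq[Seq[String]] = {
    // makeRDD with (item, hosts) pairs creates one partition per item
    // and records the hosts as that partition's location preference.
    val rdd = sc.makeRDD[String](rows)
    rdd.partitions.toSeq.map(p => rdd.preferredLocations(p))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("locality-sketch"))
    try {
      // Hypothetical host names standing in for the machines that store each row
      val rows = Seq(("row-a", Seq("host1")), ("row-b", Seq("host2")))
      preferredHosts(sc, rows).foreach(println)
    } finally sc.stop()
  }
}
```

The scheduler treats these as preferences, not requirements: if no executor is available on a preferred host, the task runs elsewhere.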

Uses for Spark with Cassandra
-Ad-hoc queries
-Joins, Unions across tables
-Rewriting tables
-Machine Learning
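
As a rough sketch of the join and union cases, the example below uses small in-memory datasets as stand-ins for two tables keyed by sample id. The data, the object name `TableOpsSketch`, and the `local[2]` master are assumptions for illustration; it also assumes Spark 1.3 or later, where pair-RDD operations such as `join` need no extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TableOpsSketch {
  // Inner join on the key (here: a sample id); sorted so the output order is stable.
  def joinByKey(sc: SparkContext,
                left: Seq[(String, String)],
                right: Seq[(String, String)]): Seq[(String, (String, String))] =
    sc.parallelize(left).join(sc.parallelize(right)).collect().sorted.toSeq

  // Union keeps every row from both inputs, duplicates included.
  def unionCount(sc: SparkContext, a: Seq[String], b: Seq[String]): Long =
    sc.parallelize(a).union(sc.parallelize(b)).count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("table-ops-sketch"))
    try {
      // Hypothetical rows: (sampleid, allele) and (sampleid, cohort)
      val alleles = Seq(("s1", "A"), ("s2", "T"))
      val cohorts = Seq(("s1", "cohortX"))
      joinByKey(sc, alleles, cohorts).foreach(println)
      println(unionCount(sc, Seq("s1", "s2"), Seq("s2", "s3")))
    } finally sc.stop()
  }
}
```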

spark-cassandra-connector
DataStax OSS Project
https://github.com/datastax/spark-cassandra-connector

Spark Cassandra Connector
-Exposes Cassandra tables as RDDs
-Read from and write to Cassandra
-Data type mapping
-Scala and Java support

Spark + Bioinformatics
-ADAM is a bioinformatics project out of UC Berkeley AMPLab
-Combines Spark + Parquet + Avro
https://github.com/bigdatagenomics/adam
http://bdgenomics.org/

Simple Variant
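// Scala case class for one variant row
case class Variant (
  sampleid: String,
  referencename: String,
  location: Long,
  allele: String)

-- CQL table with matching column names
-- (CQL requires a primary key; the key below is an assumption, not from the slide)
create table adam.variants (
  sampleid ascii,
  referencename ascii,
  location bigint,
  allele ascii,
  primary key (sampleid, referencename, location, allele));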

Connecting to Cassandra
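import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.345.10:7077")
  .setAppName("cassandra-demo")
  // later connector releases name this property "spark.cassandra.connection.host"
  .set("cassandra.connection.host", "192.168.345.10")
val sc = new SparkContext(conf)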

Saving To Cassandra
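// Load variant calls from the VCF file named by the first argument (ADAM)
val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0))
// getVariant is not shown on the slide; flatMap implies it returns zero or more Variant rows per record
variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)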
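
The slides do not show `getVariant`. The sketch below is a hypothetical stand-in, not the author's function: it uses an invented input type `Call` in place of ADAM's `VariantContext` and assumes one output row per allele, only to illustrate why `flatMap` is the natural fit when one input record can yield zero, one, or several `Variant` rows.

```scala
// Row type from the "Simple Variant" slide
case class Variant(sampleid: String, referencename: String, location: Long, allele: String)

// Hypothetical input record (stand-in for ADAM's VariantContext, whose fields are not shown)
case class Call(sample: String, contig: String, position: Long, alleles: Seq[String])

object GetVariantSketch {
  // One Variant per allele; a call with no alleles contributes no rows.
  def toVariants(c: Call): Seq[Variant] =
    c.alleles.map(a => Variant(c.sample, c.contig, c.position, a))

  def main(args: Array[String]): Unit = {
    val calls = Seq(
      Call("s1", "chr1", 100L, Seq("A", "T")),
      Call("s2", "chr1", 200L, Seq.empty))
    // flatMap flattens the per-call sequences into a single list of rows
    calls.flatMap(toVariants).foreach(println)
  }
}
```

The same function shape works unchanged with an RDD, since `RDD.flatMap` accepts any function that returns a collection.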

Querying Cassandra
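// Count rows per allele, then sort by count, highest first
val rdd = sc.cassandraTable("adam", "variants")
  .map(r => (r.get[String]("allele"), 1L))
  .reduceByKey(_ + _)
  .map(r => (r._2, r._1)) // swap to (count, allele) so sortByKey orders by count
  .sortByKey(ascending = false)
// Print each allele, right-aligned, followed by a tab and its count
rdd.collect()
  .foreach(bc => println("%40s\t%d".format(bc._2, bc._1)))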
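
To try the same count-and-sort pipeline without a Cassandra cluster, the sketch below swaps `sc.cassandraTable` for an in-memory list of alleles in Spark's local mode. The sample data, the object name `AlleleCounts`, and the `local[2]` master are assumptions for illustration; the transformation steps mirror the query above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AlleleCounts {
  // Returns (count, allele) pairs ordered by count, highest first.
  def countsDescending(sc: SparkContext, alleles: Seq[String]): Seq[(Long, String)] =
    sc.parallelize(alleles)
      .map(a => (a, 1L))            // one (allele, 1) pair per row
      .reduceByKey(_ + _)           // sum the ones per allele
      .map(_.swap)                  // (count, allele), so the count becomes the sort key
      .sortByKey(ascending = false)
      .collect()
      .toSeq

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("allele-counts"))
    try {
      // Made-up alleles standing in for the rows of adam.variants
      val sample = Seq("A", "T", "A", "G", "A", "T")
      countsDescending(sc, sample)
        .foreach { case (count, allele) => println("%40s\t%d".format(allele, count)) }
    } finally sc.stop()
  }
}
```

Alleles with equal counts may appear in either order, because the sort key is the count alone.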

Thanks
Acknowledgements:
Timothy Danford (AMPLab)
Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Jeff Hammerbacher (Cloudera/Mt Sinai)

