Cassandra and Spark
powerful big data processing and storage combined
Alex Thompson @ Datastax Sydney - Spark Meetup May 2016 @ Macquarie Bank
Parallel execution*
The core of any parallel programming framework or parallel programming language is the function and
some very simple rules:
1. A function must take at least one argument and return a value.
2. Variables used within a function or passed to a function can have no other scope than that of the
function.
The passing of arguments to a function is the “message passing” you will often see referred to in
parallel programming languages and frameworks.
* You will see parallel execution referred to as parallel programming, distributed computing or cluster computing. You require a functional programming
language to work within this paradigm, or a language that has been retrofitted to work in parallel environments.
Functional programming example
function myFunction(arg1, arg2)
{
    // all inputs arrive as arguments (the "message"); nothing is read from outer scope
    var var1 = arg1;
    var var2 = arg2;
    var var3 = var1 * var2;
    // the only output is the return value
    return var3;
}
Because the above function is self-encapsulated (it receives a message, performs some work and returns a value), it can be
run on any core or any compute node asynchronously, in isolation from other processes.
A parallel programming framework can throw this function at an empty or underutilised core or node for processing, thus
distributing its workload across many servers.
Functional programming execution
[Diagram: the parallel framework pushes queued functions such as myFunction(arg1, arg2), function_1(), function_2(), function_3(), ... out to compute nodes 1, 2 and 3 for execution.]
The parallel framework distributes functions to compute nodes for execution.
Examples of parallel-capable frameworks: Scala Akka, the Erlang VM, C MPI, C-Linda, etc.
Functions can be pushed to nodes in a variety of ways: based on load, availability, simple round-robin, etc.
Many patterns exist for the design of parallel systems, e.g. MPI and the Actor pattern,
but all involve some form of message passing (see the sketch below).
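As an illustration of actor-style message passing, here is a minimal sketch using classic Akka actors; it is not part of the original slides, it assumes the akka-actor library is on the classpath, and the Multiply message and MultiplyWorker actor are invented for the example:

import akka.actor.{Actor, ActorSystem, Props}

// the message: everything the worker needs travels inside it
case class Multiply(a: Int, b: Int)

// the worker: no shared state, it only reacts to the messages it receives
class MultiplyWorker extends Actor {
  def receive = {
    case Multiply(a, b) => println(s"result = ${a * b}")
  }
}

object MessagePassingDemo extends App {
  val system = ActorSystem("demo")
  val worker = system.actorOf(Props[MultiplyWorker], "worker")
  worker ! Multiply(2, 3) // asynchronous send: no shared memory, just a message
}

Because the worker only ever sees the message it is handed, the framework is free to place it on any core or node.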
Hadoop split/job architecture
Hadoop was originally a file-based, general-purpose distributed file system and parallel execution
environment, developed to pull apart large web search logs by splitting up files, distributing the splits to
compute nodes and passing a Java application to those nodes to work over each split.
[Diagram: a master node M splits the data files and ships each split together with the job .jar to compute nodes 1-4.]
C* and Spark split/job architecture
You can do the same thing with Spark, but when you introduce C* into the equation your splits are
already done for you: the splits are simply the partitioning of data across the nodes in the C* ring:
[Diagram: the master M ships only the job .jar to nodes 1-4; each node already holds its own partition of the table data.]
The Spark stack
Spark workers and executors are deployed directly on the C* nodes; they run in a separate JVM but have
direct access (no network hops) to the C*-resident data on the local node.
[Diagram: the execution stack. On Node 1, Spark is installed alongside Cassandra; a Spark Worker hosts Executors running Tasks against the local table data, coordinated by the Driver and Cluster Manager. Node 2 and subsequent nodes repeat the pattern.]
Spark, Scala, C* and SparkC*Driver versions
Apache Spark, Typesafe Scala, Apache Cassandra and the SparkCassandraDriver* are all very fast-moving
projects; you will see major point releases every couple of months on some of them, so
you have to be very aware of the versions of the software stack when producing a solution.
[Diagram: the solution stack. Your Scala code sits on top of the Scala libraries, the Spark libraries and the driver*, built with the Scala build tool (sbt) and running on Spark.]
Memory allocation and Spark
Memory settings are required at each level:
● Driver application
● Spark master
● Workers
● Executors
The defaults are sufficient for ‘hello world’ type applications and lightweight processing; in the real world you will usually need more
memory allocated to workers and executors down at node level. All processing should be pushed down to the nodes
where possible: if your driver application is spiralling upward in RAM requirements or timing out, you are probably
doing something wrong, such as calling take() or collect() at driver level.
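A rough sketch of where these settings live; the numbers below are purely illustrative and depend on the RAM available on your nodes:

# conf/spark-defaults.conf - per-application defaults picked up when the job is submitted
spark.driver.memory      2g
spark.executor.memory    4g

# conf/spark-env.sh - upper limit for the worker daemon on each C* node
SPARK_WORKER_MEMORY=8g

# or per job on the command line:
# spark-submit --driver-memory 2g --executor-memory 4g ...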
Set up your development environment 1
The following is based on:
spark-cassandra-connector 1.2
spark 1.2
hadoop 1
scala version 2.10
Download the spark-cassandra-connector .zip file from the GitHub project:
https://github.com/datastax/spark-cassandra-connector
Unpack it, place it in the /opt directory and cd into it.
Download sbt-launch.jar from:
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/{version}/sbt-launch.jar
And place it in the spark-cassandra-connector/sbt directory.
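Roughly, the steps above look like this on the command line (the GitHub archive URL is an assumption about where the .zip lives, and {version} is whichever sbt-launch version you are targeting):

>cd /opt
>wget https://github.com/datastax/spark-cassandra-connector/archive/master.zip
>unzip master.zip
>cd spark-cassandra-connector-master
>wget -P sbt "http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/{version}/sbt-launch.jar"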
Set up your development environment 2
Build the connector by running the following from /opt/spark-cassandra-connector-master:
java -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m -jar sbt/sbt-launch.jar "assembly"
This will build a standard Scala connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar
And will build a Java connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.3.0-SNAPSHOT.jar
Then add this jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:
spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar
Set up your development environment 3
Install sbt on your OS (here Ubuntu):
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-get update
sudo apt-get install sbt
Create a hello world project following:
http://www.scala-sbt.org/0.13/tutorial/Hello.html
and run it:
>cd /opt/spark-cassandra-connector-project
>sbt
>run
Hi!
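For reference, the tutorial boils down to two small files in an empty project directory; a sketch based on the sbt 0.13 Hello tutorial (names and versions here are illustrative):

// build.sbt
name := "hello"

scalaVersion := "2.10.4"

// Hi.scala - sbt compiles any .scala file it finds in the project root or src/main/scala
object Hi {
  def main(args: Array[String]) = println("Hi!")
}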
Create a C* project
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CassandraTest {
  def main(args: Array[String]) {
    // point the connector at the local C* node and the job at the standalone Spark master
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)

    // expose the C* table test.users as an RDD
    val rdd = sc.cassandraTable("test", "users")
    println(rdd.count)
    println(rdd.first)
  }
}
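The project above also needs the Spark core and connector artifacts on its compile classpath. A minimal build.sbt sketch; the artifact versions are assumptions, so pick them from the compatibility table at the end of this deck:

name := "spark-cassandra-test"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" because the Spark runtime is already on the cluster
  "org.apache.spark"   %% "spark-core"                % "1.2.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
)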
Start your Spark Master and Workers
Start a standalone master server by executing:
>cd /opt/spark-1.2.1-bin-hadoop1
>./sbin/start-master.sh
(Note: Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to
SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.)
Start one or more workers and connect them to the master via:
>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
(e.g. ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://dse-vm:7077, where spark://dse-vm:7077 is the address of the master)
(Note: Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its
number of CPUs and memory (minus one gigabyte left for the OS)).
Submit your job:
>cd /opt/spark-cassandra-connector-project
>sbt
>run
You will now be given a list of the main classes sbt has found; choose the one you want and it will run.
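Alternatively, if you have built an assembly jar for your project, you can hand the job to spark-submit directly; a sketch only, the class name is the one from the earlier example and the jar path is hypothetical:

>cd /opt/spark-1.2.1-bin-hadoop1
>./bin/spark-submit --class CassandraTest --master spark://dse-vm:7077 /opt/spark-cassandra-connector-project/target/scala-2.10/your-assembly.jar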
Joins in SparkSQL:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.{SparkConf, SparkContext}

// CassandraCapable is a helper trait from the example project referenced below, not defined here
object SparkSqlQuery extends CassandraCapable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setJars(Array("target/scala-2.10/spark_bulk_ops-assembly.jar"))
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
    val connector = CassandraConnector(conf)
    //inserts...

    // run a join across two C* tables through the Cassandra-aware SQL context
    val cassandraContext = new CassandraSQLContext(sc)
    val rdd = cassandraContext.sql("select t.tag, count(*) as cnt from activity_stream_api.activity a " +
      "join activity_stream_api.tag_activity t on t.activity_id = a.activity_id group by t.tag order by cnt")
    rdd.collect().foreach(f => println(f))
  }
}
Joins in SparkSQL, bulk import/export into C* - everything you wanted to do with C* and Spark but were
too afraid to try: visit the following GitHub example code site, which also includes a full C* application template
you can use for your own systems:
https://github.com/rssvihla/spark_commons
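For the bulk-import direction, the connector also lets you write RDDs straight back to C*. A minimal sketch: it reuses the SparkContext sc from the earlier example, and assumes a test.users table with user_id and name columns:

import com.datastax.spark.connector._

// each tuple is mapped onto the named columns below
val rows = sc.parallelize(Seq((1, "alice"), (2, "bob")))
rows.saveToCassandra("test", "users", SomeColumns("user_id", "name"))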
Latest Versions and compatibility
Connector | Spark    | Cassandra        | Cassandra Java Driver
1.6       | 1.6      | 2.1.5*, 2.2, 3.0 | 3.0
1.5       | 1.5, 1.6 | 2.1.5*, 2.2, 3.0 | 3.0
1.4       | 1.4      | 2.1.5*           | 2.1
1.3       | 1.3      | 2.1.5*           | 2.1
1.2       | 1.2      | 2.1, 2.0         | 2.1
1.1       | 1.1, 1.0 | 2.1, 2.0         | 2.1
1.0       | 1.0, 0.9 | 2.0              | 2.0
* Compatible with 2.1.X where X >= 5
