Spark and Cassandra with the Datastax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but Still want to see some slides?
This slide deck is for you!
Apache Cassandra and Spark: You Got the Lighter, Let's Start the Fire - Patrick McFadin
An introduction to analyzing Apache Cassandra data using Apache Spark, covering data models, operations topics, and the internals of how Spark interfaces with Cassandra.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax) - DataStax
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D. in bioinformatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax, where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team, where he works on integration between Cassandra and Spark as well as other tools.
Escape From Hadoop: Spark One-Liners for C* Ops - Russell Spitzer
Apache Cassandra and Spark, when combined, can give powerful OLTP and OLAP functionality for your data. We'll walk through the basics of both of these platforms before diving into applications combining the two. Joins, changing a partition key, or importing data are usually difficult in Cassandra, but we'll see how to do these and other operations in a set of simple Spark shell one-liners!
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications most of them are use case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
Matthias works as an IT consultant at codecentric AG in Germany. His focus is on big data and streaming applications with Apache Cassandra and Apache Spark, yet he does not lose track of other tools in the big data space. Matthias shares his experiences at conferences, meetups and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He wrote a couple of journal articles and blog posts on subjects of both fields. His interests reach from legal questions to questions of architecture and design of cloud computing and big data systems to technical details of NoSQL databases.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 - StampedeCon
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any other similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.
A very short set of slides to describe an RDD data structure.
Extracted from my 3-day course: www.sparkInternals.com
There is also a video of this on YouTube: http://youtu.be/odcEg515Ne8
Owning Time Series with Team Apache (Strata San Jose 2015) - Patrick McFadin
Break out your laptops: this hands-on tutorial is geared around understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and how that can be a perfect fit for time series. Then we will add in Apache Spark as a perfect analytics companion. There will be coding as part of the hands-on tutorial. The goal will be to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process and store high-speed time series data.
An Introduction to Time Series with Team Apache - Patrick McFadin
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scaling and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series
- Postprocessing without ETL using Apache Spark on Cassandra
These slides give an overview of the different parts of Apache Spark.
We analyze the Spark shell in both Scala and Python, then cover Spark SQL with an introduction to the DataFrame API, and finally describe Spark Streaming and walk through some code examples.
Topics: spark-shell, pyspark, HDFS, how to copy files to HDFS, Spark transformations, Spark actions, Spark SQL (Shark), Spark Streaming, stateless vs. stateful streaming transformations, sliding windows, examples.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
A technical introduction to Apache Spark, the Swiss Army knife of big data analytics tools.
The talk was held at the Big Data User Group Mannheim, Germany, on 24 November 2014.
Apache Spark for Library Developers with Erik Erlandson and William Benton - Databricks
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community.
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Using Spark 1.2 with Java 8 and Cassandra - Denis Dus
A brief introduction to the Spark data processing model and a comparison of Java 7 and Java 8 usage with Spark, with examples of loading and processing data with the Spark Cassandra Loader.
Reactive App Using the Actor Model and Apache Spark - Rahul Kumar
Developing big data applications is really challenging work: scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that provides a good backend for real-time applications. In this talk I will show how to use a reactive platform, the actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault tolerant and message driven.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
Zero to Streaming: Spark and Cassandra
1. From 0 to Streaming
Cassandra and Spark Streaming
Russell Spitzer
2. Who am I?
• Bioinformatics Ph.D from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, … http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Writing FAQ's for Spark Troubleshooting: http://www.datastax.com/dev/blog/common-spark-troubleshooting
3. From 0 to Streaming
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
4. From 0 to Streaming
Connecting Cassandra To Spark
Spark Cassandra Connector
Spark SQL
RDD Basics
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
5. From 0 to Streaming
Connecting Cassandra To Spark
Spark Cassandra Connector
Spark SQL
RDD Basics
Spark Streaming
Streaming Basics
Writing Streaming Applications
Custom Receivers
Spark
How does it work?
What are the main Components?
Cluster Layout
Spark Submit
7. Spark is a Distributed Analytics Platform
HADOOP
• Has generalized DAG execution
• Integrated SQL queries
• Streaming
• Easy abstraction for datasets
• Support in lots of languages
All in one package!
8. Spark Provides a Simple and Efficient Framework for Distributed Computations
Node roles: 2
In-memory caching: Yes!
Generic DAG execution: Yes!
Great abstraction for datasets? RDD!
[Diagram: a Spark Master coordinating Spark Workers, whose Spark Executors hold the Spark Partitions of a Resilient Distributed Dataset]
9.–11. Spark Provides a Simple and Efficient Framework for Distributed Computations
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by the Worker; the workhorse of the Spark application
[Diagram, repeated on each of these slides: a Spark Master coordinating Spark Workers, whose Spark Executors hold the Spark Partitions of a Resilient Distributed Dataset]
12.–13. RDDs Can Be Generated from a Variety of Sources
• Text files
• Parallelized collections
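A minimal Spark shell sketch of both source types (the file path is made up for illustration):

val fromFile = sc.textFile("/tmp/numbers.txt")   // one element per line of the file
val fromCollection = sc.parallelize(1 to 100)    // distribute a local Scala collection
fromCollection.count                             // actions like count trigger evaluation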
14.–20. Transformations and Actions
RDDs are immutable
New RDDs are created with transformations
Only when we call an action are the transformations applied

val rdd  = sc.textFile("num.txt")
val rdd2 = rdd.map( x => x.toInt * 2 )
val rdd3 = rdd2.filter( _ > 4 )
rdd3.collect

[Diagram build across these slides: textFile creates rdd; map transforms it into rdd2; filter transforms that into rdd3; only the collect ACTION causes the chain of transformations to be evaluated]
21.–26. Application of Transformations Is Done One Partition per Executor
[Diagram build: an RDD with partitions 1–9 is spread across two Executors; each Executor applies the transformation to its own partitions (1 -> 1', 2 -> 2', …) until all partitions of RDD' are produced]
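A quick way to see this per-partition behaviour from the shell; a small sketch with made-up data:

val nums = sc.parallelize(1 to 9, 3)                 // an RDD with 3 partitions
nums.mapPartitionsWithIndex { (part, it) =>
  it.map(x => s"partition $part -> ${x * 2}")        // each partition is transformed independently
}.collect.foreach(println)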
27.–29. Failed Transformations Can Be Redone by Reapplying the Transformation to the Old Partition
[Diagram build: a node failure loses partition 5' of RDD'; the transformation is reapplied to partition 5 of the original RDD to recreate 5']
Because the operations on any partition can be traced backwards, we can recover from a failure without recomputing the entire RDD.
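One way to peek at the lineage Spark keeps for this recovery is toDebugString; a sketch using the RDDs from the earlier example:

rdd3.toDebugString   // prints the chain of parent RDDs: filter <- map <- textFile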
30. Use the Spark Shell to Quickly Try Out Code Samples
Available in both the Spark Shell and PySpark
31. The SparkContext Is the Core API for All Communication with Spark
val conf = new SparkConf()
.setAppName(appName)
.setMaster(master)
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
new SparkContext(conf)
Almost all options can also be set as environment
variables or on the command line during spark-submit!
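For example, the same Cassandra settings can be passed on the command line at submit time instead of in code; a sketch in which the host address and jar name are made up:

spark-submit --conf spark.cassandra.connection.host=10.0.0.1 \
             --conf spark.cassandra.auth.username=cassandra \
             --conf spark.cassandra.auth.password=cassandra \
             --class MainClass MyApp.jar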
32.–33. Deploy Compiled Jars Using Spark Submit
https://spark.apache.org/docs/1.1.0/submitting-applications.html
Some of the commonly used options are:
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--conf: Arbitrary Spark configuration property in key=value format.

spark-submit --class MainClass JarYouWantDistributedToExecutor.jar

[Diagram: spark-submit sends the jar to the Spark Master, which distributes it to the Spark Workers]
34. Co-locate Spark and C* for Best Performance
[Diagram: the Spark Master and Spark Workers running on the same nodes as the C* ring]
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing
35. Use a Separate Datacenter for Your Analytics Workloads
[Diagram: an OLTP C* datacenter replicating to an OLAP C* datacenter that runs the Spark Master and Workers]
37. DataStax OSS Connector: Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
[Diagram: Cassandra keyspaces and tables map to Spark RDD[CassandraRow] and RDD[Tuples]]
Bundled and supported with DSE > 4.5!
38. The Spark Cassandra Connector Uses the DataStax Java Driver to Read from and Write to C*
Each Executor maintains a connection to the C* cluster through the DataStax Java Driver.
[Diagram: the full token range is divided into splits (tokens 1–1000, tokens 1001–2000, …); RDDs are read as different splits based on these sets of tokens]
39. Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
40. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
41. Requirements for the Following Code Examples
The following examples are targeted at:
Spark 1.1.X
Cassandra 2.0.X
or if you are using DataStax Enterprise
DSE 4.6.x
42.–44. Basics: Getting a Table and Counting

CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use candy;
CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) );
CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) );
INSERT INTO inventory (brand, name, amount) VALUES ( 'Wonka', 'Gobstopper', 10 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'Wonka', 'WonkaBar', 3 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'CandyTown', 'SugarMountain', 2 );
INSERT INTO inventory (brand, name, amount) VALUES ( 'CandyTown', 'ChocoIsland', 5 );
INSERT INTO requests (user, name, amount) VALUES ( 'Russ', 'WonkaBar', 2 );
INSERT INTO requests (user, name, amount) VALUES ( 'Russ', 'ChocoIsland', 1 );

scala> val rdd = sc.cassandraTable("candy","inventory")
scala> rdd.count
res13: Long = 4
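The connector can also project just the columns you need so that less data is pulled out of C*; a small sketch against the same table:

scala> sc.cassandraTable("candy","inventory").select("name", "amount").collect
// each CassandraRow now contains only the name and amount columns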
49.–56. Getting Values from Cassandra Rows

scala> sc.cassandraTable("candy","inventory").take(1)(0).get[Int]("amount")
res5: Int = 10

[Diagram: cassandraTable -> take(1) returns an Array of CassandraRows -> get[Int]("amount") extracts the value 10 from the row (Wonka, Gobstopper, 10)]

The same row can instead be read into a case class, so columns are accessed as fields:

scala> case class invRow(brand: String, name: String, amount: Integer)
scala> sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount

[Diagram: cassandraTable[invRow] -> take(1) returns an Array of invRows -> .amount extracts 10]

Supported type mappings: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html
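Rows can also be read directly as Scala tuples when only a couple of columns are needed; a small sketch against the same candy.inventory table:

scala> sc.cassandraTable[(String, Int)]("candy","inventory").select("name", "amount").take(1)
// each row is returned as a (name, amount) pair instead of a CassandraRow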
57.–61. Saving Back to Cassandra

CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name ));

sc.cassandraTable[invRow]("candy","inventory")
  .filter( _.amount < 5 )
  .saveToCassandra("candy","low")

[Diagram: cassandraTable -> filter on amount < 5 (using the anonymous parameter _) -> saveToCassandra writes the matching rows back to the C* cluster]
Under the hood this is done via the Cassandra Java Driver
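RDDs that did not come from Cassandra can be written the same way; a sketch that saves a plain Scala collection into the inventory table (the new row values are made up):

import com.datastax.spark.connector._
sc.parallelize(Seq(("Wonka", "Runts", 8)))
  .saveToCassandra("candy", "inventory", SomeColumns("brand", "name", "amount"))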
62. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
63. Spark SQL Provides a Fast SQL-Like Syntax for Cassandra!
[Diagram: HQL/SQL -> Catalyst -> Query Plan (grab data, filter, group, return results) -> SchemaRDD]
SQL in, RDDs out
64. Building a Context Object for Interacting with Spark SQL
In the DSE Spark Shell both the HiveContext and the CassandraSQLContext are created automatically on startup.

Scala:
import org.apache.spark.sql.cassandra.CassandraSQLContext
val sc: SparkContext = ...
val csc = new CassandraSQLContext(sc)

Java:
JavaSparkContext jsc = new JavaSparkContext(conf);
// create a Cassandra Spark SQL context
CassandraSQLContext csc = new CassandraSQLContext(jsc.sc());

Since the HiveContext requires the Hive driver to access C* directly, the HiveContext is only available in DSE.
Workaround: get SchemaRDDs with the CassandraSQLContext, then register them with the HiveContext.
65.–67. Reading Data from Cassandra with SQL Syntax

scala> csc.sql("SELECT * FROM candy.inventory").collect
Array[org.apache.spark.sql.Row] = Array(
  [Wonka,Gobstopper,10],
  [Wonka,WonkaBar,3],
  [CandyTown,ChocoIsland,5],
  [CandyTown,SugarMountain,2]
)

[Diagram: the SQL text becomes a QueryPlan, which executes and returns a SchemaRDD]
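Ordinary SQL predicates work as well; a short sketch filtering the same table:

scala> csc.sql("SELECT name, amount FROM candy.inventory WHERE amount > 3").collect
// with the sample data this returns only Gobstopper (10) and ChocoIsland (5)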
68.–69. Counting Data from Cassandra with SQL Syntax

scala> csc.sql("SELECT COUNT(*) FROM candy.inventory").collect
res5: Array[org.apache.spark.sql.Row] = Array([4])
70.–71. Joining Data from Cassandra with SQL Syntax

scala> csc.sql("
  SELECT * FROM candy.inventory as inventory
  JOIN candy.requests as requests
  WHERE inventory.name = requests.name").collect
res12: Array[org.apache.spark.sql.Row] = Array(
  [Wonka,WonkaBar,3,Russ,WonkaBar,2],
  [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]
)
72.–73. Insert into Another Cassandra Table

csc.sql("
  INSERT INTO candy.low
  SELECT * FROM candy.inventory as inv
  WHERE inv.amount < 5
").collect
75.–78. Streaming Is Cool
(and if you like streaming you will be cool too)
Your data is delicious, like candy. You want it right now!
Batch analytics: Waiting to do analysis after data has accumulated means data may be out of date or unimportant by the time we process it.
Streaming analytics: We do our analytics on the data as it arrives. The data won't be stale and neither will our analytics.
79.–80. DStreams: The Basic Unit of Spark Streaming
[Diagram: a Receiver turns incoming events into a DStream, which is divided into batches of RDDs]
Streaming involves a receiver or set of receivers, each of which publishes a DStream.
The DStream is discretized into batches, whose timing is set in the Spark Streaming Context. Each batch is made up of RDDs.
83. Demo Streaming Application: Analyze HttpRequests with Spark Streaming
[Diagram: HTTP server traffic is received by Spark Executors and written to Cassandra]
Source included in DSE 4.6.0
84. Spark Receivers Only Really Need to Describe How to Publish to a DStream

case class HttpRequest(
    timeuuid: UUID,
    method: String,
    headers: Map[String, List[String]],
    uri: URI,
    body: String)

First we need to define a case class to make moving HttpRequest information around easier. This type will be used to specify what type of DStream we are creating.
85. Spark Receivers Only Really Need to Describe How to Publish to a DStream

class HttpReceiver(port: Int)
  extends Receiver[HttpRequest](StorageLevel.MEMORY_AND_DISK_2)
  with Logging {

  def onStart(): Unit = {}
  def onStop(): Unit = {}
}

Now we just need to write the code for a receiver to actually publish these HttpRequest objects.
86. Spark Receivers Only Really Need to Describe How to Publish to a DStream

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

def onStart(): Unit = {
  val s = HttpServer.create(new InetSocketAddress(p), 0)
  s.createContext("/", new StreamHandler())
  s.start()
  server = Some(s)
}

def onStop(): Unit = server map (_.stop(0))

This will start up our server and direct all HTTP traffic to be handled by StreamHandler.
[Diagram: the Receiver[HttpRequest] now contains an HttpServer]
87. Spark Receivers Only Really Need to Describe How to Publish to a DStream

class StreamHandler extends HttpHandler {
  override def handle(transaction: HttpExchange): Unit = {
    val dataReader = new BufferedReader(new InputStreamReader(transaction.getRequestBody))
    val data = Stream.continually(dataReader.readLine).takeWhile(_ != null).mkString("\n")
    val headers: Map[String, List[String]] =
      transaction.getRequestHeaders.toMap.map { case (k, v) => (k, v.toList) }
    store(HttpRequest(
      UUIDs.timeBased(),
      transaction.getRequestMethod,
      headers,
      transaction.getRequestURI,
      data))
    transaction.sendResponseHeaders(200, 0)
    val response = transaction.getResponseBody
    response.close()      // Empty response body
    transaction.close()   // Finish transaction
  }
}

StreamHandler actually does the work of publishing events to the DStream.
[Diagram: the Receiver[HttpRequest] contains an HttpServer and a StreamHandler]
88.–90. The Streaming Context Sets the Batch Timing; Create One Receiver per Node and Merge the Separate DStreams into One

val ssc = new StreamingContext(conf, Seconds(5))
val multipleStreams = (1 to config.numDstreams).map { i =>
  ssc.receiverStream[HttpRequest](new HttpReceiver(config.port))
}
val requests = ssc.union(multipleStreams)

[Diagram: one Receiver[HttpRequest] (HttpServer + StreamHandler) per node; ssc.union merges them into the single requests DStream]
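A quick way to sanity-check the merged stream while developing is to count each batch; a minimal sketch:

requests.count().print()   // prints the number of HttpRequests received in each 5-second batch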
91.–93. Cassandra Tables to Store HttpEvents

Persist every event that comes into the system:
CREATE TABLE IF NOT EXISTS timeline (
  timesegment bigint,
  url text,
  t_uuid timeuuid,
  method text,
  headers map<text, text>,
  body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);

Table for counting the number of accesses to each url over time:
CREATE TABLE IF NOT EXISTS method_agg (
  url text,
  method text,
  time timestamp,
  count bigint,
  PRIMARY KEY ((url, method), time)
);

Table for finding the most popular url in each batch:
CREATE TABLE IF NOT EXISTS sorted_urls (
  url text,
  time timestamp,
  count bigint,
  PRIMARY KEY (time, count)
);
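The save code on the following slides maps each event into row case classes (timelineRow, methodAggRow, sortedUrlRow) that are not shown in the transcript; a plausible sketch, assuming the fields simply mirror the table columns above and that batch time is passed as epoch milliseconds:

import java.util.UUID
case class timelineRow(timesegment: Long, url: String, t_uuid: UUID, method: String,
                       headers: Map[String, String], body: String)
case class methodAggRow(url: String, method: String, time: Long, count: Long)
case class sortedUrlRow(url: String, time: Long, count: Long)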
94.–96. Persist the Events Without Doing Any Manipulation

requests.map { request =>
  timelineRow(
    timesegment = UUIDs.unixTimestamp(request.timeuuid) / 10000L,
    url         = request.uri.toString,
    t_uuid      = request.timeuuid,
    method      = request.method,
    headers     = request.headers.map { case (k, v) => (k, v.mkString("#")) },
    body        = request.body)
}.saveToCassandra("requests_ks", "timeline")

[Diagram: each HttpRequest becomes a timelineRow (timesegment, url, t_uuid, method, headers, body), which saveToCassandra writes to the C* cluster]
97.–101. Aggregate Requests by URI and Method

requests.map(request => (request.method, request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map { case ((m, u), c) => ((m, u), c, time.milliseconds) })
  .map { case ((m, u), c, t) => methodAggRow(time = t, url = u, method = m, count = c) }
  .saveToCassandra("requests_ks", "method_agg")

[Diagram: (method, uri) pairs -> countByValue produces counts -> transform attaches the batch time -> methodAggRow rows are written to C* with saveToCassandra]
102.–105. Sort Aggregates by Batch

requests.map(request => (request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map { case (u, c) => (u, c, time.milliseconds) })
  .map { case (u, c, t) => sortedUrlRow(time = t, url = u, count = c) }
  .saveToCassandra("requests_ks", "sorted_urls")

[Diagram: uri -> countByValue produces (uri, count) -> transform attaches the batch time -> sortedUrlRow rows are written to C* with saveToCassandra]
Let Cassandra do the sorting! PRIMARY KEY (time, count)
106. Start the application!
ssc.start()
ssc.awaitTermination()
This will start the streaming application
piping all incoming data to Cassandra!
107. Live Demo
Demo run Script
#Start Streaming Application
echo "Starting Streaming Receiver(s): Logging to http_receiver.log"
cd HttpSparkStream
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES > ../http_receiver.log 2>&1 &
cd ..
echo "Waiting for 60 Seconds for streaming to come online"
sleep 60
#Start Http Requester
echo "Starting to send requests against streaming receivers: Logging to http_requester.log"
cd HttpRequestGenerator
./sbt/sbt "run -i $SPARK_NODE_IPS " > ../http_requester.log 2>&1 &
cd ..
#Monitor Results Via Cqlsh
watch -n 5 './monitor_queries.sh'
109. I hope this gives you some
exciting ideas for your
applications!
Questions?
110. Thanks for coming to the meetup!!
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly!
Email us: Community@DataStax.com!
Getting started with Cassandra?!
In production?!
Tweet us: @PlanetCassandra!