©2013 DataStax Confidential. Do not distribute without consent.
@chbatey
Christopher Batey

Spark overview for C* developers
@chbatey
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational Performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query
@chbatey
Cassandra cannot join or aggregate
Client
Where do I go for the max?
@chbatey
But but…
• Sometimes you don’t need an answer in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes
@chbatey
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
@chbatey
Components
(Stack diagram: Shark or Spark SQL, Streaming, ML and Graph libraries all sit on top of Spark, the general execution engine, and all are Cassandra compatible.)
@chbatey
Spark architecture
@chbatey
org.apache.spark.rdd.RDD
• Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, ...) or other RDDs
• Immutable
• Partitioned
• Reusable
@chbatey
RDD Operations
• Transformations - Similar to Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
• Require materialization of the records to generate a value
• collect: Array[T], count, fold, reduce..
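A minimal sketch of the distinction, assuming a SparkContext named sc: transformations only build up a lineage of RDDs, and nothing executes until an action is called.

val numbers = sc.parallelize(1 to 100)       // RDD[Int]
val evens   = numbers.filter(_ % 2 == 0)     // transformation: nothing runs yet
val doubled = evens.map(_ * 2)               // still just a described computation

println(doubled.count())                     // action: a job runs, prints 50
println(doubled.reduce(_ + _))               // action: prints 5100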
@chbatey
Word count
val file: RDD[String] = sc.textFile("hdfs://...")

val counts: RDD[(String, Int)] = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
@chbatey
Spark shell
Operator Graph: Optimisation and Fault Tolerance
(Diagram: an operator graph of RDDs A–F linked by map, filter, groupBy and join transformations, split into Stages 1–3; cached partitions are highlighted.)
@chbatey
Partitioning
• Large data sets from S3, HDFS, Cassandra etc
• Split into small chunks called partitions
• Each operation is done locally on a partition before combining with other partitions
• So partitioning is important for data locality (see the sketch below)
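A minimal sketch, assuming a SparkContext named sc: check how many partitions an RDD has, and do per-partition work before combining the results.

val numbers = sc.parallelize(1 to 1000, numSlices = 8)
println(numbers.partitions.length)                         // 8

val perPartitionSums = numbers.mapPartitions(it => Iterator(it.sum))
println(perPartitionSums.collect().toList)                 // 8 partial sums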
@chbatey
Spark Streaming
@chbatey
Cassandra
@chbatey
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
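A minimal sketch of the object-mapping bullet, assuming import com.datastax.spark.connector._ and the test.kv table created later in the talk:

case class KV(key: String, value: Int)

val typed = sc.cassandraTable[KV]("test", "kv")        // rows mapped onto the case class
typed.filter(_.value > 5).collect().foreach(println)   // e.g. KV(chris,10)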
@chbatey
Analytics Workload Isolation
@chbatey
Deployment
• Spark worker in each of the Cassandra nodes
• Partitions made up of LOCAL Cassandra data
(Diagram: four nodes, each running a Spark worker (S) alongside Cassandra (C).)
@chbatey
Example Time
@chbatey
It is on Github
"org.apache.spark" %% "spark-core" % sparkVersion



"org.apache.spark" %% "spark-streaming" % sparkVersion



"org.apache.spark" %% "spark-sql" % sparkVersion



"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion



"com.datastax.spark" % "spark-cassandra-connector_2.10" % connectorVersion
@chbatey
Boiler plate
import com.datastax.spark.connector.rdd._
import org.apache.spark._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

object BasicCassandraInteraction extends App {
  // "127.0.0.1" is the Cassandra host
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
  // "local[4]" is the Spark master, e.g. spark://host:port
  val sc = new SparkContext("local[4]", "AppName", conf)
  // cool stuff
}
@chbatey
Executing code against the driver
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}
@chbatey
Reading data from Cassandra
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

val rdd: CassandraRDD[CassandraRow] = sc.cassandraTable("test", "kv")

println(rdd.max()(new Ordering[CassandraRow] {
  override def compare(x: CassandraRow, y: CassandraRow): Int =
    x.getInt("value").compare(y.getInt("value"))
}))
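An equivalent, slightly simpler form (a sketch): project the value column first, then rely on the built-in Int ordering.

val maxValue = sc.cassandraTable("test", "kv").map(_.getInt("value")).max()
println(maxValue)   // 10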
@chbatey
Word Count + Save to Cassandra
val textFile: RDD[String] = sc.textFile("Spark-Readme.md")

val words: RDD[String] = textFile.flatMap(line => line.split("\\s+"))
val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)

println(wordCounts.first())

wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))
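The slide assumes the target table already exists; a minimal sketch of creating it first, with column names matching the SomeColumns above:

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.words(word text PRIMARY KEY, count int)")
}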
@chbatey
Migrating from an RDBMS
create table store(
  store_name varchar(32) primary key,
  location varchar(32),
  store_type varchar(10));

create table staff(
  name varchar(32) primary key,
  favourite_colour varchar(32),
  job_title varchar(32));

create table customer_events(
  id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer varchar(12),
  time timestamp,
  event_type varchar(16),
  store varchar(32),
  staff varchar(32),
  foreign key fk_store(store) references store(store_name),
  foreign key fk_staff(staff) references staff(name))
@chbatey
Denormalised table
CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))
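A minimal sketch of the read this table is modelled for, using the connector's where clause ("chris" is a placeholder customer id): all events for one customer come from a single partition, in time order.

val chrisEvents = sc.cassandraTable("test", "customer_events")
  .where("customer_id = ?", "chris")
chrisEvents.collect().foreach(println)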
@chbatey
Migration time


val customerEvents = new JdbcRDD(sc, () => { DriverManager.getConnection(mysqlJdbcString) },
  "select * from customer_events ce, staff, store where ce.store = store.store_name and ce.staff = staff.name " +
    "and ce.id >= ? and ce.id <= ?", 0, 1000, 6,
  (r: ResultSet) => {
    // tuple order must line up with the SomeColumns list below
    (r.getString("customer"),
      r.getTimestamp("time"),
      UUID.randomUUID(),
      r.getString("event_type"),
      r.getString("store_name"),
      r.getString("store_type"),
      r.getString("location"),
      r.getString("staff"),
      r.getString("job_title"))
  })

customerEvents.saveToCassandra("test", "customer_events",
  SomeColumns("customer_id", "time", "id", "event_type", "store_name", "store_type", "store_location", "staff_name", "staff_title"))
@chbatey
Issues with denormalisation
• What happens when I need to query the denormalised
data a different way?
@chbatey
Store it twice
CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))

CREATE TABLE IF NOT EXISTS customer_events_by_staff(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((staff_name), time, id))
@chbatey
My reaction a year ago
@chbatey
Too simple
val events_by_customer = sc.cassandraTable("test", "customer_events")

events_by_customer.saveToCassandra("test", "customer_events_by_staff",
  SomeColumns("customer_id", "time", "id", "event_type", "staff_name",
    "staff_title", "store_location", "store_name", "store_type"))

@chbatey
Aggregations with Spark SQL
(Diagram: a customer_events partition laid out by partition key and clustering columns.)
@chbatey
Now now…
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")

val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events " +
  "GROUP BY store_name, event_type")
rdd.collect().foreach(println)

Output:
[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]
@chbatey
Lambda architecture
http://lambda-architecture.net/
@chbatey
Spark Streaming
@chbatey
Network word count
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}

val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")

val words = lines.flatMap(_.split("\\s+"))
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")
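Nothing flows until the streaming context is started; a sketch of the final step (pair it with nc -lk 9999 in another terminal to feed lines into port 9999):

ssc.start()            // start receiving and processing a batch every 5 seconds
ssc.awaitTermination() // block until the streaming job is stopped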
@chbatey
Kafka
• Partitioned pub/sub system
• Very high throughput
• Very scalable
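The write(...) helper used on the next slide is not shown in the deck; a hypothetical sketch of what it could look like, assuming the org.apache.kafka producer API and json4s for serialisation (the topic name, field names and broker address are all assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.json4s.DefaultFormats
import org.json4s.native.Serialization

// Assumed to mirror the fields used on the slides
case class CustomerEvent(customer_id: String, staff_id: String, store_type: String,
                         group: String, content: String, event_type: String)

object EventProducer {
  implicit val formats = DefaultFormats

  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  // Serialise the event to JSON and publish it to a hypothetical topic
  def write(event: CustomerEvent): Unit =
    producer.send(new ProducerRecord[String, String]("customer-events", Serialization.write(event)))
}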
@chbatey
Stream processing customer events
val joeBuy = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeBuy2 = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeSell = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "SELL"))
val chrisBuy = write(CustomerEvent("chris", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events_by_type ( nameAndType text primary key, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events ( " +
    "customer_id text, " +
    "staff_id text, " +
    "store_type text, " +
    "group text static, " +
    "content text, " +
    "time timeuuid, " +
    "event_type text, " +
    "PRIMARY KEY ((customer_id), time) )")
}
@chbatey
Save + Process
val rawEvents: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

val events: DStream[CustomerEvent] = rawEvents.map({ case (k, v) =>
  parse(v).extract[CustomerEvent]
})

events.saveToCassandra("streaming", "customer_events")

val eventsByCustomerAndType = events.map(event =>
  (s"${event.customer_id}-${event.event_type}", 1)).reduceByKey(_ + _)

eventsByCustomerAndType.saveToCassandra("streaming", "customer_events_by_type")
@chbatey
Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark Streaming + Cassandra allow us to build online analytical platforms
@chbatey
Thanks for listening
• Follow me on twitter @chbatey
• Cassandra + fault tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• Github for all examples:
• https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/
