Real Time Analytics with
DataStax Enterprise
Ryan Knight
@Knight_Cloud
Solution Engineer - DataStax
© 2014 DataStax, All Rights Reserved
Introduction to Spark
Hadoop Limitations
• Master / Slave Architecture
• Every Processing Step requires Disk IO
• Difficult API and Programming Model
• Designed for batch-mode jobs
• No event-streaming / real-time support
• Complex Ecosystem
Hadoop?
Apps in the early 2000s were written for → Apps today are written for:
• Single machines → Clusters of machines
• Single core processors → Multicore processors
• Expensive RAM → Cheap RAM
• Expensive disk → Cheap disk
• Slow networks → Fast networks
• Few concurrent users → Lots of concurrent users
• Small data sets → Large data sets
• Latency in seconds → Latency in milliseconds
What is Spark?
• Fast and general compute engine for large-scale data
processing
• Fault Tolerant Distributed Datasets
• Distributed Transformation on Datasets
• Integrated Batch, Iterative and Streaming Analysis
• In Memory Storage with Spill-over to Disk
Advantages of Spark
• Improves efficiency through:
• In-memory data sharing
• General computation graphs - lazy evaluation of data
• 10x faster on disk, 100x faster in memory than
Hadoop MR
• Improves usability through:
• Rich APIs in Java, Scala, and Python
• 2 to 5x less code
• Interactive shell
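The "lazy evaluation" point deserves a concrete illustration. The sketch below is a plain-Scala analogy, not the Spark API: an Iterator records transformations without running them, the way an RDD records a lineage of transformations, and nothing executes until an action forces the result.

```scala
// Plain-Scala stand-in for Spark's lazy transformation model (illustration only).
var evaluated = false

val data = (1 to 10).iterator // stand-in for a distributed dataset

// Transformations: build a lazy pipeline; nothing is computed yet
val transformed = data
  .map { x => evaluated = true; x * 2 } // like rdd.map(_ * 2)
  .filter(_ % 4 == 0)                   // like rdd.filter(_ % 4 == 0)

assert(!evaluated) // the pipeline has not run

// Action: forces evaluation of the whole pipeline (like rdd.collect())
val result = transformed.toList
assert(evaluated)
```

Because the full computation graph is known before anything runs, Spark can fuse steps and avoid materializing intermediate data, which is one source of its speedup over disk-bound MapReduce stages.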
© 2015. All Rights Reserved.
Scala for Data Analytics
• Functional Paradigm is ideal for Data Analytics
• Strongly Typed - Enforce Schema at Every Layer
• Immutable by Default - Event Logging
• Declarative instead of Imperative - Focus on Transformation, not Implementation
Spark Streaming
Spark Versus Spark Streaming
Spark Streaming General Architecture
DStream Micro Batches
Windowing
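The micro-batch and windowing model can be sketched in plain Scala (an analogy, not the Spark Streaming API): a DStream is a sequence of micro-batches, and a window of length 3 batch intervals sliding by 1 interval is just the concatenation of consecutive batches.

```scala
// Plain-Scala sketch of DStream micro-batches and windowing (illustration only).
val microBatches = Seq(Seq(1, 2), Seq(3), Seq(4, 5), Seq(6)) // one Seq per batch interval

val windowLength  = 3 // in batch intervals (Spark: window duration / batch duration)
val slideInterval = 1

// Each window is the concatenation of the batches it covers
val windows = microBatches
  .sliding(windowLength, slideInterval)
  .map(_.flatten)
  .toList

// A windowed aggregate, in the spirit of countByWindow
val windowedCounts = windows.map(_.size)
```

Each new window reuses most of the previous one, which is why Spark Streaming can compute windowed aggregates incrementally.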
Spark Cassandra Connector
Spark is about Data Analytics
• How do we get data into Spark?
• How can we work with large datasets?
• What do we do with the results of the analytics?
Spark Cassandra Connector
The Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*. Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver. The full token range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, …), and RDDs are read as different splits based on those sets of tokens.
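The token-range mapping can be sketched in plain Scala. This is a simplification of what the connector actually does (real splits come from the cluster's token metadata and replica placement; `TokenRange` and `splitTokenRange` are illustrative names, not connector API):

```scala
// Simplified sketch: divide the full token range into contiguous splits,
// each of which becomes one RDD partition read by an executor.
case class TokenRange(start: Long, end: Long)

def splitTokenRange(full: TokenRange, numSplits: Int): Seq[TokenRange] = {
  val width = (full.end - full.start) / numSplits
  (0 until numSplits).map { i =>
    val s = full.start + i * width
    val e = if (i == numSplits - 1) full.end else s + width
    TokenRange(s, e)
  }
}

// Three contiguous splits covering the full range 1..3000
val splits = splitTokenRange(TokenRange(1, 3000), 3)
```

Because each split corresponds to a contiguous slice of the token ring, the connector can schedule a split on a Spark worker that is co-located with the Cassandra replica owning those tokens — the basis of its data-locality awareness.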
Spark Cassandra Connector
Connector Token Range Mapping
Spark Cassandra Connector
• Data locality-aware (speed)
• Read from and Write to Cassandra
• Cassandra Tables Exposed as RDD and DataFrames
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Mapping of Java Types to Cassandra Types
Spark Cassandra Connector
• Open Source Project
• Requires maintaining separate Cassandra and Spark
Clusters
• Spark Master is not Highly Available without
Zookeeper
• Submitting Spark Applications requires hard-coded Spark Master and Cassandra locations
DataStax Enterprise Data Platform
Workload Segregation w/out ETL
• Cassandra - OLTP Database
• Analytics - Streaming and Analytics
• Search - All Data Searchable
• Graph - Graph Data Structure (coming this year)
DSE Analytics with Spark
Internal / Administrative Benefits
• DSE Analytic Nodes configured to run Spark
• No need to run a separate Spark Cluster
• Simplified Deployment and Management
• No need to specify Spark Master and Cassandra Host
• High Availability of Spark Master
DataStax Enterprise Platform
Integrated Spark Analytics
Spark Master High Availability
• High Availability Spark Master with automatic leader election
• Detects when the Spark Master is down via gossip
• Uses Paxos to elect the Spark Master
• Stores Spark Worker metadata in Cassandra
• No need to run Zookeeper
DSE Analytics with Spark
Integration Benefits
• Integration of Analytics and Search
• Spark Job Server
• SparkSQL and HiveQL access to Cassandra Data
• Streaming Resiliency with Kafka Direct API via Cassandra File System
DSE 4.8 Analytics + Search
• Allows Analytics Jobs to use Solr Queries
• Allows searching for data across partitions
val table = sc.cassandraTable("music", "albums")
val result = table.select("id", "artist_name")
  .where("solr_query = 'artist_name:Miles*'")
  .collect
Network Traffic Analysis Architecture
• Each data center runs DSE Analytics for both streaming analysis and batch analysis, fed by Spark Streaming from Kafka
• Data Center 1 - US East runs the active Kafka; Data Center 2 - US West runs a passive Kafka
• Cassandra data center replication keeps the two data centers in sync
Common Use Cases
• Personalization
• Banking Fraud Detection
• Website Click Stream Analysis
• Login Monitoring
Spark Streaming Demo
Spark Notebook
Architecture: multiple Notebooks connect to a Spark Notebook Server, which runs against a Cassandra Cluster (Cassandra and Analytics nodes) with the Spark Connector.
Apache Spark Notebook
•Reactive / Dynamic Graphs based on Scala, SQL and
DataFrames
•Spark Streaming
• Example notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis
•SVG / Sliders - interactive graphs
•Tune and Configure Each Notebook Separately
•https://github.com/andypetrella/spark-notebook
Demo of Streaming in the Real World -
Spark At Scale Project
•Based on Real World Use Cases
•Simulate a real world streaming use case
•Test throughput of Spark Streaming
•Best Practices for scaling
•https://github.com/retroryan/SparkAtScale
Spark At Scale
Demo Application
Demo architecture: the DataStax Enterprise Platform integrated with a Web Service and Legacy Systems.
Best Practices for Spark Streaming
Spark Streaming with Kafka Direct Approach
•Use Kafka Direct Approach (No Receivers)
•Queries Kafka Directly
•Automatically Parallelizes based on Kafka Partitions
•Exactly Once Processing - Only Move Offset after
Processing
•Resiliency without copying data
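The key exactly-once idea above — only move the offset after processing — can be sketched in plain Scala. This is an illustration of the commit discipline, not the Kafka or Spark API; `processBatch` and the in-memory log are stand-ins:

```scala
// Sketch of "commit the offset only AFTER processing" (illustration only).
var committedOffset = 0L
val log = Vector("a", "b", "c", "d") // stand-in for a Kafka partition
val processed = scala.collection.mutable.Buffer.empty[String]

def processBatch(upTo: Long): Unit = {
  val batch = log.slice(committedOffset.toInt, upTo.toInt)
  batch.foreach(processed += _) // process (e.g. write results to Cassandra)
  committedOffset = upTo        // advance the offset only after processing succeeds
}

processBatch(2)
// If a failure happened before the commit, the batch would simply be
// re-read from committedOffset and reprocessed — no data copied, no data lost.
processBatch(4)
```

This is why the direct approach needs no receivers or write-ahead logs: Kafka itself is the replayable source of record.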
Spark Streaming Deployment
• Don't build fat jars!
• Use spark-submit --packages to specify dependencies, Maven-style
• Test submit options to match load:
• --executor-memory 4G
• --total-executor-cores 15
How do we Scale for Load and Traffic?
Spark Streaming Monitoring
If Processing Time > Batch Duration, the total delay grows, eventually leading to Out Of Memory errors.
Data Modeling using Event Sourcing
•Append-Only Logging
•Database of Facts
•Snapshots or Roll-Ups
•Why Delete Data any more?
•Replay Events
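A minimal plain-Scala sketch of the event-sourcing idea: an append-only log of facts, current state rebuilt by replaying events, and a snapshot (roll-up) so replay need not start from the beginning. The event types `Deposited`/`Withdrawn` are illustrative assumptions, not from the deck.

```scala
// Event sourcing sketch (illustration only): the log is the database of facts.
sealed trait Event
case class Deposited(amount: Int) extends Event
case class Withdrawn(amount: Int) extends Event

// Append-only: events are never updated or deleted
val log = Vector(Deposited(100), Withdrawn(30), Deposited(50))

// Replay the events to rebuild current state
def replay(events: Seq[Event], start: Int = 0): Int =
  events.foldLeft(start) {
    case (bal, Deposited(a)) => bal + a
    case (bal, Withdrawn(a)) => bal - a
  }

val balance = replay(log)

// A snapshot rolls up a prefix of the log; replay resumes from it
val snapshot     = replay(log.take(2))
val fromSnapshot = replay(log.drop(2), snapshot)
```

Replaying from a snapshot yields the same state as replaying the whole log, which is what makes roll-ups safe and deletion unnecessary.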
Spark SQL and DataFrames
Why Spark SQL?
• Creating and Running Spark Programs Faster
• Write less code
• Read less data
• Let the optimizer do the hard work
• Spark SQL Catalyst optimizer
DataFrame
• Distributed collection of data
• Similar to a Table in a RDBMS
• Common API for reading/writing data
• API for selecting, filtering, aggregating and plotting structured data
DataFrame Part 2
• Sources such as Cassandra, structured data files, tables in Hive, external databases, or existing RDDs
• Optimization and code generation through the Spark SQL Catalyst optimizer
• Decorator around RDD
• Previously SchemaRDD
Write Less Code: Input & Output
• Unified interface to reading/writing data in a variety of formats
• Spark Notebook Example
Configuring Kafka for Scaling
Key to Scaling - Configuring Kafka Topics
• Number of Partitions per Topic - Degree of parallelism
• Directly Affects Spark Streaming Parallelism
• bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic ratings
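Why partition count bounds parallelism can be sketched in plain Scala; the hash-mod assignment below mirrors the spirit of Kafka's default key partitioner (illustration only, not the Kafka API):

```scala
// Records with the same key land in the same partition; with 5 partitions,
// at most 5 consumers/tasks can make progress in parallel.
val numPartitions = 5
val keys = Seq("user-1", "user-2", "user-3", "user-4", "user-5", "user-6")

val assignment: Map[Int, Seq[String]] =
  keys.groupBy(k => math.abs(k.hashCode % numPartitions))

// the upper bound on parallel consumption
val parallelism = assignment.size
```

With the direct approach, Spark Streaming creates one task per Kafka partition, so adding partitions is the lever for adding streaming parallelism.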
Populating Kafka Topics
val record = new ProducerRecord[String, String](
  feederExtension.kafkaTopic, partNum, key, nxtRating.toString)

val future = feederExtension.producer.send(record, new Callback {
  ...
})
Demo: Twitter Streaming Language Classifier
Pipeline: Twitter API → Streaming: collect tweets → HDFS: dataset → Spark SQL: ETL, queries → Spark: featurize → MLlib: train classifier → HDFS: model; then Streaming: score tweets → language filter → Cassandra
Demo: Twitter Streaming Language Classifier
From tweets to ML features, approximated as sparse vectors:
1. Extract text from the tweet (https://twitter.com/andy_bf/status/…): "Ceci n'est pas un tweet"
2. Sequence the text as bigrams: tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. Convert bigrams into hash codes: seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. Index into sparse tf: seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. Increment feature counts: Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
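The five steps above can be run as plain Scala. This sketch approximates the MLlib `Vectors.sparse` result with a Map from feature index to term frequency, and adds `math.abs` since `hashCode % 1000` can be negative in Scala — both are illustrative choices, not the demo's exact code.

```scala
// Featurize a tweet into an (approximate) sparse tf vector.
val tweet = "Ceci n'est pas un tweet" // step 1: the extracted text

// step 2: sequence the text as bigrams
val bigrams = tweet.sliding(2).toSeq // ("Ce", "ec", "ci", ...)

// step 3: convert bigrams into hash codes
val hashes = bigrams.map(_.hashCode)

// step 4: index into a sparse term-frequency space of size 1000
val indices = hashes.map(h => math.abs(h % 1000)) // abs guards against negative hashCodes

// step 5: increment each feature, normalized to frequencies
val n = indices.size.toDouble
val sparseTf: Map[Int, Double] =
  indices.groupBy(identity).map { case (i, hits) => i -> hits.size / n }
```

Hashing bigrams into a fixed-size space (the "hashing trick") keeps the feature vector small and fixed-width regardless of how many distinct bigrams the stream produces.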
KMeans: Formal Definition (ignore this)
KMeans: How it really works…
Demo: Twitter Streaming Language Classifier
Sample Code + Output:

gist.github.com/ceteri/835565935da932cb59a2
val sc = new SparkContext(new SparkConf())
val ssc = new StreamingContext(sc, Seconds(5))

val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

// load the pre-trained KMeans model saved as an object file
val model = new KMeansModel(
  ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())

// keep only tweets whose features fall in the target cluster (clust)
val filteredTweets = statuses
  .filter(t => model.predict(Utils.featurize(t)) == clust)
filteredTweets.print()

ssc.start()
ssc.awaitTermination()
CLUSTER 1:
TLあんまり⾒見ないけど
@くれたっら
いつでもくっるよ٩(δωδ)۶
そういえばディスガイアも今⽇日か
CLUSTER 4:
‫صدام‬ ‫بعد‬ ‫روحت‬ ‫العروبه‬ ‫قالوا‬
‫العروبه‬ ‫تحيى‬ ‫سلمان‬ ‫مع‬ ‫واقول‬
RT @vip588: √ ‫مي‬ ‫فولو‬ √ ‫متابعني‬ ‫زيادة‬ √ ‫االن‬ ‫للمتواجدين‬ vip588
√ ‫ما‬ ‫يلتزم‬ ‫ما‬ ‫اللي‬ √ ‫رتويت‬ ‫عمل‬ ‫للي‬ ‫فولو‬ √ ‫للتغريدة‬ ‫رتويت‬ √ ‫باك‬ ‫فولو‬
‫بيستفيد‬ …
‫سورة‬ ‫ن‬
Thank you
