How Spark is Enabling the New Wave of Converged Applications

© 2016 MapR Technologies
Balaji Mohanam and Carol McDonald
September, 2016

Today’s Presenters
Carol McDonald
Solutions Architect
Balaji Mohanam
Product Manager

Agenda
• Market Trends
• What’s Needed for Converged Applications
• Customer Use Cases
• Demo of MapR Streams with Spark Streaming

Analytics & ETL: Batch or Streaming?
[Chart axes: Value, Time]

Analytic Categories
Data-At-Rest, Data-In-Motion, Future
Descriptive
• What happened
• Why did it happen
• Discovery in nature
• Batch analytics
Predictive
• What will happen
• Combines historical data with rules and algorithms
• ML (Batch + Real Time)
Streaming
• Volume, velocity and variety
• Agility is key to success
• Analyse data as it happens
• Triggers and alarms
• Anomaly detection
• Continuous ETL and analytics
Prescriptive
• What + When + Why
• Suggestions to take advantage of future opportunity or mitigate risks

Decreasing Job Latencies
Hours Mins Secs Milli Secs
Data persistence
on-disk
Data persistence
in-memory

Why Stream Processing?
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyze: It was hot at 6:05 yesterday! (90°)
Batch processing may be too late for some events

Why Stream Processing?
Stream: 6:05 P.M.: 90°
Topic: Temperature
Turn on the air conditioning!
It’s becoming important to process events as they arrive
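
The per-event reaction on this slide can be sketched in a few lines of plain Java. This is a minimal illustration, not the presenters' code: the class name `TemperatureAlert`, its methods, and the 88° threshold are all assumptions made for the example. Only the readings come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: react to each reading as it arrives instead of
// analyzing the whole batch the next day.
class TemperatureAlert {
    // Assumed threshold; the slide does not state one.
    static final double THRESHOLD_F = 88.0;

    // Decision made once per event, at arrival time.
    static boolean shouldCool(double degrees) {
        return degrees >= THRESHOLD_F;
    }

    // Replays a sequence of readings and returns the index of every
    // reading that would have triggered the air conditioning.
    static List<Integer> alertIndexes(double[] readings) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < readings.length; i++) {
            if (shouldCool(readings[i])) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Readings from 6:01 P.M. to 6:08 P.M. as shown on the slide.
        double[] slide = {72, 75, 77, 85, 90, 85, 77, 75};
        System.out.println(alertIndexes(slide)); // prints [4], the 6:05 P.M. reading
    }
}
```

In a streaming job, `shouldCool` would be called from the consumer loop for each message, so the action happens at 6:05 rather than after the batch run.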

What’s Needed for Converged Applications

The Trinity of Real Time
• Global Messaging System (Real Time Producers, Topic 1, Topic 2)
• NoSQL Key-Value Database (Real Time Operational Analytics)
• Transformational Tier (Spark + MapR Streams Integration, Spark + MapR-DB Integration)

MapR Converged Data Platform
Applications: Open Source Engines & Tools, Commercial Engines & Applications, Cloud and Managed Services, Custom Apps, Search and Others
Data Processing
APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
Web-Scale Storage: MapR-FS
Database: MapR-DB
Event Streaming: MapR Streams
Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
Unified Management and Monitoring

Use Case: Time Series Data in Oil Wells
Sensor time-stamped data → Stream Topic → Spark Streaming → Data for real-time monitoring → (read) Spark processing

What Do We Need to Do?
Data Sources → Collect Data → Process Data → Store Data → Serve Data
? ? ? ?

Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
Partition 1: Topic - Pressure
Partition 1: Topic - Temperature
Partition 1: Topic - Warning
Consumers
Consumers
Consumers
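
One common way to spread a topic's messages across partitions is to hash the message key. The sketch below assumes plain hash-modulo assignment. The class `PartitionChooser` is hypothetical, and real clients may use a different hash function, so treat it as an illustration of the idea rather than the MapR Streams implementation.

```java
// Hypothetical sketch of key-based partition assignment.
class PartitionChooser {
    // Assumption: hash-modulo assignment. floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Messages with the same key always land in the same partition,
        // so their relative order is preserved within that partition.
        System.out.println(partitionFor("COHUTTA", 3) == partitionFor("COHUTTA", 3)); // prints true
    }
}
```

Adding partitions lets more consumers read in parallel, which is where the throughput and scalability on the slide come from.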

Continuous Analytics:
Structured Streaming with Spark 2.0
// Pre-release API as shown on the slide; released Spark 2.0 uses readStream / writeStream
val records = sqlContext.read.format("json").stream("hdfs://input")
val counts = records.groupBy("user").count()
counts.write
  .trigger(ProcessingTime("5sec"))
  .outputMode(UpdateInPlace("user"))
  .format("jdbc")
  .startStream("mysql://...")
Repeated Queries
DB
User     Count
User 1   10
User 2   23
User 3   16
…        …
• Store only the processed output instead of every single record.
• Query executed repeatedly as and when the data arrives.
• Read the result from persistent storage, instead of processing the entire data set, resulting in faster access.

Spark 2.0: Structured Streaming with Spark SQL
Processing Time: 1, 2, 3
Input Table: data up to proc. time 1, data up to proc. time 2, data up to proc. time 3
Result Table: output for data at 1, output for data at 2, output for data at 3
Program Output: Complete output OR Delta output
Output Modes
Complete: For each run of the query, create a complete snapshot of the query result.
Delta: Writes the records from the query result that changed since the last firing of the trigger. These are physical deltas and not logical deltas. That is to say, they specify what rows were added and removed, but not the logical difference for some row.
Append: A special case of the Delta mode that does not include removals.
Update (in place): Update the result directly in place (e.g. update a MySQL table). Similar to Delta, a primary key must be specified.
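
To make the difference between a complete snapshot and a change-only output concrete, here is a small stand-alone Java sketch. It is not Spark code. The class `OutputModes` and its methods are hypothetical, and the change-only output is keyed by user, so it is closer to the in-place update mode than to the physical add/remove deltas described on the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a continuously updated result table (user -> count).
class OutputModes {
    private final Map<String, Long> result = new TreeMap<>();
    private final Map<String, Long> lastEmitted = new HashMap<>();

    // A new input record arrives for this user.
    void add(String user) {
        result.merge(user, 1L, Long::sum);
    }

    // Complete mode: a full snapshot of the result at this trigger.
    Map<String, Long> complete() {
        return new TreeMap<>(result);
    }

    // Change-only mode: just the rows whose value differs from what was
    // emitted at the previous trigger, keyed by user.
    Map<String, Long> changedSinceLastTrigger() {
        Map<String, Long> changed = new TreeMap<>();
        for (Map.Entry<String, Long> e : result.entrySet()) {
            if (!e.getValue().equals(lastEmitted.get(e.getKey()))) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        lastEmitted.clear();
        lastEmitted.putAll(result);
        return changed;
    }
}
```

With records for `u1` and `u2` before the first trigger and one more `u1` before the second, the change-only output at the second trigger contains only `u1`, while the complete output still lists both users.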

Data Sources → Collect Data (Stream Topic) → Process Data → Store Data → Serve Data

Spark Scheduling Bottleneck
Users: User 1, User 2, User 3, … User n
SparkContext: Query Compilation, Storage, Scheduling
Workers: Worker 1, Worker 2, Worker 3, Worker 4, … Worker n

Latency vs. Concurrency
Type                       Latency    Concurrency
Batch/RTS Analytics        Very Low   Low
Interactive Applications   Very Low   High/Very High

MapR-DB (HBase API) is Designed to Scale
Fast Reads and Writes by Key
Data is automatically partitioned by Key Range
Key Range: xxxx to xxxx (three key ranges shown, each holding its own rows)
Key    Col B   Col C
val    val     val
xxx    val     val
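
Routing a row to its key range can be sketched with a sorted map, as below. This is an illustrative assumption about how a range lookup might work, not MapR-DB internals. The class `KeyRangeRouter`, the start keys, and the partition names are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: find the partition whose key range covers a row key.
class KeyRangeRouter {
    // Start key of each range -> partition name. A range runs from its
    // start key up to (but not including) the next start key.
    private final TreeMap<String, String> ranges = new TreeMap<>();

    void addRange(String startKey, String partition) {
        ranges.put(startKey, partition);
    }

    // floorEntry returns the range with the greatest start key <= rowKey.
    String partitionFor(String rowKey) {
        Map.Entry<String, String> e = ranges.floorEntry(rowKey);
        if (e == null) {
            throw new IllegalArgumentException("no range covers " + rowKey);
        }
        return e.getValue();
    }
}
```

Because the lookup is a single ordered search on the key, reads and writes by key stay fast as more ranges are added.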

What Exactly Do We Need to Do?
Data Sources → Collect Data (Stream Topic) → Process Data → Store Data → Serve Data

Customer Use Cases

Customer 360 & Behavior Prediction
Data sources: Website Click-Stream, Internal Data Sources, External Data Sources, Support Tickets, DBMS, Email, CRM
Ingest: Topics (four shown)
Real Time/Offline ClickStream Analysis (Offline, Real Time), EDH/EDL
• Prediction Modelling
• Attribution Modelling
• Cohort Analysis
• Customer Lifetime Value Analysis
• Attrition Modelling
• Response Modelling
• Churn Modelling
Platform: HA, DR, NFS, Snapshots, Data Protection
Outcomes: 360 Degree Customer View, Customer Behavior Prediction, Better Conversion Rate and Lower Attrition $$$
• Eliminate latency due to data movement between clusters
• Eliminate redundant storage with MapR Streams and lower the TCO

Prescriptive Analytics: IoT & Auto Manufacturing
Sources: GPS, Telematic Data, Telephone, Truck Fleet
Data generated from cars are stored locally
Data Modelling/Secondary ETL: Data is converted from proprietary to Parquet format
• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.
• Image and video analysis
• Time series analysis for threshold breach
Topics (four shown)

Interactive Analytics: Risk Analysis (Internal Users)
Type of Users: Internal (User 1, User 2, User 3)
Analytic Application to submit queries with simple to medium analytic query complexity
0-10 days old data cached in memory: 50-100 GB of data.
Data older than 10 days accessed from disk
Concurrent requests: 3-10
Throughput: 1.5 requests per second
Latency: < 2 seconds
Representative Queries
• List of users who have spent more than $1000 in last 3 days.
• Group users by country who spent more than $1000.
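
The two representative queries can be expressed over an in-memory list as a rough illustration of their shape. This is a sketch under assumptions: the `Purchase` fields, the class `RiskQueries`, and the reading of "spent" as a per-user total are all hypothetical, and a real deployment would run the equivalent SQL against the cached data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the two representative queries.
class RiskQueries {
    static final class Purchase {
        final String user;
        final String country;
        final double amount;
        final int daysAgo;

        Purchase(String user, String country, double amount, int daysAgo) {
            this.user = user;
            this.country = country;
            this.amount = amount;
            this.daysAgo = daysAgo;
        }
    }

    // Total spend per user over purchases at most maxDaysAgo days old.
    static Map<String, Double> spendByUser(List<Purchase> ps, int maxDaysAgo) {
        Map<String, Double> totals = new TreeMap<>();
        for (Purchase p : ps) {
            if (p.daysAgo <= maxDaysAgo) {
                totals.merge(p.user, p.amount, Double::sum);
            }
        }
        return totals;
    }

    // Query 1: users whose total spend in the window exceeds minTotal.
    static List<String> bigSpenders(List<Purchase> ps, int maxDaysAgo, double minTotal) {
        List<String> users = new ArrayList<>();
        for (Map.Entry<String, Double> e : spendByUser(ps, maxDaysAgo).entrySet()) {
            if (e.getValue() > minTotal) {
                users.add(e.getKey());
            }
        }
        return users;
    }

    // Query 2: the same users, grouped by country.
    static Map<String, Set<String>> bigSpendersByCountry(List<Purchase> ps, int maxDaysAgo, double minTotal) {
        Set<String> big = new TreeSet<>(bigSpenders(ps, maxDaysAgo, minTotal));
        Map<String, Set<String>> byCountry = new TreeMap<>();
        for (Purchase p : ps) {
            if (big.contains(p.user)) {
                byCountry.computeIfAbsent(p.country, c -> new TreeSet<>()).add(p.user);
            }
        }
        return byCountry;
    }
}
```

Both queries are a filter followed by a group-by, which is the "simple to medium" complexity the slide describes.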

Interactive Analytics: External Customer Facing Application
On-Demand
Pre-Computed
Sales Incentive Data
• 60 events/sec
• 10 MB/event
• Table-based topics
Fast Changing Data (Ex: Credit date)
Append only (50% of events)
Search Application
Stale Data. Aggregates calculated using Snapshots.
• Level 1 Aggregates
• Level 2 Aggregates
• Level 3 Aggregates
• Delta Aggregates
Advanced ML Analytics
Pre-compute analytics with Spark Streaming on Data-in-motion
Topics (four shown) → DB

Demo

What if BP had detected problems before the oil hit the water?
1M samples/sec
High performance at
scale is necessary!

Use Case: Time Series Data
Sensor time-stamped data → Stream Topic
COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94
COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
Data: PumpId, Date, Time, pressure and flow measurements

Schema
• All events stored; CF data could be set to expire data
• Filtered alerts put in CF alerts
• Daily summaries put in CF stats
Row key contains oil pump name, date, and a time stamp

Row key                 CF data          CF alerts   CF stats
                        hz … psi         psi …       hz_avg … psi_min
COHUTTA_3/10/14_1:01    10.37 … 84       0
COHUTTA_3/10/14                                      10 … 0
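
The composite row keys shown in the schema can be built with simple string concatenation. The helper below is a hypothetical sketch (the class `RowKeys` and its method names are not from the deck); it only mirrors the `pump_date_time` and `pump_date` layouts visible in the example rows.

```java
// Hypothetical sketch of the composite row keys used in the schema.
class RowKeys {
    // Key for a single event row, e.g. COHUTTA_3/10/14_1:01
    static String eventKey(String pump, String date, String time) {
        return pump + "_" + date + "_" + time;
    }

    // Key for the daily summary row, e.g. COHUTTA_3/10/14
    static String dailyKey(String pump, String date) {
        return pump + "_" + date;
    }

    // An event belongs to a daily summary when its key extends the daily key.
    static boolean belongsToDay(String eventKey, String dailyKey) {
        return eventKey.startsWith(dailyKey + "_");
    }

    public static void main(String[] args) {
        System.out.println(eventKey("COHUTTA", "3/10/14", "1:01")); // prints COHUTTA_3/10/14_1:01
    }
}
```

Because rows are stored sorted by key, putting the pump name first keeps one pump's readings next to each other, so a scan over a key prefix returns them together.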

Data Sources → Collect Data (Stream Topic) → Process Data → Store Data → Serve Data

Use Case Example Code
Sensor time-stamped data → Stream Topic → Spark Streaming → Data for real-time monitoring → (read) Spark processing

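KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

String topic = "/streams/pump:warning";
// 1 configure KafkaProducer properties (keys and values are sent as strings)
Properties properties = new Properties();
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// 2 create KafkaProducer with properties
KafkaProducer<String, String> kafkaProducer =
    new KafkaProducer<String, String>(properties);
String txt = "msg text";
// 3 create a producer record with topic and message
ProducerRecord<String, String> record =
    new ProducerRecord<String, String>(topic, txt);
// 4 use the Kafka producer to send the record
kafkaProducer.send(record);
// flush buffered records and release resources when done
kafkaProducer.close();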

Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
// create an input stream for a set of topics
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
dStream batches: time 0 to 1, time 1 to 2, time 2 to 3 (each batch stored in memory as an RDD)

Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
hz: Double, disp: Double, flo: Double, sedPPM: Double,
psi: Double, chlPPM: Double)
// Parse CSV Strings into Sensor objects
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
p(6).toDouble, p(7).toDouble, p(8).toDouble)
}

Process DStream
// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)
dStream RDDs (batches: time 0 to 1, time 1 to 2, time 2 to 3) → map → sensorDStream RDDs
New RDDs created for every batch

DataFrame and SQL Operations
// for each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // needed for rdd.toDF()
  // convert the RDD to a DataFrame and register it as a temporary table
  rdd.toDF().registerTempTable("sensor")
  // get the max, min and avg of the pump values
  val res = sqlContext.sql(
    """SELECT resid, date,
         max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
         max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
         max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
         max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
       FROM sensor GROUP BY resid, date""")
  res.show()
}
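
To show what the GROUP BY query computes without a Spark cluster, the same max/min/avg per pump and date can be sketched in plain Java. This is an illustration only: the class `PumpStats` is hypothetical, and the column index follows the field order of the `Sensor` case class (index 7 is `psi`).

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: per (resid, date) statistics for one CSV column.
class PumpStats {
    // Groups CSV lines by "resid,date" and summarizes the given column.
    static Map<String, DoubleSummaryStatistics> summarize(String[] lines, int column) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : lines) {
            String[] p = line.split(",");
            String group = p[0] + "," + p[1];
            stats.computeIfAbsent(group, g -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[column]));
        }
        return stats;
    }

    public static void main(String[] args) {
        // The three sample messages from the time series slide.
        String[] lines = {
            "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94",
            "COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66",
            "COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79"
        };
        DoubleSummaryStatistics psi = summarize(lines, 7).get("COHUTTA,3/10/14");
        System.out.println(psi.getMin() + " " + psi.getMax() + " " + psi.getAverage()); // prints 85.0 92.0 88.0
    }
}
```

The Spark SQL version does the same grouping, but once per batch RDD and for several columns at a time.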

Streaming Application Output

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
linesRDD DStream (batches: time 0 to 1, time 1 to 2, time 2 to 3) → map → sensorRDD DStream → save → Put objects written to HBase
Output operation: persist data to external storage

Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Stream Processing
Building a Complete Data Architecture
MapR File System
(MapR-FS)
MapR Converged Data Platform
MapR Database
(MapR-DB)
MapR Streams
Sources/Apps Bulk Processing

Q&A
Engage with us!
1. Read the explanation and download the code
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform
https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community
https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training
https://learn.mapr.com
