© 2016 MapR Technologies
How Spark is Enabling
the New Wave of Converged Applications
Balaji Mohanam and Carol McDonald
September, 2016
Today’s Presenters
Carol McDonald
Solutions Architect
Balaji Mohanam
Product Manager
Agenda
• Market Trends
• What’s Needed for Converged Applications
• Customer Use Cases
• Demo of MapR Streams with Spark Streaming
Analytics & ETL: Batch or Streaming?
[Chart: value versus time]
Analytic Categories
Descriptive Predictive Streaming Prescriptive
Data-At-Rest Data-In-Motion Future
• What happened
• Why did it happen
• Discovery in nature
• Batch analytics
• What will happen
• Combines historical data with rules and algorithms
• ML (Batch + Real Time)
• What + When + Why
• Suggestions to take advantage of future opportunity or mitigate risks
• Volume, velocity and variety
• Agility is key to success.
• Analyse data as it happens
• Triggers and Alarms.
• Anomaly detection
• Continuous ETL and analytics
Decreasing Job Latencies
Hours, Mins, Secs, Millisecs
Data persistence on-disk
Data persistence in-memory
Why Stream Processing?
It was hot at 6:05 yesterday!
[Diagram: batch analysis of the readings below finds the 90° peak after the fact]
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Batch processing may be too late for some events
Why Stream Processing?
Stream topic: Temperature
6:05 P.M.: 90°
Turn on the air conditioning!
It’s becoming important to process events as they arrive
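The difference between the two "Why Stream Processing?" slides can be sketched in plain Java: instead of scanning the readings afterwards, a handler runs once per event and reacts as soon as a reading crosses a threshold. This is only a minimal sketch; the class name `TemperatureStream`, the `Reading` type, and the 88-degree threshold are assumptions made for illustration and do not come from the presentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: react to each temperature event as it arrives.
class TemperatureStream {
    // Assumed alert threshold in degrees; not taken from the slides.
    static final int THRESHOLD = 88;

    static final class Reading {
        final String time;
        final int degrees;

        Reading(String time, int degrees) {
            this.time = time;
            this.degrees = degrees;
        }
    }

    // Called once per event; returns an action only when the threshold is reached.
    static String onEvent(Reading r) {
        if (r.degrees >= THRESHOLD) {
            return r.time + ": " + r.degrees + " degrees -> turn on the air conditioning";
        }
        return null;
    }

    // Feeds readings through onEvent in arrival order and collects the actions.
    static List<String> process(List<Reading> readings) {
        List<String> actions = new ArrayList<>();
        for (Reading r : readings) {
            String action = onEvent(r);
            if (action != null) {
                actions.add(action);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("6:04 P.M.", 85),
                new Reading("6:05 P.M.", 90),
                new Reading("6:06 P.M.", 85));
        for (String action : process(readings)) {
            System.out.println(action);
        }
    }
}
```

In a real deployment the loop in `process` would be replaced by a consumer reading from the stream topic; the per-event handler is the part that matters here.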
What’s Needed for Converged Applications
The Trinity of Real Time
Real Time Producers
Global Messaging System: Topic 1, Topic 2
NoSQL Key Value Database
Transformational Tier
Spark + MapR Streams Integration
Spark + MapR DB Integration
Real Time Operational Analytics
MapR Converged Data Platform
Open Source Engines & Tools; Commercial Engines & Applications; Cloud and Managed Services; Custom Apps; Search and Others
Data Processing
Unified Management and Monitoring
Web-Scale Storage (MapR-FS); Database (MapR-DB); Event Streaming (MapR Streams)
APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
Enterprise-Grade Platform Services: Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability
Use Case: Time Series Data in Oil Wells
[Diagram: sensor time-stamped data -> stream topic -> read by Spark Streaming -> Spark processing -> data for real-time monitoring]
What Do We Need to Do?
Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data
? ? ? ?
Scalable Messaging with MapR Streams
Topics are partitioned for throughput and scalability
Partition 1: Topic - Pressure
Partition 1: Topic - Temperature
Partition 1: Topic - Warning
Partition 2: Topic - Pressure
Partition 2: Topic - Temperature
Partition 2: Topic - Warning
Partition 3: Topic - Pressure
Partition 3: Topic - Temperature
Partition 3: Topic - Warning
Consumers
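How a message ends up in one of the three partitions can be illustrated with a key-hash scheme. The sketch below assumes messages carry a key (here a pump id) and uses `String.hashCode()`; it is not the partitioner used by MapR Streams or any Kafka client, and the class name `TopicPartitioner` and the second pump id are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of key-hash partitioning; not the MapR Streams or Kafka client code.
class TopicPartitioner {
    // Maps a message key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // the slide shows three partitions per topic
        List<String> pumpIds = Arrays.asList("COHUTTA", "PUMP-B"); // "PUMP-B" is made up
        for (String pumpId : pumpIds) {
            System.out.println(pumpId + " -> partition " + partitionFor(pumpId, numPartitions));
        }
    }
}
```

Because the same key always maps to the same partition, all readings from one pump stay in order within that partition, while different pumps can be consumed in parallel.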
Continuous Analytics:
Structured Streaming with Spark 2.0
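// Pre-release Structured Streaming API as shown on the slide; released Spark 2.0 uses readStream, writeStream and start()
val records = sqlContext.read.format("json").stream("hdfs://input")
val counts = records.groupBy("user").count()
counts.write
  .trigger(ProcessingTime("5sec"))
  .outputMode(UpdateInPlace("user"))
  .format("jdbc")
  .startStream("mysql://...")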
Repeated Queries
DB
User Count
User 1 10
User 2 23
User 3 16
…….. ……..
Store only the processed output instead of every single record.
• Query executed repeatedly as and when the data arrives.
• Read the result from persistent storage, instead of processing the entire data set, resulting in faster access.
Spark 2.0: Structured Streaming with Spark SQL
[Diagram: at each processing time (1, 2, 3) the input table holds the data up to that time; the program's query produces a result table, which is written as complete output or delta output]
Output Modes
Delta: writes the records of the query result that changed since the last firing of the trigger. These are physical deltas, not logical deltas: they specify which rows were added and removed, but not the logical difference for a row.
Append: a special case of Delta mode that does not include removals.
Update (in place): updates the result directly in place (e.g. updates a MySQL table). As with Delta, a primary key must be specified.
Complete: for each run of the query, creates a complete snapshot of the query result.
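The difference between the Complete, Delta and Append modes can be made concrete with a small plain-Java model of the result table. This is only a sketch of the definitions above, not Spark code: the class `OutputModes`, the `user=count` row format and the `+`/`-` prefixes are assumptions for illustration, and Update (in place) is left out because it depends on the target store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the Complete, Delta and Append output modes.
class OutputModes {
    // Complete: a full snapshot of the current result table.
    static List<String> complete(Map<String, Long> current) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    // Delta: physical row changes since the last trigger. A changed row appears
    // as a removal ("-") of the old row plus an addition ("+") of the new row.
    static List<String> delta(Map<String, Long> previous, Map<String, Long> current) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : previous.entrySet()) {
            if (!e.getValue().equals(current.get(e.getKey()))) {
                out.add("-" + e.getKey() + "=" + e.getValue());
            }
        }
        for (Map.Entry<String, Long> e : current.entrySet()) {
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                out.add("+" + e.getKey() + "=" + e.getValue());
            }
        }
        return out;
    }

    // Append: the delta without its removals.
    static List<String> append(Map<String, Long> previous, Map<String, Long> current) {
        List<String> out = new ArrayList<>();
        for (String row : delta(previous, current)) {
            if (row.startsWith("+")) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> previous = new TreeMap<>();
        previous.put("User 1", 10L);
        previous.put("User 2", 23L);
        Map<String, Long> current = new TreeMap<>(previous);
        current.put("User 1", 12L); // count changed since the last trigger
        current.put("User 3", 16L); // new row
        System.out.println("complete: " + complete(current));
        System.out.println("delta:    " + delta(previous, current));
        System.out.println("append:   " + append(previous, current));
    }
}
```

The changed count for User 1 shows why the deltas are called physical: the output carries the removed old row and the added new row, not a "+2" difference.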
What Do We Need to Do?
Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Spark Scheduling Bottleneck
[Diagram: User 1 through User n submit work to a single SparkContext (query compilation, storage, scheduling), which dispatches to Worker 1 through Worker n]
Latency vs. Concurrency
Type                       Latency     Concurrency
Batch/RTS Analytics        Very Low    Low
Interactive Applications   Very Low    High/Very High
MapR-DB (HBase API) is Designed to Scale
Fast Reads and Writes by Key
Data is automatically partitioned by Key Range
[Diagram: three table partitions, each covering one key range (xxxx to xxxx) with columns Key, Col B, Col C]
What Exactly Do We Need to Do?
Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Customer Use Cases
Customer 360 & Behavior Prediction
Data sources: Website Click-Stream, Internal Data Sources, External Data Sources, Support Tickets, DBMS, Email, CRM
Real Time/Offline ClickStream Analysis
• Prediction Modelling
• Attribution Modelling
• Cohort Analysis
• Customer Lifetime Value Analysis
• Attrition Modelling
• Response Modelling
• Churn Modelling
Eliminate latency due to data movement between clusters
Eliminate redundant storage with MapR Streams and lower the TCO
360 Degree Customer View
Customer Behavior Prediction
Better Conversion Rate and Lower Attrition $$$
Offline, Real Time
HA, DR, NFS, Snapshots, Data Protection
EDH/EDL
[Diagram: four stream topics]
Prescriptive Analytics: IoT & Auto Manufacturing
Data sources: GPS, Telematic Data, Telephone, Truck Fleet
Data generated from cars are stored locally
Data Modelling/Secondary ETL: Data is converted from proprietary to Parquet format
• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.?
• Image and video analysis
• Time series analysis for threshold breach
[Diagram: four stream topics]
Interactive Analytics: Risk Analysis (Internal Users)
0-10 days old data cached in memory: 50-100 GB of data
Data older than 10 days accessed from disk
Analytic Application to submit queries with simple to medium analytic query complexity
Users: User 1, User 2, User 3
Type of Users: Internal
Concurrent requests: 3-10
Throughput: 1.5 requests per second
Latency: < 2 seconds
Representative Queries
• List of users who have spent more than $1000 in last 3 days.
• Group users by country who spent more than $1000.
Interactive Analytics: External Customer Facing Application
On-Demand, Pre-Computed
Sales Incentive Data
• 60 events/sec
• 10 MB/event
• Table-based topics
Fast Changing Data (Ex: Credit date)
Append only (50% of events)
Search Application
Stale Data. Aggregates calculated using Snapshots.
Level 1 Aggregates
Level 2 Aggregates
Level 3 Aggregates
Advanced ML Analytics
Delta Aggregates
Pre-compute analytics with Spark Streaming on Data-in-motion
[Diagram: four stream topics and a DB]
Demo
What if BP had detected problems before the oil hit the water?
1M samples/sec
High performance at scale is necessary!
Use Case: Time Series Data
[Diagram: sensor time-stamped data -> stream topic -> read by Spark Streaming -> Spark processing -> data for real-time monitoring]
Use Case: Time Series Data
Sensor
time-stamped data
Stream
Topic
COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94
COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
Data: PumpId, Date, Time, pressure and flow measurements
Schema
• All events are stored; the data column family (CF) can be set to expire data
• Filtered alerts are put in CF alerts
• Daily summaries are put in CF stats
Row key                | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01   | 10.37 … 84        | 0                |
COHUTTA_3/10/14        |                   |                  | 10 … 0
The row key contains the oil pump name, date, and a time stamp
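The two row-key shapes in the schema table can be built from the fields of a sensor CSV line. The sketch below only mirrors the layout shown on the slide; the class `RowKeys` and its method names are hypothetical and are not taken from the demo code.

```java
// Hypothetical helper mirroring the row-key layout on the schema slide.
class RowKeys {
    // Event row key: pump name, date and time, e.g. COHUTTA_3/10/14_1:01
    static String eventKey(String pumpId, String date, String time) {
        return pumpId + "_" + date + "_" + time;
    }

    // Daily summary row key: pump name and date, e.g. COHUTTA_3/10/14
    static String dailyKey(String pumpId, String date) {
        return pumpId + "_" + date;
    }

    // Builds the event row key from a CSV line whose first three fields
    // are pump id, date and time.
    static String eventKeyFromCsv(String line) {
        String[] p = line.split(",");
        return eventKey(p[0], p[1], p[2]);
    }

    public static void main(String[] args) {
        String line = "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94";
        System.out.println(eventKeyFromCsv(line));
        System.out.println(dailyKey("COHUTTA", "3/10/14"));
    }
}
```

Because every event key for a pump and day starts with the daily key, rows for the same pump and date share a common prefix, which is what makes range reads by key practical in a store sorted by row key.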
What Do We Need to Do?
Stages: Data Sources, Collect Data, Process Data, Store Data, Serve Data
Stream Topic
Use Case Example Code
[Diagram: sensor time-stamped data -> stream topic -> read by Spark Streaming -> Spark processing -> data for real-time monitoring]
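KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
// MapR Streams topic name: stream path, a colon, then the topic
String topic = "/streams/pump:warning";
public static KafkaProducer<String, String> kafkaProducer;
// 1. Configure KafkaProducer properties (key and value serializers are both required)
Properties properties = new Properties();
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// 2. Create the KafkaProducer with the properties
kafkaProducer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// 3. Create a producer record with the topic and message
ProducerRecord<String, String> record =
    new ProducerRecord<String, String>(topic, txt);
// 4. Use the Kafka producer to send the record
kafkaProducer.send(record);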
Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Create an input stream for a set of topics
val dStream = KafkaUtils.createDirectStream[String, String](
  ssc, kafkaParams, topicsSet)
[Diagram: dStream as a sequence of batches (time 0 to 1, 1 to 2, 2 to 3), each stored in memory as an RDD]
Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)
// Parse a CSV string into a Sensor object
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
Process DStream
// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)
[Diagram: map is applied to each dStream batch RDD (time 0 to 1, 1 to 2, 2 to 3), creating a new sensorDStream RDD for every batch]
DataFrame and SQL Operations
// For each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // needed for rdd.toDF()
  // Convert the RDD to a DataFrame and register it as a temp table
  rdd.toDF().registerTempTable("sensor")
  // Get the max, min and avg of the pump values
  val res = sqlContext.sql("""SELECT resid, date,
    max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
    max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
    max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
    max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
    FROM sensor GROUP BY resid, date""")
  res.show()
}
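What the GROUP BY query computes for one column can be checked without a cluster. The following plain-Java sketch groups the sample CSV rows from the time series slide by pump id and date and summarizes the hz column (the fourth field); it is an illustration of the aggregation only, not Spark code, and the class name `SensorStats` is an assumption.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: max/min/avg of hz per (resid, date), as in the SQL query.
class SensorStats {
    // Groups CSV sensor lines by "resid,date" and summarizes the hz column (index 3).
    static Map<String, DoubleSummaryStatistics> hzByPumpAndDate(List<String> lines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : lines) {
            String[] p = line.split(",");
            String group = p[0] + "," + p[1];
            stats.computeIfAbsent(group, k -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[3]));
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94",
                "COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66",
                "COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79");
        for (Map.Entry<String, DoubleSummaryStatistics> e : hzByPumpAndDate(lines).entrySet()) {
            DoubleSummaryStatistics s = e.getValue();
            System.out.printf("%s maxhz=%.2f minhz=%.2f avghz=%.4f%n",
                    e.getKey(), s.getMax(), s.getMin(), s.getAverage());
        }
    }
}
```

In the streaming job the same aggregation runs once per batch RDD, so each printed result covers only the records that arrived in that batch interval.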
Streaming Application Output
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch (time 0 to 1, 1 to 2, 2 to 3), map turns the linesRDD DStream into the sensorRDD DStream; save is an output operation that persists data to external storage, writing Put objects to HBase]
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Building a Complete Data Architecture
MapR Converged Data Platform: MapR File System (MapR-FS), MapR Database (MapR-DB), MapR Streams
Sources/Apps, Stream Processing, Bulk Processing
Q&A
Engage with us!
1. Read the explanation and download the code
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform
https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community
https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training
https://learn.mapr.com
