How Spark is Enabling the New Wave of Converged Cloud Applications

© 2016 MapR Technologies
Ankur Desai & Carol McDonald
December, 2016

Today’s Presenters
Carol McDonald
Solutions Architect
Ankur Desai
Sr Mgr, Platform & Products

Agenda
• Market Trends
• What’s Needed for Converged Streaming Applications
• Use Cases
• Demo of MapR Streams with Spark Streaming

A Single Platform: On-Prem, In the Cloud, or InterCloud
• Flexible processing where change is the norm
• Distributed processing across clusters, data centers, public & private cloud environments
• Supports global apps that can scale arbitrarily

MapR on Microsoft Azure Marketplace
MapR and Microsoft enable enterprise-grade big data applications in the Azure cloud
Product Alignment
• Simplified Deployment: Azure Marketplace’s automated deployment capabilities make big data easy
• Unlimited Scale: Azure’s infrastructure can scale up to match any requirement and scale down for value
• Seamless Interoperability: MapR integrates with other Azure services to enable customers to analyze any type of data to unlock the biggest insights

Leading optical retail chain
Digital transformation for better customer experience
Deliver self-service insights across the business
OBJECTIVES
• Modernize analytics & improve speed of marketing campaigns
• Reduce cost of existing systems
CHALLENGES
• Existing technologies prohibiting effective & timely reporting and analysis
• Very long time to extract value from the data leading to lots of Excel
SOLUTION
• MapR platform on the Azure cloud to modernize their infrastructure and sunset legacy systems
• Faster exploration of data with Apache Drill, mitigating the need for schema development
• Support for use cases such as customer 360, supply chain & image analysis

The Need For Streaming

Decreasing Job Latencies
Hours → Mins → Secs → Millisecs
Data persistence: on-disk → in-memory

Big Data is Continuously Generated One Event at a Time
{ "time" : "6:01.103", "event" : "RETWEET", "location" : { "lat" : 40.712784, "lon" : -74.005941 } }
{ "time" : "5:04.120", "severity" : "CRITICAL", "msg" : "Service down" }
{ "card_num" : 1234, "merchant" : "MERCH1", "amount" : 50 }

Why Stream Processing?
6:01 P.M.: 72°
6:02 P.M.: 75°
6:03 P.M.: 77°
6:04 P.M.: 85°
6:05 P.M.: 90°
6:06 P.M.: 85°
6:07 P.M.: 77°
6:08 P.M.: 75°
Analyze: “It was hot at 6:05 yesterday!” (90°)
Batch processing may be too late for some events

Why Stream Processing?
Stream: 6:05 P.M.: 90° → Topic (Temperature) → “Turn on the air conditioning!”
It’s becoming important to process events as they arrive
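
The contrast between the two "Why Stream Processing?" slides can be sketched in plain Scala, without any streaming engine. This is a minimal illustration only: `TempReading`, `TemperatureAlerts`, and the `threshold` parameter are hypothetical names introduced here, and the 90° threshold is an assumption chosen to match the reading on the slide.

```scala
// One timestamped temperature reading, as listed on the slide.
final case class TempReading(time: String, degrees: Int)

object TemperatureAlerts {
  // The readings from the slide, in arrival order.
  val readings: List[TempReading] = List(
    TempReading("6:01 P.M.", 72), TempReading("6:02 P.M.", 75),
    TempReading("6:03 P.M.", 77), TempReading("6:04 P.M.", 85),
    TempReading("6:05 P.M.", 90), TempReading("6:06 P.M.", 85),
    TempReading("6:07 P.M.", 77), TempReading("6:08 P.M.", 75))

  // Batch style: inspect the whole data set after the fact.
  def hottest(all: Seq[TempReading]): TempReading = all.maxBy(_.degrees)

  // Stream style: decide on each event as it arrives.
  // `threshold` is a hypothetical setting, not a value from the slides.
  def onEvent(r: TempReading, threshold: Int): Option[String] =
    if (r.degrees >= threshold) Some(s"${r.time}: turn on the air conditioning!")
    else None
}
```

Calling `onEvent` on each reading as it arrives produces the alert at 6:05 P.M., while `hottest` can only report the peak once all readings have been collected.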

Anatomy of Converged Streaming Applications

The Trinity of Real-time
• Global Messaging System (Real Time Producers, Topic 1, Topic 2)
• Stream Processing
• Persistence (Databases and Files)
Real Time Operational Analytics

Creating the Streaming Pipeline
Data Sources → Stream Data (Topic) → Process Data → Store Data → Serve Data

MapR Converged Data Platform
• Processing: Open Source Engines & Tools, Commercial Engines & Applications, Custom Apps, Cloud and Managed Services, Search and Others
• APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
• Data: Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Unified Management and Monitoring

MapR Streams: Global Pub-sub Event Streaming System for Big Data
• Producers publish billions of messages/sec to a topic in a stream.
• Guaranteed, immediate delivery to all consumers.
• Tie together geo-dispersed clusters. Worldwide.
• Standard real-time API (Kafka). Integrates with Spark Streaming, Storm, Apex, and Flink.
• Direct data access (OJAI API) from analytics frameworks.
[Diagram: Producers → Stream (Topics) → Consumers; topic replication to remote sites and consumers; batch analytics]

Scalable Event Streaming with MapR Streams
Topics are partitioned for throughput and scalability
Partition 1: Topic - Pressure
Partition 1: Topic - Temperature
Partition 1: Topic - Warning
Consumers
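
How partitioning spreads messages can be sketched in plain Scala. This is a toy model under stated assumptions, not the partitioner used by MapR Streams or any Kafka client: `TopicPartitioning`, `partitionFor`, and `assign` are hypothetical names, and the hash is a plain `hashCode` modulo the partition count.

```scala
object TopicPartitioning {
  // Map a message key to one of `numPartitions` partitions.
  // Assumption: plain hashCode modulo; real clients may use a different hash.
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "a topic needs at least one partition")
    // floorMod keeps the result non-negative even for negative hash codes.
    Math.floorMod(key.hashCode, numPartitions)
  }

  // Group (key, message) pairs by the partition they would land in,
  // keeping the original order of messages within each partition.
  def assign(messages: Seq[(String, String)], numPartitions: Int): Map[Int, Seq[String]] =
    messages
      .groupBy { case (key, _) => partitionFor(key, numPartitions) }
      .map { case (partition, kvs) => partition -> kvs.map(_._2) }
}
```

Messages that share a key always land in the same partition in this sketch, so their relative order is kept, while different keys can be consumed in parallel from different partitions.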

MapR-DB is Designed to Scale
• Fast Reads and Writes by Key
• Data is automatically partitioned by Key Range
[Diagram: three key ranges, each holding a table partition with columns Key, Col B, Col C]
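
The idea of partitioning by key range can be illustrated with a small lookup in plain Scala. This is a sketch under assumptions, not MapR-DB's implementation: `Region`, `regionFor`, and the region names and split points used in the usage note are all hypothetical.

```scala
object KeyRangeLookup {
  // A region owns row keys from its start key (inclusive)
  // up to the next region's start key (exclusive).
  final case class Region(startKey: String, name: String)

  // `regions` must be sorted by startKey, and the first must start at "".
  def regionFor(regions: Vector[Region], rowKey: String): Region = {
    require(regions.nonEmpty && regions.head.startKey.isEmpty,
      "the first region must start at the empty key")
    // The owner is the last region whose start key is <= the row key.
    regions.filter(r => r.startKey.compareTo(rowKey) <= 0).last
  }
}
```

With hypothetical regions starting at "", "H", and "P", the row key `COHUTTA_3/10/14_1:01` falls in the first region, because "C" sorts before "H". A read or write by key therefore touches only one partition.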

Use Cases

Customer 360 & Behavior Prediction
Data sources (published to topics): Website Click-Stream, Internal Data Sources, External Data Sources, Support Tickets, DBMS, Email, CRM
Real Time/Offline ClickStream Analysis
• Prediction Modelling
• Attribution Modelling
• Cohort Analysis
• Customer Lifetime Value Analysis
• Attrition Modelling
• Response Modelling
• Churn Modelling
Eliminate latency due to data movement between clusters
Eliminate redundant storage with MapR Streams and lower the TCO
EDH/EDL (Offline and Real Time): HA, DR, NFS, Snapshots, Data Protection
360 Degree Customer View
Customer Behavior Prediction
Better Conversion Rate and Lower Attrition $$$

Prescriptive Analytics: IoT & Auto Manufacturing
Data sources (published to topics): GPS, Telematic Data, Telephone, Truck Fleet
Data generated from cars are stored locally
Data Modelling/Secondary ETL: Data is converted from proprietary to Parquet format
• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.
• Image and video analysis
• Time series analysis for threshold breach

Demo

What if BP had detected problems before the oil hit the water?
1M samples/sec
High performance at scale is necessary!

Use Case: Time Series Data
Sensor time-stamped data → Stream (Topic) → read → Spark Streaming → Spark processing → Data for real-time monitoring

Use Case: Time Series Data
Sensor
time-stamped data
Stream
Topic
COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94
COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
Data: PumpId, Date, Time, pressure and flow measurements

Schema
• All events are stored in CF data, which could be set to expire data
• Filtered alerts are put in CF alerts
• Daily summaries are put in CF stats
Row key
CF data CF alerts CF stats
hz … psi psi … hz_avg … psi_min
COHUTTA_3/10/14_1:01 10.37 84 0
COHUTTA_3/10/14 10 0
Row key contains oil pump name, date, and a time stamp
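
The two row key shapes shown in the table can be built as follows. This is a minimal sketch: the object and function names (`RowKeys`, `eventKey`, `dailyKey`) are hypothetical, and only the underscore-joined layout is taken from the examples on the slide.

```scala
object RowKeys {
  // Per-event key: pump name, date, and time joined with underscores,
  // for example COHUTTA_3/10/14_1:01.
  def eventKey(pump: String, date: String, time: String): String =
    s"${pump}_${date}_${time}"

  // Daily summary key: pump name and date only, for example COHUTTA_3/10/14.
  def dailyKey(pump: String, date: String): String =
    s"${pump}_${date}"
}
```

Every event key for a given pump and day starts with that day's summary key, so those rows share a common key prefix.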

What Do We Need to Do?
Data Sources → Collect Data (Stream, Topic) → Process Data → Store Data → Serve Data

Use Case Example Code
Sensor time-stamped data → Stream (Topic) → read → Spark Streaming → Spark processing → Data for real-time monitoring

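KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

String topic = "/streams/pump:warning";
public static KafkaProducer<String, String> kafkaProducer;
// 1. Configure KafkaProducer properties
Properties properties = new Properties();
// The key serializer is set as well, since the Kafka producer API expects both
properties.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// 2. Create KafkaProducer with properties
kafkaProducer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// 3. Create a producer record with topic and message
ProducerRecord<String, String> record =
    new ProducerRecord<String, String>(topic, txt);
// 4. Use the Kafka producer to send the record
kafkaProducer.send(record);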

Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
// create an input stream for a set of topics
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
[Diagram: dStream as a sequence of batches (time 0 to 1, time 1 to 2, time 2 to 3), each stored in memory as an RDD]
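
The way a stream is cut into batches can be modelled in plain Scala, without Spark. This is a toy model under stated assumptions: `MicroBatches`, `Event`, and `toBatches` are hypothetical names, and each event's timestamp stands in for the moment the data is received, which is what determines the batch in Spark Streaming.

```scala
object MicroBatches {
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to the batch interval that contains its timestamp.
  // Returns (batch start in ms, payloads) pairs ordered by batch start.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] = {
    require(batchMs > 0, "batch interval must be positive")
    events
      .groupBy(e => (e.timestampMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => start -> es.map(_.payload) }
  }
}
```

With a 5000 ms interval, matching `Seconds(5)` above, events at 0 ms and 4999 ms share the first batch, and an event at 5000 ms starts the next one. Each resulting group plays the role of one RDD in the DStream.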

Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
hz: Double, disp: Double, flo: Double, sedPPM: Double,
psi: Double, chlPPM: Double)
// Parse CSV Strings into Sensor objects
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
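
A more defensive variant of the parser is sketched below, assuming that malformed lines should be skipped rather than fail the batch. That policy is an assumption and is not stated on the slides. `SensorParsing` and `parseSensorOpt` are hypothetical names, and the `Sensor` case class is repeated only so the block is self-contained.

```scala
import scala.util.Try

final case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)

object SensorParsing {
  // Returns None instead of throwing when a line has too few fields
  // or a measurement that is not a number.
  def parseSensorOpt(str: String): Option[Sensor] = Try {
    val p = str.split(",").map(_.trim)
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }.toOption
}
```

In a pipeline, `flatMap` with `parseSensorOpt` would drop unparseable lines, where `map(parseSensor)` would throw on the first bad record.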

Process DStream
// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)
[Diagram: map is applied to each batch of dStream RDDs (time 0 to 1, time 1 to 2, time 2 to 3), creating new sensorDStream RDDs for every batch]

DataFrame and SQL Operations
// for each RDD
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  // brings the RDD-to-DataFrame conversion (toDF) into scope
  import sqlContext.implicits._
  // convert RDD to DataFrame
  rdd.toDF().registerTempTable("sensor")
  // get the avg, max, and min for pump values
  val res = sqlContext.sql(
    """SELECT resid, date,
       max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
       max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
       max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
       max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
       FROM sensor GROUP BY resid, date""")
  res.show()
}
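
What the GROUP BY query computes can be checked on a small sample with plain Scala collections. This is an illustrative sketch, not how Spark SQL executes the query: `PumpStats`, `PumpReading`, and `Summary` are hypothetical names, and only the hz and psi columns are included to keep it short.

```scala
object PumpStats {
  final case class PumpReading(resid: String, date: String, hz: Double, psi: Double)
  final case class Summary(max: Double, min: Double, avg: Double)

  // max, min, and average of a non-empty sequence of values.
  private def summarize(xs: Seq[Double]): Summary =
    Summary(xs.max, xs.min, xs.sum / xs.size)

  // Plain-collections counterpart of GROUP BY resid, date:
  // returns (hz summary, psi summary) for each (resid, date) group.
  def dailyStats(rs: Seq[PumpReading]): Map[(String, String), (Summary, Summary)] =
    rs.groupBy(r => (r.resid, r.date)).map { case (key, group) =>
      key -> (summarize(group.map(_.hz)), summarize(group.map(_.psi)))
    }
}
```

For the three COHUTTA sample rows shown earlier (hz 10.27, 10.47, 9.67 and psi 85, 92, 87), this yields a single group for ("COHUTTA", "3/10/14") with maxhz 10.47, minhz 9.67, maxpsi 92, and minpsi 85.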

Streaming Application Output

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch (time 0 to 1, time 1 to 2, time 2 to 3), linesRDD DStream → map → sensorRDD DStream → save]
Output operation: persist data to external storage
Put objects written to HBase

Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Building a Complete Data Architecture
Sources/Apps → MapR Converged Data Platform: MapR Streams, MapR Database (MapR-DB), MapR File System (MapR-FS) → Stream Processing, Bulk Processing

Azure and MapR Resources – 3 steps to get started
• Azure Overview
https://www.mapr.com/partners/partner/microsoft-azure-microsofts-cloud-computing-platform-moving-faster-achieving-more
• 7 Steps to Deploy the MapR Sandbox on Azure
https://www.mapr.com/blog/7-steps-deploy-mapr-sandbox-microsoft-azure
• Azure Test Drive
http://mapr.testdrivelabs.com/ (subject to change)

Q&A
Engage with us!
1. Read the explanation and download the code
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform
https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community
https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training
https://learn.mapr.com
