Spark and MapR Streams: A Motivating Example

© 2017 MapR Technologies

Abstract
• Businesses are discovering the untapped potential of large datasets and data
streams through the use of technologies for big data processing and storage. By
leveraging these assets they’re creating a new generation of applications that
derive value from data they used to throw away.
• In this presentation we’ll discuss how to build operational environments for
these types of applications with the MapR Converged Data Platform and we’ll
walk through an example of a next-generation application that uses Java APIs
for MapR Streams, Apache Spark, Apache Hive, and MapR-DB.
• We’ll see how these technologies can be used to join and transform unbounded
datasets to find signals and derive new data streams for a financial scenario
involving real-time algorithmic trading and historical analysis using SQL.
• We’ll also discuss how MapR enables you to run real-time data applications with
the speed, reliability, and security you need for a production environment.
• Keywords: MapR, Spark, Kafka, NoSQL, JSON, Zeppelin, Hive, streaming

Contact Info
Ian Downard
Technical Evangelist at MapR Technologies
idownard@mapr.com
Personal Blog: http://bigendiandata.com
Twitter: @iandownard

Learning Goals
1. Appreciate the opportunity of the time we’re in.
2. Become familiar with MapR.
3. Become familiar with Spark.
4. Feel empowered.

Why Now?
• But Moore’s law has applied for a long time
• Why is data exploding now?
• Why not 10 years ago?
• Why not 20?

Because data wasn’t available?
• If it were just availability of data then existing big companies
would adopt big data technology first
They didn’t

Because processing it was too expensive?
• If it were just a net positive value then finance companies
should adopt first because they have higher opportunity value
per byte
They didn’t

Backwards adoption
• Under almost any argument, startups would not have adopted
big data technology first
They did

Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– By large companies and small
Why?

Data Analytics Scaling Laws
• Analytics scaling is all about:
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has radically changed
– Cluster computing, commodity hardware, data science frameworks…
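
The point that net value depends on how costs scale can be made concrete with a toy model. This is only a sketch: the value curve and both cost curves below are assumed shapes picked to mimic diminishing returns, exponential cost, and linear cost with a small fixed overhead. None of the constants come from this talk.

```java
import java.util.function.DoubleUnaryOperator;

public class ScalingLawSketch {
    // Assumed value curve: big early gains, rapidly diminishing returns.
    static double value(double scale) {
        return 1.0 - Math.exp(-scale / 300.0);
    }

    // Assumed "old school" cost: cheap at first, then exponential growth.
    static double exponentialCost(double scale) {
        return 0.01 * (Math.exp(scale / 400.0) - 1.0);
    }

    // Assumed "big data" cost: fixed overhead plus a low linear slope.
    static double linearCost(double scale) {
        return 0.05 + 0.00005 * scale;
    }

    // Scans integer scales 0..maxScale and returns the one with the highest net value.
    static int bestScale(DoubleUnaryOperator cost, int maxScale) {
        int best = 0;
        double bestNet = Double.NEGATIVE_INFINITY;
        for (int s = 0; s <= maxScale; s++) {
            double net = value(s) - cost.applyAsDouble(s);
            if (net > bestNet) {
                bestNet = net;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("optimum scale, exponential cost: "
                + bestScale(ScalingLawSketch::exponentialCost, 2000));
        System.out.println("optimum scale, linear cost:      "
                + bestScale(ScalingLawSketch::linearCost, 2000));
    }
}
```

Under these assumed curves the linear model costs more at small scale (because of its fixed overhead) but far less at large scale, so its net-value optimum sits at a larger scale.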

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
Most data isn’t worth much in isolation
First data is valuable
Later data is dregs

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
If we can handle the scale
It’s really big

So what makes that possible?

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
Net value optimum has a sharp peak well before maximum effort

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
But scaling laws are changing both slope and shape

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
More than just a little

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
They are changing a LOT!

[Chart: Value (0 to 1) vs. Scale (0 to 2,000)]
Initially, linear cost scaling actually makes things worse.
Then a tipping point is reached and things change radically …

MapR Overview

How do you persist data?

All major persistence abstractions are one of these:
• Files
• Streams
• Tables
[Figure labels: “tokyo”, “User profiles”]

[Diagram: stages are Source Data, Stream Processing & Storage, and Final Output Storage. Components shown as separate clusters: HDFS, Kafka (several), Spark, and Cassandra / Mongo (several).]
“Classic” streaming involves single-purpose clusters.

[Diagram: stages are Source Data, Stream Processing & Storage, and Final Output Storage. Components shown in one cluster: MapR-FS, MapR Streams, Spark, and MapR-DB.]
MapR converges the data layer into a single cluster.

What is MapR?
A Converged Data Platform

MapR Converged Data Platform
[Diagram, top to bottom:]
• Open Source Engines & Tools; Commercial Engines & Applications; Cloud and Managed Services; Search and Others; Custom Apps
• Data Processing
• APIs: HDFS API; POSIX, NFS; HBase API; JSON API; Kafka API
• Web-Scale Storage (MapR-FS); Database (MapR-DB); Event Streaming (MapR Streams)
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
• Unified Management and Monitoring

“Convergence” means…
• One cluster that does it all: Files + Tables + Streams
• Standard APIs for everything
• A distributed file system that looks “normal” (POSIX)
• Unified Management
• Global Namespace
• Mirroring, Replication, and Snapshots
– Synchronize files, tables, and streams across datacenters
– True failover for your applications

How do I use MapR?
• Installs on Linux (e.g. Ubuntu, Red Hat), typically onto a block
device and typically as a cluster of 3 or more nodes.
• Packaged as a scriptable / web-based installer, cloud
marketplace offers, and Docker containers.
• Sandbox VMs for your laptop.

MapR In Action

Apply MapR as a data layer for containers.
[Diagram labels: Producer, Servlet Engine, HTTP Log, Browser]

Procedure
1. Download Sandbox
– Configure for Host-only Adapter
2. Clone the GitHub repo
3. Compile code
4. Build Docker images
5. Create the MapR Streams topics
6. Run the Docker containers

Docker / MapR demo commands
1. git clone https://github.com/mapr-demos/mapr-pacc-sample
2. maprcli stream create -path /apps/sensors -produceperm p -consumeperm p -topicperm p
3. maprcli stream topic create -path /apps/sensors -topic computer
4. /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-consumer.sh --new-consumer --bootstrap-server this.will.be.ignored:9092 --topic /apps/sensors:computer
5. docker run -it -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr --name producer mapr-sensor-producer
6. docker run -it --privileged --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse -e MAPR_CLDB_HOSTS=192.168.99.3 -e MAPR_CLUSTER=demo.cluster.com -e MAPR_CONTAINER_USER=mapr -e MAPR_MOUNT_PATH=/mapr -p 8080:8080 --name web mapr-web-consumer
7. Open http://localhost:8080
8. Open http://192.168.99.3:8443
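
In the commands above, a MapR Streams topic is addressed as a stream path and a topic name joined by a colon (/apps/sensors:computer). The helper below is a hypothetical sketch of splitting such a name in application code. The class and method names are illustrative and are not part of any MapR API.

```java
public class StreamTopicPath {
    final String streamPath; // e.g. /apps/sensors
    final String topic;      // e.g. computer

    StreamTopicPath(String streamPath, String topic) {
        this.streamPath = streamPath;
        this.topic = topic;
    }

    // Splits "<stream path>:<topic>" at the last colon (hypothetical helper).
    static StreamTopicPath parse(String fullName) {
        int sep = fullName.lastIndexOf(':');
        if (!fullName.startsWith("/") || sep <= 0 || sep == fullName.length() - 1) {
            throw new IllegalArgumentException(
                    "expected /stream/path:topic but got " + fullName);
        }
        return new StreamTopicPath(fullName.substring(0, sep), fullName.substring(sep + 1));
    }

    public static void main(String[] args) {
        StreamTopicPath p = parse("/apps/sensors:computer");
        System.out.println(p.streamPath + " | " + p.topic);
    }
}
```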

References
• MapR Sandbox
http://maprdocs.mapr.com/home/SandboxHadoop/t_install_sandbox_vbox.html
• MapR sample application
https://mapr.com/blog/getting-started-mapr-client-container/
• MapR Tutorials
https://mapr.com/developercentral/code/

Apache Spark

https://databricks.com/spark/about

Resilient Distributed Datasets (RDDs)
• RDDs let programmers perform in-memory computations on
large distributed datasets in a fault-tolerant manner.
• An RDD is a representation of data that may or may not be on
your local machine. It’s partitioned across the cluster (like a
distributed Java Collection).
• An RDD is immutable.
– JavaRDD<String> lines = sc.textFile("/path/to/data.log");
• When you read data, nothing gets loaded. The file isn’t even
opened. You first declare the operations you’re going to
perform; the data is loaded and operated on only when you
perform an action that materializes it.

1. Start by reading from files, a DB, etc. to create a top-level RDD.
2. Lazy transformations:
.filter(), .map(), .sample(), .repartition() (which triggers a shuffle)
3. Actions trigger the work to finally run. Some, like .collect(), pull
all the data into the driver JVM.
.saveToCassandra(), .count(), .collect()
4. Once you have an RDD you’d like to keep working on, you can call
.cache() on it to keep it around, so you don’t have to derive it
again. By default, cache() stores the RDD in memory; use .persist()
with a storage level to spill to disk.

• The RDD is the building block of Spark.
– DataFrame, Dataset, DStream, etc. are all abstractions built on RDDs
• Immutable
• Operated on by lambda functions
• Lazily evaluated
• Kick off parallel execution with actions like collect(), count(),
etc.
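
Lazy evaluation can be tried without a Spark cluster. The sketch below uses java.util.stream as an analogy, not the Spark API: intermediate operations such as filter() and map() play the role of transformations, and the terminal collect() plays the role of an action. The class name and sample log lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {
    // Records which elements the filter has actually inspected.
    static final List<String> inspected = new ArrayList<>();

    // Declares the pipeline only; like a lazy transformation, nothing runs here.
    static Stream<String> declareErrors(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    inspected.add(line);
                    return line.contains("ERROR");
                })
                .map(String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "info start", "ERROR disk full", "info stop", "ERROR timeout");

        Stream<String> errors = declareErrors(lines);
        System.out.println("inspected before action: " + inspected.size());

        // The terminal operation plays the role of a Spark action.
        List<String> result = errors.collect(Collectors.toList());
        System.out.println("inspected after action: " + inspected.size());
        System.out.println(result);
    }
}
```

The filter inspects no elements until collect() is called, which mirrors how Spark defers work until an action such as count() or collect().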

What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream
processing of live data
• Run continuous SQL queries on data pushed into Kafka
[Diagram labels: Data Sources, Data Sinks]

MapR Streams store and expose stream data for processing
[Diagram labels: tail -f, Output action]

Spark Streaming Architecture
• Processed results are pushed out in batches
[Diagram: input data stream → Spark Streaming → Spark → batches of processed results. Each batch interval becomes one RDD: data from time 0 to 1 is RDD @ time 1, data from time 1 to 2 is RDD @ time 2, and data from time 2 to 3 is RDD @ time 3.]
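
The batching idea can be sketched in plain Java, without Spark. The hypothetical helper below groups event timestamps into fixed batch intervals and keys each batch by its end time, so events in [0, 1000) land in the batch at 1000, matching the "data from time 0 to 1 is RDD @ time 1" labeling. Timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatchSketch {
    // Groups event timestamps (ms, non-negative) into batches keyed by batch end time.
    static TreeMap<Long, List<Long>> toBatches(List<Long> timestamps, long intervalMs) {
        TreeMap<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long batchEnd = (ts / intervalMs + 1) * intervalMs;
            batches.computeIfAbsent(batchEnd, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 900L, 1000L, 2500L);
        // With a 1000 ms batch interval these events form three batches.
        for (Map.Entry<Long, List<Long>> batch : toBatches(events, 1000L).entrySet()) {
            System.out.println("batch @ " + batch.getKey() + " ms: " + batch.getValue());
        }
    }
}
```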

Spark In Action
• Spark Shell
• Spark SQL in Zeppelin
• Spark SQL Databricks Notebook
• Spark Streaming Java API
• Debugging Spark with IntelliJ

Databricks Cloud
• Spark notebook in the cloud
– https://community.cloud.databricks.com/
• Sample notebooks:
– https://databricks.com/resources/type/example-notebooks

Spark Shell (aka REPL)
• If you install Spark locally, you get
this.
• Evaluates each command as soon as
you type it in, and shows you
the output.
• Fantastic way to experiment, with
tab completion.

Apache Zeppelin

Debugging Spark with IntelliJ
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

Monitoring
http://[hostname]:4040/jobs/

Spark Streaming + ML on MapR
• Predict the location and time of taxi requests.
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1
– https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2
[Diagram labels: K-means Clustering (Spark ML on Uber dataset); Classification Models (Spark ML); streaming topic: location and time of taxi requests; Predicted and actual pickup locations and times; Ridership analytics (Zeppelin)]

Streaming + ML demo procedure
1. Create topics:
maprcli stream create -path /user/mapr/stream -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -path /user/mapr/stream -topic ubers -partitions 3
maprcli stream topic create -path /user/mapr/stream -topic uberp -partitions 3
2. Create and save the K-means model to /mapr/my.cluster.com/user/mapr/data/savemodel:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkml.uber.ClusterUber --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar
3. Send the test dataset to a stream (just to illustrate using a stream):
java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/mapr/stream:ubers /mapr/my.cluster.com/user/mapr/data/uber.csv
4. Monitor the test dataset (optional, on nodeb):
java -cp /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/mapr/stream:ubers
5. Use the model to predict the cluster for incoming taxi telemetry, and output predictions to a topic:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/data/savemodel /user/mapr/stream:ubers /user/mapr/stream:uberp
6. Read the predictions topic and put it into a format that we can analyze ad hoc in SQL:
/opt/mapr/spark/spark-2.0.1/bin/spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] /home/mapr/mapr-sparkml-streaming-uber/target/mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/mapr/stream:uberp
7. Open http://nodea:4040
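
Predicting a cluster with a trained K-means model amounts to finding the nearest centroid. The plain-Java sketch below shows only that idea. It is not the demo's Spark ML code, and the centroid coordinates are invented placeholders rather than values from the Uber dataset.

```java
public class NearestCentroidSketch {
    // Returns the index of the centroid closest to (lat, lon) by squared Euclidean distance.
    static int predict(double[][] centroids, double lat, double lon) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dLat = centroids[i][0] - lat;
            double dLon = centroids[i][1] - lon;
            double dist = dLat * dLat + dLon * dLon;
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical centroids as {latitude, longitude}; a real model would supply these.
        double[][] centroids = {
            {40.75, -73.98},
            {40.65, -73.78},
            {40.85, -73.90}
        };
        // A pickup near the second centroid is assigned cluster index 1.
        System.out.println("cluster = " + predict(centroids, 40.64, -73.79));
    }
}
```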

Real-Time Stock Market Analysis
https://mapr.com/appblueprint
https://github.com/mapr-demos/finserv-application-blueprint

Advanced Concept:
Look Back for n Seconds on a Topic
Offset Topic: Key = Time t, Value = Offset of Data Topic at t
Time (key):      t₀    t₁    t₂    t₃    t₄    t₅
Offset (value):  3253  3347  3467  3608  3798  3913
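
The look-back lookup can be sketched with a sorted map standing in for the offset topic. This is an in-memory illustration, not the blueprint's implementation: the class and method names are hypothetical, and t₀ through t₅ are assumed to be seconds 0 through 5. In a real consumer the returned offset would typically be handed to a seek call on the data topic.

```java
import java.util.Map;
import java.util.TreeMap;

public class LookBackSketch {
    // Stand-in for the offset topic: key = time t (seconds), value = data-topic offset at t.
    private final TreeMap<Long, Long> offsetsByTime = new TreeMap<>();

    void record(long timeSec, long dataOffset) {
        offsetsByTime.put(timeSec, dataOffset);
    }

    // Offset to start reading from in order to replay the last lookBackSec seconds;
    // returns -1 if nothing was recorded at or before that time.
    long offsetFor(long nowSec, long lookBackSec) {
        Map.Entry<Long, Long> entry = offsetsByTime.floorEntry(nowSec - lookBackSec);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        LookBackSketch index = new LookBackSketch();
        long[] offsets = {3253, 3347, 3467, 3608, 3798, 3913};
        for (int t = 0; t < offsets.length; t++) {
            index.record(t, offsets[t]); // assumed: t0..t5 are seconds 0..5
        }
        // Looking back 2 seconds from t5 starts at the offset recorded at t3.
        System.out.println(index.offsetFor(5, 2)); // prints 3608
    }
}
```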

MapR Streams vs Kafka

Call To Action
• You can foster innovation just by making data available.
• Seeking career advancement?
– Coursera classes on data science, ML, Spark, etc.
– Be a polyglot.
– Enable data science from development to production.
• You can apply those skills in ANY industry.
• Don’t be afraid of not knowing much.
• 87% of career builders attribute career benefit to completing
online courses (Harvard Business Review, Coursera)
– Be better equipped for current job, find a new job, change career.

All Industries
• ETL / DW optimization
• Mainframe optimization
• Real-time application & network monitoring
• Security information & event management
Web 2.0
• Recommendation engines & targeting
• Customer 360
• Click-stream analysis
• Social media analysis
• Ad optimization
Healthcare
• Patient system of record
• Smart hospitals
• Biometrics
• Patient vital monitoring
Telecom
• Fraud detection
• Crowd-based antenna optimization
• Charging & billing
• Equipment monitoring & preventative maintenance
• Smart meter analysis
Oil & Gas
• Pump monitoring & alerting
• Seismic trace identification
• Equipment maintenance
• Safety & environment
• Security
Financial Services
• Real-time fraud/risk monitoring
• Mobile notifications of transactions
Retail
• Real-time supply chain optimization
• Customer location optimization
• Real-time coupons
Ad Tech
• Ad targeting & optimization
• Global campaign dashboards
Have an interesting use case? Let’s talk!
