© 2016 Mesosphere, Inc. All Rights Reserved. 1
@joerg_schad @dcos #smack
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search
Spark Summit East
February 08, 2017
Jörg Schad
Distributed Systems Engineer
@joerg_schad
HYPERSCALE MEANS VOLUME AND VELOCITY
Batch | Micro-Batch | Event Processing
Days | Hours | Minutes | Seconds | Microseconds
Reports what has happened using descriptive analytics
Solves problems using predictive and prescriptive analytics
Use cases: Billing, Chargeback; Product Recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface
SMACK stack
EVENTS: Ubiquitous data streams from connected devices (Sensors, Devices, Clients)
INGEST: Apache Kafka. Ingest millions of events per second
STORE: Apache Cassandra. Distributed & highly scalable database
ANALYZE: Apache Spark. Real-time and batch data processing
ACT: Akka. Visualize data and build data-driven applications
DC/OS
NAIVE APPROACH
Typical Datacenter: siloed, over-provisioned servers, low utilization
Industry Average: 12-15% utilization
MySQL
microservice
Cassandra
Spark/Hadoop
Kafka
Mesos & DC/OS
MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
MySQL
microservice
Cassandra
Spark/Hadoop
Kafka
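To make the utilization argument concrete, here is a minimal sketch comparing one-workload-per-server silos with workloads multiplexed onto shared machines. Everything numeric is an assumption for illustration: the 16-core server size and the per-workload CPU demands are hypothetical, chosen so the siloed case lands near the slide's industry-average figure. The first-fit placement is only a stand-in; Mesos itself schedules through resource offers to frameworks, not through this algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    cpus: float  # average CPU cores actually used (hypothetical figure)


SERVER_CPUS = 16.0  # hypothetical capacity of a single server

# Hypothetical average demand for the five workloads named on the slide.
WORKLOADS = [
    Workload("MySQL", 2.0),
    Workload("microservice", 1.0),
    Workload("Cassandra", 3.0),
    Workload("Spark/Hadoop", 4.0),
    Workload("Kafka", 2.0),
]


def siloed_utilization(workloads, server_cpus=SERVER_CPUS):
    """Utilization when every workload gets its own dedicated server."""
    used = sum(w.cpus for w in workloads)
    return used / (len(workloads) * server_cpus)


def multiplex(workloads, server_cpus=SERVER_CPUS):
    """Place workloads onto shared servers, largest first, first fit.

    Returns a list of servers, each a list of the workloads placed on it.
    """
    servers = []
    for w in sorted(workloads, key=lambda w: w.cpus, reverse=True):
        for server in servers:
            if sum(x.cpus for x in server) + w.cpus <= server_cpus:
                server.append(w)
                break
        else:
            # No existing server has room, so start a new one.
            servers.append([w])
    return servers


def multiplexed_utilization(workloads, server_cpus=SERVER_CPUS):
    """Utilization when workloads share servers via multiplex()."""
    servers = multiplex(workloads, server_cpus)
    used = sum(w.cpus for w in workloads)
    return used / (len(servers) * server_cpus)


if __name__ == "__main__":
    print(f"siloed:      {siloed_utilization(WORKLOADS):.0%}")
    print(f"multiplexed: {multiplexed_utilization(WORKLOADS):.0%}")
```

With these assumed numbers the same total demand fits on one shared server instead of five dedicated ones, which is where the utilization gain comes from. Real clusters also have to account for memory, disk, peak load, and failure domains, which this sketch ignores.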
DC/OS ENABLES MODERN DISTRIBUTED APPS
Modern App Components
Microservices (in containers): Functions & Logic
Big Data + Analytics Engines: Streaming, Batch, Machine Learning, Analytics, Search, Time Series, SQL / NoSQL Databases
Ecosystem of frameworks & apps
Datacenter Operating System (DC/OS)
Consistent architecture to run on top of kernel
User Interface (GUI & CLI)
Core system services (e.g., distributed init, cron, service discovery, package mgt & installer, storage)
Distributed Systems Kernel (Mesos)
Distributed systems kernel to abstract resources
Any Infrastructure (Physical, Virtual, Cloud)
EXAMPLE: REAL-TIME TRACKING
GEO-ENABLED IoT
DATA FLOW
DEMO
THANK YOU!
ANY QUESTIONS?
@dcos
users@dcos.io
/groups/8295652
/dcos
/dcos/examples
/dcos/demos
chat.dcos.io
Keep it running!
SERVICE OPERATIONS
● Configuration Updates (e.g., scaling, reconfiguration)
● Binary Upgrades
● Cluster Maintenance (e.g., backup, restore, restart)
● Monitor progress of operations
● Debug any runtime blockages
APACHE SPARK (STREAMING)
Typical Use: distributed, large-scale data processing; micro-batching
Why Spark Streaming?
● Micro-batching delivers low latency, which can be much faster than traditional batch processing
● Well-defined role means it fits in well with other pieces of the pipeline
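The micro-batching idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Spark Streaming's API: the function names, the event tuples, and the vehicle identifiers are all hypothetical. Events are grouped into fixed-width time windows, and each window is then processed as one small batch, so an event waits at most one batch interval before it is handled.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, payload) events into fixed-width batches.

    Returns a list of (batch_start_ms, [payloads]) sorted by batch start.
    """
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    buckets = defaultdict(list)
    for timestamp_ms, payload in events:
        # Round the timestamp down to the start of its batch window.
        batch_start = (timestamp_ms // batch_interval_ms) * batch_interval_ms
        buckets[batch_start].append(payload)
    return sorted(buckets.items())


def count_per_batch(events, batch_interval_ms):
    """A trivial per-batch computation: number of events in each batch."""
    return [
        (start, len(payloads))
        for start, payloads in micro_batches(events, batch_interval_ms)
    ]


# Hypothetical position events: (milliseconds since start, vehicle id).
EVENTS = [
    (200, "bus-1"),
    (900, "bus-2"),
    (1100, "bus-1"),
    (2500, "bus-3"),
    (2700, "bus-1"),
]

if __name__ == "__main__":
    for start, count in count_per_batch(EVENTS, 1000):
        print(f"batch starting at {start} ms: {count} events")
```

Shrinking the interval lowers latency but produces more, smaller batches to schedule, which is the trade-off a micro-batch engine tunes.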
