1. Fast Data:
A Paradigm for the Demands of Efficient IoT Solutions
Stephen Dillon
Big Data Architect
@StephenDillon15
stephen.dillon@schneider-electric.com
http://www.linkedin.com/in/stephendillon/
2. Agenda
● Goals
● Background
● Genesis of Fast Data
– Why has it emerged?
– IoT
– Big Data
– Influences on Fast Data
● Fast Data
– Define & Describe
– Fog Computing
● Examples of Technologies
– Review of Apache Spark
3. Goals
● Be able to answer “What is Fast Data?”
● Understand why we care about it.
● Expose you to Apache Spark
4. About Me
● IoT platform team
– Innovation team since 2016
● Focus on technology
– 6 months, 1 year, 3 years out
– Big Data & DB Technologies
● NoSQL, NewSQL, Streaming
● Distributed data
– Proofs of concept, Best Practices
● Technical leadership, IP, papers
5. Recent Work in 2016
● Recent white paper:
– “IoT and the Pervasive Nature of Fast Data and Apache Spark”
– bit.ly/1Td6KFU
● Co-inventor on 2 patent submissions
– using Spark
● Co-authored upcoming research paper on
Federated Data queries
7. Why Fast Data?
● Growth of IoT
● Mobility of IoT
– Demands lower latency
● Complexity of analytics
– Graph theory
– Predictive Analytics
– Machine Learning
8. Internet of Things
“The internet of things (IoT) is the network of
physical objects—devices, vehicles, buildings
and other items—embedded with electronics,
software, sensors, and network connectivity that
enables these objects to collect and exchange
data.” - Wikipedia
10. IoT is Not Only about Hardware
It's also what you do with the Data that matters!
11. Why does it matter?
● Sensors collect data
● Data fuels analytics
● Analytics support business
● Derive actionable insights from the data
13. Classic Definition
"...data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or
doesn't fit the structures of your database architectures. To
gain value from this data, you must choose an alternative
way to process it."
14. Big Data
● Volume
– A lot of it
● Velocity
– Ingress at high frequencies
● Variety
– Multi-structured not unstructured
– Data from disparate sources
– Different data points are captured
15. Influences on Fast Data
● NoSQL
● Hadoop Framework
● MapReduce & Batch Analytics
● NewSQL (in-memory DBs & distributed row
stores)
Led to 3 significant concepts…
16. 3 Significant Concepts
● Distributed Data storage
● Horizontal, shared-nothing, scale-out
architecture
● In-memory processing…RAM is the new disk
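The three concepts above can be illustrated together in a minimal sketch, in plain Python rather than any real cluster framework. The "nodes", sensor names, and shard count are hypothetical: records are hash-partitioned into disjoint shards (distributed storage), each shard is aggregated independently with no shared state (shared-nothing scale-out), and everything stays in memory.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical 3-node cluster

def partition(records, num_nodes):
    """Hash-partition records by key so each node owns a disjoint shard."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for key, value in records:
        shards[hash(key) % num_nodes][key].append(value)
    return shards

def local_aggregate(shard):
    """Each node sums its own keys entirely in memory -- no shared state."""
    return {key: sum(values) for key, values in shard.items()}

def merge(partials):
    """Combine per-node partial results; keys are disjoint across shards."""
    out = {}
    for partial in partials:
        out.update(partial)
    return out

readings = [("sensor-a", 2), ("sensor-b", 5), ("sensor-a", 3), ("sensor-c", 1)]
shards = partition(readings, NUM_NODES)
result = merge(local_aggregate(s) for s in shards)
# result == {"sensor-a": 5, "sensor-b": 5, "sensor-c": 1}
```

Because each shard can be aggregated on a different machine with no coordination until the final merge, adding nodes scales the work horizontally.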
18. Definition
Fast Data is a paradigm that supports "...as-it-happens information
enabling real-time decision-making" [1]. It encompasses not only
the ingestion of data at speed but also the processing of the
data, deriving actionable insights from it, and the speed of delivery
of the results. It truly encompasses the Variety and Volume of data
at Velocity in all aspects.
[1] Alissa Lorentz, “Big Data, Fast Data, Smart Data”
19. Characteristics
● It’s a paradigm, not a technology.
● A subset of Big Data
● Describes data in motion
● Data ingestion is a key tenet but...
– Not only about Velocity of data ingestion
20. Fast Data Solutions
● Streaming
● Interactive queries (batch & real-time)
● In-memory capability
● Provides low latency for ingestion, processing, and delivery
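The streaming bullet can be made concrete with a minimal sketch in plain Python (not any specific streaming product). It groups timestamped sensor events into fixed, non-overlapping time windows and counts readings per device; the window width and device names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical window width

def tumbling_window_counts(events, window=WINDOW_SECONDS):
    """Group (timestamp, device_id) events into fixed, non-overlapping
    windows and count readings per device in each window. Results are
    available as soon as a window closes, not after a nightly batch job."""
    counts = defaultdict(lambda: defaultdict(int))
    for ts, device in events:
        window_start = ts // window * window
        counts[window_start][device] += 1
    return {start: dict(devices) for start, devices in sorted(counts.items())}

stream = [(1, "pump-1"), (4, "pump-2"), (9, "pump-1"), (12, "pump-1")]
print(tumbling_window_counts(stream))
# {0: {'pump-1': 2, 'pump-2': 1}, 10: {'pump-1': 1}}
```

The key Fast Data property is that the answer for window [0, 10) exists the moment that window closes, so downstream decisions can be made with seconds of latency instead of hours.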
23. Fog Computing: What is it?
● Similar to, but distinct from, “Edge” computing
– Fog pushes processing to a Fog node or gateway
– Edge places it on the devices themselves
● A decentralized computing infrastructure
● Move your compute resources & application
services closer to the data
● The goal is to improve efficiency and reduce the
amount of data that must be transported to the
cloud for processing, analysis, and storage.
26. Apache Spark
● A distributed compute engine that supports Fast Data via its
in-memory, distributed processing capability and its bundled APIs.
It can "...run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk."
– Databricks
27. What Makes it Fast?
● MapReduce on steroids
1. Spark passes data directly to other operations
2. In-memory processing of distributed data
3. A long-running JVM on each executor (no per-task startup cost)
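Point 1 above (passing data directly between operations) can be sketched in plain Python, not the Spark API: chained generators hand each record straight to the next stage, much as Spark pipelines narrow transformations in memory, rather than writing every intermediate result to disk the way classic Hadoop MapReduce does between jobs. The temperature pipeline and threshold are hypothetical.

```python
def parse(lines):
    """Stage 1: parse raw strings into numbers, one record at a time."""
    for line in lines:
        yield float(line)

def to_celsius(temps_f):
    """Stage 2: transform each record as it arrives from stage 1."""
    for f in temps_f:
        yield (f - 32) * 5 / 9

def above(temps_c, threshold):
    """Stage 3: filter; records flow through without being materialized."""
    for c in temps_c:
        if c > threshold:
            yield c

raw = ["212", "32", "98.6"]
# Each reading flows through all three stages before the next is read:
hot = list(above(to_celsius(parse(raw)), 30))
# hot contains the two readings above 30 °C
```

No intermediate list (let alone a file on disk) is built between stages, which is the essence of why in-memory pipelining beats write-read-write batch processing.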
32. STATE OF THE ART
Commercial & Open-Source
33. Spark Core Concepts
● RDD - Resilient Distributed Dataset
– Join RDDs from different sources
● Dataframes
– Allow you to work with data in a table structure
– API for building a relational query plan
● Exactly-once semantics
– Each record affects the result once, with no duplicates
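The exactly-once bullet can be sketched in plain Python (this is not the Spark API, which achieves the guarantee via checkpointing and write-ahead logs): tracking already-seen event ids means a redelivered message is ignored instead of double-counted. The event ids and amounts are hypothetical.

```python
def process(events):
    """Sum amounts so that each logical event counts exactly once,
    even if the transport layer delivers some messages more than once."""
    seen, total = set(), 0
    for event_id, amount in events:
        if event_id in seen:   # duplicate delivery -- skip it
            continue
        seen.add(event_id)
        total += amount        # each event affects the result once
    return total

# "evt-2" is delivered twice, but counted once:
events = [("evt-1", 10), ("evt-2", 5), ("evt-2", 5), ("evt-3", 1)]
print(process(events))  # 16
```

Without the dedup set the total would be 21; the idempotent consumer is what turns at-least-once delivery into an exactly-once result.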