Spark Streaming the Industrial IoT
Washington DC Area Spark Interactive
Jim Haughwout, Chief Architect & VP of Software
May 24, 2016
Today’s Talk
• Challenges of streaming in general, with tips on doing it with Spark
• Special focus: IoT’s complexity of immediately tying the physical and data worlds together
• The talk is in three parts:
- Part I: Top-level POV of using Spark Streaming for Industrial IoT (Jim)
- Part II: Spark Streaming and Expert Systems – Spark + Drools (James)
- Part III: Overcoming Deficiencies in Streams (Anderson of MetiStream)
About Us
Savi Technology
• Sensor analytics solutions for Industrial IoT
• Focus areas are risk and performance
• Customers are Fortune-1000 and government
• Real-time visibility using complex event processing and machine learning algorithms
• Strategic insights using batch analytics
• Hardware Engineers, Data Engineers, Software Engineers, and Data Scientists
• HQ in Alexandria; offices across world
Some examples of what we do: HARDWARE • APPLICATIONS • ANALYTICS • SERVICES
Our version of Google Now: Parking -> Stationary
Progressive streaming analysis of IoT data: Rules + ML
Times in UTC
Alerting with predictive analytics: Commercial ETA
• 22 hours out, we predicted the driver would be late (giving advance notice)
• That prediction was within 5 minutes of actual (on a 68-hour trip)
Times in America/New_York
Batch discovery and prescriptive analytics: reducing theft
The third-largest transport firm had 2x the median rate of suspect issues
Use of Spark @ Savi
We have fully embraced Apache Spark
Spark is the core of our tech stack:
• Using Spark for batch processing since Spark 1.0, for streaming since Spark 1.2.1
- We use discretized streams (DStreams); our fastest batch interval is 1 second (minimal sketch below)
• 24x7 production operation, with full monitoring and high levels of test coverage
• Supporting Fortune-500 customers, managing billions of dollars of stuff in near real-time
• Fully-automated CI & CD with SOC II certification
• We launch new Spark software several times every week—push-button, with no visible downtime to customers
• Gives us enormous scale and cost advantages vs. traditional enterprise technologies
• Uptime over the last 12 months has been 100%—knock on wood
(13 months ago we had a brief outage caused by a DNS failure in AWS US-West-2)
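A minimal sketch of what a 1-second DStream job looks like (Scala, Spark 1.x; the app name and socket source are illustrative placeholders, not our ingestion path):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OneSecondStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("one-second-stream")
    val ssc = new StreamingContext(conf, Seconds(1)) // our fastest batch interval
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source; we actually read from Kafka
    lines.count().print() // per-batch record count
    ssc.start()
    ssc.awaitTermination()
  }
}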
Spark is at the core of our “hybrid” Lambda architecture
[Architecture diagram] Data sources—Sensor Readers, Sensor Meshes, Mobile Apps, Enterprise Data, Open Data, Partner Data, and In-house Analytic Tools—arrive over device links (RS-232, USB, pRFID, aRFID, Bluetooth, ZigBee, 802.11, 6LoWPAN, GSM, GPRS, 3G, 4G/LTE, SATCOM) into an Integration Layer speaking AMQP, CoAP, FTP, HTTP, MQTT, SOAP, TCP, UDP, and XMPP, fronted by the Savi IoT Adapter. The Speed Layer runs Sensor-Agnostic CEP and Domain-Specific CEP; the Batch Layer runs Batch Processing plus Modeling and Machine Learning over an Immutable Data Store. The Serving Layer exposes the Data Serving Layer, Notifications, Savi Apps, Customer Export, and REST APIs.
The Details: Tech stack distributions and versions
Data:
• Jetty 9.3.9
• Kafka 0.8.2 via CDH 5.3.3
• Spark 1.4.1 -> 1.6.1
• Scikit-learn 0.15.2
• Cassandra 2.1.8 via DSE 4.7
• GlusterFS 3.7
• PostgreSQL 9.3.3 with PostGIS
• Hadoop 2.5.0 via CDH 5.3.3
• Hive 0.13.1 via CDH 5.3.3
• Hue 3.7.0 via CDH 5.3.3
• Parquet-format 2.2.0 via Spark
• Parquet-mr 1.6.0 via Spark
• Gobblin 0.7.0
• Drools 6.3.0
• ZooKeeper 3.4.5 via CDH 5.3.3
Applications:
• Nginx
• Bootstrap
• D3.js, AmCharts, Flot
• WildFly
• Flask
• Shibboleth
• PostgreSQL
• DSE Cassandra
• DSE Solr
• Also mobile on iOS, Android
Tools:
• Github (Github Flow)
• Ansible
• Docker
• Jenkins
• Maven
• Bower
• Slack
• Fluentd
• Graylog
• Sentry
• Jupyter (PySpark, Folium, Pandas, Matplotlib, Scikit-learn, etc.)
We program in Scala 2.10, Java 8, Python 2.7, HTML5, LESS.css, and JavaScript
We are hosted in AWS but are not using any AWS-specific solutions (e.g., EMR)
Why we chose Spark
• We started on Apache Storm and MapReduce (we use a Lambda architecture)
• Moved to 100% Spark over the last 18 months (finished last summer)
• Spark is NOT the best at everything
• However, it is advancing quickly
• We are an analytics company: Spark provides a single unified framework
- Speed layer and batch layer
- Use by Engineering and Data Science
- Product apps and ad-hoc analytics
• Ultimately this gives us better agility and cost (development + operations)
For more on our journey see: http://bit.do/savi-spark
Spark Streaming @ Savi
Tips and lessons streaming data 24x7
Spark Streaming is a different animal
 Time is much more precious (and important) in the streaming world
- Seconds vs. minutes or hours
- Down-time or interruption is immediately visible to end users—in IoT this can lead to missing key events
- Need to avoid breakdown in stream due to surges or failures—both of which are more common in IoT
 Streaming resource utilization is different from batch
- CPU is rarely the limiting factor
- Memory is less of a limitation than is typical for Spark
- I/O is a much more common limiting factor
Some tips and lessons learned managing these differences…
Tips to defensively architect Spark Streaming
 Tip 1: Leverage Kafka (see the sketch after this list)
- Faster than HDFS, more durable than in-memory
- Supports parallel, independent consumption by multiple processing streams
- Supports FIFO ordering within partitions
 Tip 2: DAG of DAGs (a DAG of streaming apps and Kafka topics)
- Break the process graph—even near real-time—into critical and non-critical paths
- Route non-critical processing to separate streams, with their own persisted queues
- Do the same for interactions with lower-durability sources and targets
 Tip 3: (Caveat to Tip 2) Avoid over-complicating your DAG
- Every re-queue creates an opportunity for data to arrive out of order
- Instead, rely on at-least-once processing and add no-more-than-once protection to non-idempotent processing
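A hedged sketch of Tip 1 using the Spark 1.x direct Kafka API (Kafka 0.8). Broker addresses and the topic name are illustrative; ssc is a StreamingContext as in the earlier sketch:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("sensor-readings") // keep topics per app low (see the performance tips)
// Direct stream: one RDD partition per Kafka partition, FIFO within each
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)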
Tips to defensively architect Spark Streaming (cont.)
 Tip 4: Offload bad data to non-blocking paths (sketch below)
- Bad data will happen
- Design your apps to offload it to non-blocking paths (vs. failing)—this keeps the stream alive
 Tip 5: (Caveat to Tip 4) Wind down if infrastructure fails
- Running a streaming process on broken infrastructure creates many more problems
- Instead, wind down (and alert) and allow Kafka to help you recover
- A wind-down and restart will often “clear up” network or memory bottlenecks
 Tip 6: Preserve data lineage (and immutability)
- Preserve the full data lineage of each stage of processing—it will save you when dealing with real-world issues
- Keep everything, even failures—this allows you to replay data for analysis and recovery (you will need it)
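A sketch of Tip 4 (and Tip 6), continuing the Kafka sketch above. parseReading and the dead-letter path are hypothetical stand-ins; the point is that bad records are preserved on a non-blocking path instead of failing the batch:

import scala.util.{Failure, Success, Try}

val parsed = stream.map { case (_, raw) => (raw, Try(parseReading(raw))) }

// Divert failures to a non-blocking path, keeping the raw input for replay
parsed.foreachRDD { rdd =>
  val bad = rdd.collect { case (raw, Failure(e)) => s"${e.getMessage}\t$raw" }
  if (!bad.isEmpty()) bad.saveAsTextFile(s"/deadletter/batch-${System.currentTimeMillis}")
}

// The healthy stream continues downstream
val good = parsed.filter(_._2.isSuccess).map { case (_, t) => t.get }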
Tips: performance tuning Spark Streaming
 Tip 1: Over-subscribe your cores
- The minimum core count needed is N(source topics) + 2
- For efficiency, over-subscribe your cores; high multiples are fine
 Tip 2: Use broadcast variables to persist shared ephemeral rules (sketch after this list)
 Tip 3: Limit Kafka topics per app
- Counter-intuitive for defensive programming
- Avoids starvation due to imbalanced loads
 Tip 4: Avoid the shuffle
- Shuffle is tough on I/O; with streaming it is worse
- Instead, rely on Kafka partitioning
- However, Kafka offset partitioning is still a work-in-progress
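A sketch of Tip 2 under stated assumptions: RuleSet, loadRules, and evaluate are hypothetical. A broadcast ships the shared rule set to each executor once, rather than with every task:

// Broadcast the shared rules; executors read a local copy
var rules = ssc.sparkContext.broadcast(loadRules())

stream.foreachRDD { rdd =>
  val current = rules // capture the broadcast handle in the closure
  rdd.foreach(reading => current.value.evaluate(reading))
}

// When the ephemeral rules change, rebroadcast from the driver:
// rules.unpersist(); rules = ssc.sparkContext.broadcast(loadRules())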
Streaming Real-world Industrial IoT Data:
“It’s very different from the canonical Twitter stream-analysis teaching example”
IoT + Spark Streaming = Physical + Data (in near real-time)
All the normal “dirty data” issues apply—plus streaming means you have to handle much of this in near-real time
Challenge 1: Ingesting IoT data
 The “IoT Menagerie”
- Millions of source IPs: whitelisting is impossible
- Many transport protocols and standards: HTTP, FTP, CoAP, MQTT, TCP, UDP, X12, GPRS
 Several tools are available to ingest IoT transactions into your platform
- Some even go directly to Kafka for processing by Spark, Storm, and Flink
 However, not everything is a simple transaction—most is not
 The “obvious” solution—increasing the max Kafka message size—does not work:
- Bottlenecking and serialization issues
- Ultimately you will not be able to increase it enough
 Lessons Learned: Use hybrid ingestion (see the sketch after the table)
- Append critical metadata immediately at the point of ingestion
- Include a transaction ID and digital signature
- Split metadata from payload for complex and large data types
- This keeps memory low and is fully scalable
Streaming data types (share of transactions vs. share of data volume):
• Micro-batches: 30% of xacs, 10% of data
• Simple transactions: 20% of xacs, 35% of data
• Loggers: 5% of xacs, 10% of data
• Sensor constellations: 45% of xacs, 15% of data
• MIME media transactions: <1% of xacs, 30% of data
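A hedged sketch of hybrid ingestion: sign and identify each transaction at the edge, park large payloads out of band, and put only the small envelope on Kafka. payloadStore is a hypothetical blob-store client:

import java.security.MessageDigest
import java.util.UUID

case class IngestEnvelope(txId: String, sha256: String, source: String,
                          receivedAt: Long, payloadUri: String)

def ingest(raw: Array[Byte], source: String): IngestEnvelope = {
  // Critical metadata appended at the point of ingestion
  val sig = MessageDigest.getInstance("SHA-256").digest(raw)
              .map("%02x".format(_)).mkString
  val txId = UUID.randomUUID().toString
  val uri = payloadStore.put(txId, raw) // large payload split from metadata (hypothetical store)
  IngestEnvelope(txId, sig, source, System.currentTimeMillis, uri)
}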
Challenge 2: Handling stream interruptions and surges
• Massive increase in stream interruptions (vs. normal server flows)
- Loss of power
- Movement in and out of coverage
- Bad OTA updates (can cause false DDoS events)
• Often undetected by anyone but Spark
• Overcoming these:
- Monitor and alarm on anomalous values
- Tune your fetch rates to avoid overwhelming I/O (settings sketch below)
- Our hope: the new Spark back-pressure (still in beta)
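The fetch-rate and back-pressure knobs map to real Spark settings; a minimal sketch (the rate-cap value is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Hard cap on the per-partition fetch rate so a surge cannot overwhelm I/O
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Adaptive rate control, added in Spark 1.5
  .set("spark.streaming.backpressure.enabled", "true")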
Challenge 3: Cleansing and transforming IoT data
 Transmission, authentication, and formatting errors are much more frequent in IoT
- Ever had a cellphone call dropped, or a duplicate text?
- Data is rarely self-describing
- Firmware configuration-management issues
- Standards non-compliance
 Duplication is much more common (and complex) than in a traditional Lambda architecture
- Duplicate data can “hide” in unique wrappers
- Duplicate data can be obscured by transaction IDs
- Duplicates can arrive beyond any viably sustainable window duration
 Lessons Learned:
- Accept everything—even authentication errors
- Capture the entire lineage of processing (metadata and payload)
- Route failures away from the DAG—but preserve them to replay and recover
- Map data to its base atomic unit, THEN digitally sign and de-duplicate it (sketch after the diagram)
[Diagram: a unique transaction set—every transaction carries a unique header, but its facts may be duplicates (from a prior set), unique (to this set), or incomplete (in this set)]
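A sketch of the sign-then-de-duplicate lesson. Fact, normalize, and seenStore (a durable external key-value store; in-memory windows cannot span days of delay) are hypothetical:

import java.security.MessageDigest

// Hash the normalized atomic fact, not its wrapper, so duplicates
// cannot hide behind unique headers or transaction IDs
def signature(fact: Fact): String =
  MessageDigest.getInstance("SHA-256")
    .digest(normalize(fact).getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

val unique = facts.map(f => (signature(f), f))
                  .filter { case (sig, _) => seenStore.putIfAbsent(sig) } // true only on first sighting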
Transformation and Cleansing example: “Simple” raw IoT data
$690300SR86506702256878020160321155058-
16060a34ST-663-
000p00105300090008000030b74a67db4333102660000470
00a67fb4333102660000330009cd9b433310266000025000
10g81-
00077.09254000038.8064970066.7000002016032115514
2000010006.926480003.01000020e210000000000000001
00e21000000000000000200e2200000000000000055246a8
Transformation and Cleansing: Canonical format for analytics
Turning machine data into cleaned, self-describing, agnostic data that can be readily used for analytics and machine learning:
Sensor Message → Universal Read Format
Challenge 4: The small files problem
 Streaming data is many, many small files
- 100s or 1000s per second
 Adding them to HDFS creates the small-files problem
- Many files (NameNode swamping)
- Much smaller than the HDFS block size (inefficient)
 Delaying too long makes batch analysis stale
- Kafka does not support complex queries
 Lots of back and forth on this; our current best practice (sketch below):
- Organize streams by volume and type into Kafka topics
- Batch-extract by topic based on volume AND time
- Ultimately convert to Parquet for batch analytics
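A sketch of the compaction step (Spark 1.6 DataFrame API; paths, trigger, and column names are illustrative):

// Drain a topic's staged records on a volume-or-time trigger and land
// them as a few large Parquet files for the batch layer
val readings = sqlContext.read.json("/staging/sensor-readings/2016/05/24/*")
readings
  .coalesce(8) // a handful of large files, not thousands of small ones
  .write
  .partitionBy("sensorType", "day") // lay out for common batch queries
  .parquet("/warehouse/sensor-readings")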
And now, the hardest challenge: Streaming CEP…
Challenge 5: CEP processing messy physical realities
Spark Streaming needs to make decisions quickly enough to matter…
In the physical world, real-time gets stale very quickly
Sometimes, data just gets lost (or significantly delayed)
When streaming the IoT, the time lag of information is ever-present
Also, people have been known to “contradict” sensors
 Sometimes legitimate
 Sometimes mistaken
 Sometimes malicious
People will argue that the sensors (or rules) are wrong
Finally, once I alert you to something, I cannot undo it
Human memory is not a batch layer: it’s hard to forget Type I errors
 Prioritize Type I vs. Type II error bias based on context
 Windowing can be helpful, but not always
- Data can be delayed hours or days (windowing is not cost-effective)
 Use self-healing rule sets (and algorithms)
- Immutable-journal data models for state management (sketch below)
- Keep track of multiple time dimensions: latest, most recent
- Keep track of multiple signal dimensions: detected, reported
 Use the batch layer to assist with self-healing
- Re-order on review
- Auto-resolve based on new data
 Add human signals (to build trust)
- Do not hide corrections; make them clear
- Show the full time lineage
- Allow humans to re-order events to understand the effects of outages and delays
CEP in IoT: (Timeliness + Good Enough) > (Late + Perfect)
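One way to carry the multiple time and signal dimensions in an immutable journal; a hedged sketch with illustrative field names:

// Each correction is a new entry; nothing is deleted, so the batch
// layer can re-order and auto-resolve later
case class JournalEntry(
  entityId:   String,
  detectedAt: Long,           // when the sensor detected the signal
  reportedAt: Long,           // when the report reached our platform
  signal:     String,
  supersedes: Option[String]  // id of the prior entry this corrects, if any
)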
James will dive deeper into this…
 There are challenges to overcome when streaming the IoT
 Once you overcome these—and share insights with customers—the real fun begins. There is lots you can do with Spark
 Questions, ideas, comments:
jhaughwout@savi.com
 Starting to open source some tools at:
https://github.com/sensoranalytics/
 Visit us at 3601 Eisenhower Avenue
Thank you!