From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything

From Device to Data Center to Insights
Architectural Considerations for the Internet of Anything
P. Taylor Goetz, Hortonworks
@ptgoetz

About Me
• Tech Staff @ Hortonworks
• PMC Chair, Apache Storm
• ASF Member
• PMC, Apache Incubator, Apache Arrow, Apache
Kylin, Apache Apex
• Mentor/PPMC, Apache Eagle (Incubating), Apache
Mynewt (Incubating), Apache Metron (Incubating),
Apache Gossip (Incubating)

26 billion IoT devices by 2020
-Gartner
http://www.gartner.com/newsroom/id/2636073

IPv4 Address Space: 4.6 billion

IoT Growth
• Everyone here should know IoT is huge
• Sensors, Phones, Connected Cars, Wearables, Software-as-a-
Sensor, ...
• Cuts across virtually all industries

Key Architectural Tiers
• Origin: Devices and Data Sources
• Transport: Orchestrating Bi-Directional Data Flow Between Sources
• Analytics: Analysis of Unbounded (Streaming) and Bounded (Batch)
Data, and Acting in Response

Origin Tier
Birthplace of IoT Data

Origin Tier
• Where data is born, but also a destination
• Sensors and Devices
• Constrained Hubs/Gateways

Origin Tier
Devices are getting smaller, cheaper, and increasingly network
enabled.
Examples:
• RaspberryPi ($35, Full OS)
• ESP8266 (<$5 WiFi-enabled microcontroller)

Origin Tier
Devices in the Origin Tier both transmit and receive data.
• Command and Control
• Actuators (interaction with the physical environment)
• End user alerts and notifications

IoT Protocol Considerations
• Device-Device / Device-Gateway Communication
• Radio Frequency Protocols
• IP-based Protocols

Radio Frequency Protocols
• Typically for very resource-constrained devices (Ex: Wireless
sensors in a home security system)
• Usually involve an intermediary hub/gateway as a protocol bridge
(Ex: Main panel in a home security system)
• Short range
• Low Power

ZigBee
• Intended for low power applications (~2 yr. battery life)
• Low data rates
• Simpler and less expensive that WPANs like Bluetooth

ZigBee
• Range: 10–100 meters LOS (between nodes, but messages can
hop in a mesh network)
• Data Rate: 250 kbit/s
• Supports Star, Tree, and Mesh network topologies
• Requires a coordinator device for every network (usually the
hub/gateway)

Z-Wave
• Targets home automation
• Low power/Low data rate
• Proprietary
• Sole chip vendor

Z-Wave
• Range: ~30 meters LOS (between nodes, but messages can hop)
• Data Rate: 100kbit/s
• Form source-routed mesh-networks (can route around failures/obstacles)
• Devices must be paired
• Requires a primary controller (e.g. the hub/gateway)
• Max 232 devices per network (but networks can be bridged)

Bluetooth/Blootooth LE
• Targets wireless computer and device accessories
• High data rates
• Do not form routed networks like Zigbee and Z-Wave
• Usually one host to many device pairing
• Range: 0.5m (Class 4) - 100m (Class 1)
• Data Rate: 1 Mbit/s - 24 Mbit/s

Thread
• New wireless protocol introduced by Nest (Google/Alphabet), Samsung, ARM, Qualcomm
• Built on top of the same (IEEE 802.15.4) specification as ZigBee
• IPv6-based
• Mesh network with hops supported
• ~250 devices per network
• Very low power (purported years of operation on a single AA with deep sleep modes)
• Very new/unsure future — WiFi, Bluetooth, etc. already ubiquitous

IP-Based Protocols
• Require a full IP stack
• Higher power consumption
• Longer range (e.g. WiFi)

IP-Based Protocols
CoAP - Constrained Application Protocol
• Designed to be used on micro controllers with as little as 10k of
memory.
• Simple request/response protocol
• Much like HTTP but based on UDP
• Based on the REST model (GET, PUT, POST, DELETE)
• Strong security via DTLS (Datagram Transport Layer Security)

IP-Based Protocols
CoAP - Constrained Application Protocol
• Simple 4-byte header
• Subset of MIME types and HTTP response codes
• Data model agnostic
• one-to-one
• Tranport (UDP) <— Base Messaging (Simple Confirmable/Non-
Confirmable message transfer) <— REST Semantics

IP-Based Protocols
MQTT - Message Queue Telemetry Transport
• Pub/Sub messaging protocol
• Requires a broker (though brokers can be lightweight)
• many-to-many broadcast

IP-Based Protocols
• Message == Topic + Payload
• Topics: users/ptgoetz/office/thermostat
• Topic wildcards:
• Single level (+): users/ptgoetz/+/thermostat
• Multi-level (#): users/ptgoetz/office/#
• Payload: Just a bunch of bytes (you define the schema)

IP-Based Protocols
• Delivery guarantees (QoS):
• 0: At-most-once
• 1: At-least-once
• 2: Exactly-once
• Last will and testament (when a device goes offline)
• Security via SSL/TLS

Apache Mynewt (incubating)
• Real-time, modular OS for IoT devices
• Designed for use in devices with power, memory and
storage constraints
• Support for many ARM Cortex-M based boards
(including Arduino)
• HAL for unified access to MCU features
• Connectivity with Bluetooth LE
• WiFi, CoAP, and Thread support (roadmap)
• Remote Firmware Upgrades
• Command-line tools for package management

Transport Tier
Data Flow From Device to Data Center

Transport Tier
• Connecting Edge Devices:
• To and from the Analytics Tier (data center)
• To and from one another (inter-device communication)
• Bridging Protocols:
• e.g. WPAN to IP
• Collecting/Transforming/Enriching Data in Motion

Apache NiFi
• Data flow orchestration tool
• Guaranteed Delivery
• Data provenance (important in the Analytics
Tier)
• Backpressure with release
• Flow-specific QoS
• Web-based UI for editing data flows
• Data flows modifiable at runtime
• Supports bi-directional data flows
• Integrates with just about any system

Apache NiFi
Basic Concepts
• Flow File: Unit of user data with associated
key-value metadata
• Processor: Components for creating, sending,
receiving, transforming, routing, etc. Flow Files
• Connection: Acts as the link between
processors.
• Flow Controller: Brokers the exchange of data
between processors
• Process Group: Set of Processors and
Connections with Input/Output ports. New
components can be created by composition.

Apache NiFi minifi
• Supplement to NiFi for constrained
devices/environments
• More suitable for edge devices
• Small footprint
• Designed to collect data near where it
originates an integrate with NiFi

Apache NiFi
For more information:
• https://nifi.apache.org
Some of the best technical
documentation I’ve ever seen:
• https://nifi.apache.org/docs.html

Analytics Tier
Acting on Insights

Analytics Tier
• Where IoT data often (but not always) intersects with Big Data
platforms and Cloud Computing
• Vertical scaling may suffice

Analytics Tier
• Many, many options…
• [insert your definition of Hadoop here]

Analytics Tier
Key Platform Considerations:
• Unbounded (Stream) data processing frequently necessary
• Apache Storm, Apache Flink, etc.
• Bounded (Batch) data processing frequently necessary
• e.g. Training machine learning models, etc.
• Apache Hadoop M/R, Apache Flink, Apache Spark
• Time Series DB a common requirement
• Apache HBase, Apache Cassandra, etc.

Analytics Tier
Key Platform Considerations:
• Latency matters for many use cases
• Latency can add up quickly, depending on the number of “hops”
• Windowing semantics and flexibility

When?
The importance of event time(s).

What is Event Time and why is it so
important?
• Event Times: Origin Time vs. Processing Time
• Ex: Airplane Mode
• Other types of Event Time:
• Enrichment Time
• Ingest Time
• Processing Time 1, 2, n…
• Exit Time (e.g. “return” events, C2, bi-directional communication)

Choose a platform/API that gives you
the most flexibility with respect to
dealing with various event times.

Future-Proofing and Scaling
Small to Medium Scale:
• Not Big Data
• Investment in large-scale distributed system infrastructure wouldn’t
make sense.
• YAGNI (Yet…)
• Vertical scaling may suffice

Future-Proofing and Scaling
Medium to Large Scale:
• A single server is no longer cutting it
• “V”s are starting to pile up
• Need to move to a distributed architecture to scale with increasing
demand
• Your data is now Big

Apache Beam (incubating)
• Unified API for dealing with
bounded/unbounded data sources
(i.e. batch/streaming)
• One API. Multiple implementations
(execution engines). Called
“Runners” in Beamspeak.

• Major focus on Windowing and
properly dealing with Event Time(s)
• Sliding Windows, Tumbling Windows,
Session Windows, etc.
• Watermark capabilities for dealing
with late data

• Runner/Execution Engine Availability
• Local runner (single machine)
• Runners for Google Cloud
Dataflow, Flink and Spark
• Others underway: Apache Storm,
Apache Apex and others

• Choose the right runner for your
current scaling and organizational
needs (you can switch later as as
necessary)
• Understand the limits of different
runner implementations
• Outside of Google Data Flow, the
Flink runner is currently the most
feature-complete (this will change)

For a technical deep dive into Apache
Beam:
Apache Beam: A Unified Model for
Batch and Streaming Data
Processing
- Davor Bonaci, Google Inc.
Thursday 4:10PM, Ballroom A

Firmware, Parsers, and
Schemas
(Oh my!)

Problem: Data Formats
• Many IoT devices transmit data as a raw array of bytes
• The format of that data may be proprietary
• To be of any use it must be parsed into a machine-readable format
(i.e. Schema)
• Once parsed, you need to know the schema

Problem: Firmware Versions
• Deployed IoT devices may be running any number of versions
• Data formats may differ between firmware versions
• Multiple parsers may be necessary to accommodate different device
types and firmware versions

Solution: Parser Registry
• Allow manufacturers to supply proprietary parsers, load at runtime
• Parser API to include way to discover schema
• Tag data with device type + firmware version at the hub/gateway
• Look up associated parser when data arrives
• (This can be done either in either the Transport or Analytics tier)

Solution: Schema Registry
• When parsers are registered, also register the associated schema
• Downstream components (Transport/Analytics Tier) discover schema
based on metadata

Who owns your IoT data?
Hint: It may not be you.

Who owns your data?
• Beware of 3rd-party device manufacturers
• Data is valuable, and everyone wants it
• Frequently exclusive access

Who owns your data?
• Device manufacturers may hoard data.
• Retention policies limit how long you can store the data.
• Aggregate/Derivative data okay, but what’s the definition?

Thank you!
Questions?
P. Taylor Goetz, Hortonworks
@ptgoetz

From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything

Similar to From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything (20)

More from P. Taylor Goetz

More from P. Taylor Goetz (8)

Recently uploaded

Recently uploaded (20)

From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything

Editor's Notes