
From Device to Data Center to Insights: Architectural Considerations for the Internet of Anything


Published at DataWorks Summit San Jose 2016



  1. From Device to Data Center to Insights Architectural Considerations for the Internet of Anything P. Taylor Goetz, Hortonworks @ptgoetz
  2. About Me • Tech Staff @ Hortonworks • PMC Chair, Apache Storm • ASF Member • PMC, Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex • Mentor/PPMC, Apache Eagle (Incubating), Apache Mynewt (Incubating), Apache Metron (Incubating), Apache Gossip (Incubating)
  3. 26 billion IoT devices by 2020 (Gartner)
  4. IPv4 Address Space: ~4.3 billion addresses
  5. IoT Growth • Everyone here should know IoT is huge • Sensors, Phones, Connected Cars, Wearables, Software-as-a-Sensor, ... • Cuts across virtually all industries
  6. IoT Architecture
  7. Key Architectural Tiers • Origin: Devices and Data Sources • Transport: Orchestrating Bi-Directional Data Flow Between Sources • Analytics: Analysis of Unbounded (Streaming) and Bounded (Batch) Data, and Acting in Response
  8. Origin Tier Birthplace of IoT Data
  9. Origin Tier • Where data is born, but also a destination • Sensors and Devices • Constrained Hubs/Gateways
  10. Origin Tier Devices are getting smaller, cheaper, and increasingly network enabled. Examples: • RaspberryPi ($35, Full OS) • ESP8266 (<$5 WiFi-enabled microcontroller)
  11. Origin Tier Devices in the Origin Tier both transmit and receive data. • Command and Control • Actuators (interaction with the physical environment) • End user alerts and notifications
  12. IoT Protocol Considerations
  13. IoT Protocol Considerations • Device-Device / Device-Gateway Communication • Radio Frequency Protocols • IP-based Protocols
  14. IoT Protocol Considerations Radio Frequency Protocols • Typically for very resource-constrained devices (Ex: Wireless sensors in a home security system) • Usually involve an intermediary hub/gateway as a protocol bridge (Ex: Main panel in a home security system) • Short range • Low Power
  15. Radio Frequency Protocols ZigBee • Intended for low power applications (~2 yr. battery life) • Low data rates • Simpler and less expensive than WPANs like Bluetooth
  16. Radio Frequency Protocols ZigBee • Range: 10–100 meters LOS (between nodes, but messages can hop in a mesh network) • Data Rate: 250 kbit/s • Supports Star, Tree, and Mesh network topologies • Requires a coordinator device for every network (usually the hub/gateway)
  17. Radio Frequency Protocols Z-Wave • Targets home automation • Low power/Low data rate • Proprietary • Sole chip vendor
  18. Radio Frequency Protocols Z-Wave • Range: ~30 meters LOS (between nodes, but messages can hop) • Data Rate: 100 kbit/s • Forms source-routed mesh networks (can route around failures/obstacles) • Devices must be paired • Requires a primary controller (e.g. the hub/gateway) • Max 232 devices per network (but networks can be bridged)
  19. Radio Frequency Protocols Bluetooth/Bluetooth LE • Targets wireless computer and device accessories • High data rates • Do not form routed networks like ZigBee and Z-Wave • Usually one host to many device pairing • Range: 0.5 m (Class 4) - 100 m (Class 1) • Data Rate: 1 Mbit/s - 24 Mbit/s
  20. Radio Frequency Protocols Thread • New wireless protocol introduced by Nest (Google/Alphabet), Samsung, ARM, Qualcomm • Built on top of the same (IEEE 802.15.4) specification as ZigBee • IPv6-based • Mesh network with hops supported • ~250 devices per network • Very low power (purported years of operation on a single AA with deep sleep modes) • Very new/unsure future — WiFi, Bluetooth, etc. already ubiquitous
  21. IoT Protocol Considerations IP-Based Protocols • Require a full IP stack • Higher power consumption • Longer range (e.g. WiFi)
  22. IP-Based Protocols CoAP - Constrained Application Protocol • Designed to be used on microcontrollers with as little as 10 KB of memory • Simple request/response protocol • Much like HTTP but based on UDP • Based on the REST model (GET, PUT, POST, DELETE) • Strong security via DTLS (Datagram Transport Layer Security)
  23. IP-Based Protocols CoAP - Constrained Application Protocol • Simple 4-byte header • Subset of MIME types and HTTP response codes • Data model agnostic • One-to-one • Transport (UDP) <— Base Messaging (Simple Confirmable/Non-Confirmable message transfer) <— REST Semantics
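The 4-byte fixed header mentioned on the slide above packs version, message type, token length, code, and message ID into 32 bits (per RFC 7252). A minimal Python sketch of decoding it — the helper name and the dict it returns are mine, not from any CoAP library:

```python
import struct

# CoAP fixed header layout: Ver(2 bits) | Type(2 bits) | TKL(4 bits) | Code(8 bits) | Message ID(16 bits)
COAP_TYPES = {0: "CON", 1: "NON", 2: "ACK", 3: "RST"}  # Confirmable / Non-Confirmable / Ack / Reset

def parse_coap_header(data: bytes) -> dict:
    """Decode the 4-byte fixed header of a CoAP message."""
    first, code, message_id = struct.unpack("!BBH", data[:4])
    return {
        "version": first >> 6,              # must be 1 for RFC 7252
        "type": COAP_TYPES[(first >> 4) & 0x3],
        "token_length": first & 0x0F,
        "code": "%d.%02d" % (code >> 5, code & 0x1F),  # e.g. 0.01 = GET, 2.05 = Content
        "message_id": message_id,
    }

# A Confirmable GET (code 0.01) with message ID 0x1234 and no token:
header = parse_coap_header(bytes([0x40, 0x01, 0x12, 0x34]))
```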
  24. IP-Based Protocols MQTT - Message Queue Telemetry Transport • Pub/Sub messaging protocol • Requires a broker (though brokers can be lightweight) • many-to-many broadcast
  25. IP-Based Protocols MQTT - Message Queue Telemetry Transport • Message == Topic + Payload • Topics: users/ptgoetz/office/thermostat • Topic wildcards: • Single level (+): users/ptgoetz/+/thermostat • Multi-level (#): users/ptgoetz/office/# • Payload: Just a bunch of bytes (you define the schema)
  26. IP-Based Protocols MQTT - Message Queue Telemetry Transport • Delivery guarantees (QoS): • 0: At-most-once • 1: At-least-once • 2: Exactly-once • Last will and testament (when a device goes offline) • Security via SSL/TLS
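The wildcard rules on slide 25 are easy to pin down in code: `+` matches exactly one topic level, `#` matches the remainder of the topic. A sketch of the matching semantics in plain Python — not an MQTT client, just the filter logic:

```python
def topic_matches(filter_str: str, topic: str) -> bool:
    """Check an MQTT topic against a subscription filter.

    '+' matches exactly one level; '#' matches all remaining levels.
    A sketch of the matching rules, not a full MQTT implementation.
    """
    f_levels = filter_str.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return True                  # multi-level wildcard swallows the rest
        if i >= len(t_levels):
            return False                 # topic ran out of levels
        if f != "+" and f != t_levels[i]:
            return False                 # '+' matches any single level
    return len(f_levels) == len(t_levels)

topic_matches("users/ptgoetz/+/thermostat", "users/ptgoetz/office/thermostat")   # True
topic_matches("users/ptgoetz/office/#", "users/ptgoetz/office/lamp/brightness")  # True
```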
  27. Apache Mynewt (incubating) • Real-time, modular OS for IoT devices • Designed for use in devices with power, memory and storage constraints • Support for many ARM Cortex-M based boards (including Arduino) • HAL for unified access to MCU features • Connectivity with Bluetooth LE • WiFi, CoAP, and Thread support (roadmap) • Remote Firmware Upgrades • Command-line tools for package management
  28. Transport Tier Data Flow From Device to Data Center
  29. Transport Tier • Connecting Edge Devices: • To and from the Analytics Tier (data center) • To and from one another (inter-device communication) • Bridging Protocols: • e.g. WPAN to IP • Collecting/Transforming/Enriching Data in Motion
  30. Apache NiFi
  31. Apache NiFi • Data flow orchestration tool • Guaranteed Delivery • Data provenance (important in the Analytics Tier) • Backpressure with release • Flow-specific QoS • Web-based UI for editing data flows • Data flows modifiable at runtime • Supports bi-directional data flows • Integrates with just about any system
  32. Apache NiFi Basic Concepts • Flow File: Unit of user data with associated key-value metadata • Processor: Components for creating, sending, receiving, transforming, routing, etc. Flow Files • Connection: Acts as the link between processors • Flow Controller: Brokers the exchange of data between processors • Process Group: Set of Processors and Connections with Input/Output ports. New components can be created by composition.
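The concepts on slide 32 can be made concrete with a toy model: a flow file is content plus key-value attributes, and processors transform or route flow files. This is plain Python purely for illustration — NiFi itself is a Java application whose flows are built in the web UI, and the processor names here are invented:

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Unit of user data: raw content plus key-value metadata."""
    content: bytes
    attributes: dict = field(default_factory=dict)

def enrich(ff: FlowFile) -> FlowFile:
    """Processor: tag each flow file with metadata as it passes through."""
    ff.attributes["source"] = "sensor-gateway"
    return ff

def route(ff: FlowFile) -> str:
    """Processor: choose a downstream connection based on an attribute,
    in the spirit of NiFi's RouteOnAttribute."""
    return "alerts" if ff.attributes.get("priority") == "high" else "archive"

ff = FlowFile(b'{"temp": 71}', {"priority": "high"})
queue = route(enrich(ff))   # routed to the "alerts" connection
```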
  33. Apache NiFi MiNiFi • Supplement to NiFi for constrained devices/environments • More suitable for edge devices • Small footprint • Designed to collect data near where it originates and integrate with NiFi
  34. Apache NiFi For more information, see the project documentation • Some of the best technical documentation I’ve ever seen
  35. Analytics Tier Acting on Insights
  36. Analytics Tier • Where IoT data often (but not always) intersects with Big Data platforms and Cloud Computing • Vertical scaling may suffice
  37. Analytics Tier • Many, many options… • [insert your definition of Hadoop here]
  38. Analytics Tier Key Platform Considerations: • Unbounded (Stream) data processing frequently necessary • Apache Storm, Apache Flink, etc. • Bounded (Batch) data processing frequently necessary • e.g. Training machine learning models, etc. • Apache Hadoop M/R, Apache Flink, Apache Spark • Time Series DB a common requirement • Apache HBase, Apache Cassandra, etc.
  39. Analytics Tier Key Platform Considerations: • Latency matters for many use cases • Latency can add up quickly, depending on the number of “hops” • Windowing semantics and flexibility
  40. When? The importance of event time(s).
  41. What is Event Time and why is it so important? • Event Times: Origin Time vs. Processing Time • Ex: Airplane Mode • Other types of Event Time: • Enrichment Time • Ingest Time • Processing Time 1, 2, n… • Exit Time (e.g. “return” events, C2, bi-directional communication)
  42. Choose a platform/API that gives you the most flexibility with respect to dealing with various event times.
  43. Future-Proofing and Scaling Small to Medium Scale: • Not Big Data • Investment in large-scale distributed system infrastructure wouldn’t make sense • YAGNI (Yet…) • Vertical scaling may suffice
  44. Future-Proofing and Scaling Medium to Large Scale: • A single server is no longer cutting it • “V”s are starting to pile up • Need to move to a distributed architecture to scale with increasing demand • Your data is now Big
  45. Apache Beam (incubating) • Unified API for dealing with bounded/unbounded data sources (i.e. batch/streaming) • One API. Multiple implementations (execution engines). Called “Runners” in Beamspeak.
  46. Apache Beam (incubating) • Major focus on Windowing and properly dealing with Event Time(s) • Sliding Windows, Tumbling Windows, Session Windows, etc. • Watermark capabilities for dealing with late data
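Tumbling windows and watermarks fit in a few lines of plain Python: events are bucketed by their event time, and a watermark (an estimate of how far event time has progressed) decides which windows are closed, making later-arriving events "late". This is a conceptual sketch, not the Beam API, and the 30-second watermark lag is an arbitrary assumption:

```python
from collections import defaultdict

WINDOW_SIZE = 60   # one-minute tumbling windows, in seconds

def window_start(event_time: int) -> int:
    """Align an event time to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SIZE)

windows = defaultdict(list)
watermark = 0

def process(event_time: int, value) -> str:
    """Buffer an event into its event-time window, or reject it as late
    if the watermark says that window has already closed."""
    global watermark
    watermark = max(watermark, event_time - 30)   # assume watermark trails 30s behind
    start = window_start(event_time)
    if start + WINDOW_SIZE <= watermark:
        return "late"      # a real runner may still admit this via allowed lateness
    windows[start].append(value)
    return "buffered"

r1 = process(125, "a")   # goes into window [120, 180)
r2 = process(65, "b")    # window [60, 120) still open
r3 = process(10, "c")    # window [0, 60) closed behind the watermark -> late
```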
  47. Apache Beam (incubating) • Runner/Execution Engine Availability • Local runner (single machine) • Runners for Google Cloud Dataflow, Flink and Spark • Others underway: Apache Storm, Apache Apex and others
  48. Apache Beam (incubating) • Choose the right runner for your current scaling and organizational needs (you can switch later as necessary) • Understand the limits of different runner implementations • Outside of Google Cloud Dataflow, the Flink runner is currently the most feature-complete (this will change)
  49. Apache Beam (incubating) For a technical deep dive into Apache Beam: Apache Beam: A Unified Model for Batch and Streaming Data Processing - Davor Bonaci, Google Inc. Thursday 4:10PM, Ballroom A
  50. Firmware, Parsers, and Schemas (Oh my!)
  51. Problem: Data Formats • Many IoT devices transmit data as a raw array of bytes • The format of that data may be proprietary • To be of any use it must be parsed into a machine-readable format (i.e. Schema) • Once parsed, you need to know the schema
  52. Problem: Firmware Versions • Deployed IoT devices may be running any number of versions • Data formats may differ between firmware versions • Multiple parsers may be necessary to accommodate different device types and firmware versions
  53. Solution: Parser Registry • Allow manufacturers to supply proprietary parsers, load at runtime • Parser API to include way to discover schema • Tag data with device type + firmware version at the hub/gateway • Look up associated parser when data arrives • (This can be done in either the Transport or Analytics tier)
  54. Solution: Schema Registry • When parsers are registered, also register the associated schema • Downstream components (Transport/Analytics Tier) discover schema based on metadata
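The two registries above can be sketched together: parsers are keyed by (device type, firmware version), each registered alongside the schema it produces, so downstream components can discover both from the message's metadata tags. The device, schema, and function names here are hypothetical, invented for the sketch:

```python
import struct

PARSERS = {}   # (device_type, firmware_version) -> {"parse": fn, "schema": dict}

def register(device_type: str, firmware: str, schema: dict):
    """Decorator: register a parser and its schema under device type + firmware."""
    def wrap(fn):
        PARSERS[(device_type, firmware)] = {"parse": fn, "schema": schema}
        return fn
    return wrap

@register("thermostat", "1.0", schema={"temp_c": "int16", "battery": "uint8"})
def parse_thermostat_v1(payload: bytes) -> dict:
    """Parse the (hypothetical) v1.0 thermostat wire format: big-endian
    signed 16-bit temperature followed by an unsigned battery percentage."""
    temp, battery = struct.unpack("!hB", payload)
    return {"temp_c": temp, "battery": battery}

def handle(device_type: str, firmware: str, payload: bytes):
    """Look up the parser for a message tagged at the hub/gateway and apply it,
    returning the schema alongside the parsed record."""
    entry = PARSERS[(device_type, firmware)]   # KeyError -> unknown device/firmware
    return entry["schema"], entry["parse"](payload)

schema, record = handle("thermostat", "1.0", struct.pack("!hB", 215, 87))
```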
  55. Who owns your IoT data? Hint: It may not be you.
  56. Who owns your data? • Beware of 3rd-party device manufacturers • Data is valuable, and everyone wants it • Frequently exclusive access
  57. Who owns your data? • Device manufacturers may hoard data • Retention policies limit how long you can store the data • Aggregate/Derivative data okay, but what’s the definition?
  58. Thank you! Questions? P. Taylor Goetz, Hortonworks @ptgoetz