Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Getting insights from IoT data with Apache Spark and Apache Bahir


Published on

The Internet of Things (IoT) is all about connected devices that produce and exchange data, and producing insights from these high volumes of data is challenging. On this session, we will start by providing a quick introduction to the MQTT protocol, and focus on using AI and machine learning techniques to provide insights from data collected from IoT devices. We will present some common AI concepts and techniques used by the industry to deploy state-of-the-art smart IoT systems. These techniques allow systems to determined patterns from the data, predict and prevent failures as well as suggest actions that can be used to minimize or avoid IoT device breakdowns on an intelligent way beyond rule-based and database search approaches. We will finish with a demo that puts together all the techniques discussed in an application that uses Apache Spark and Apache Bahir support for MQTT.

Published in: Data & Analytics
  • Be the first to comment

Getting insights from IoT data with Apache Spark and Apache Bahir

  1. 1. Getting Insights from IoT data with Apache Spark & Apache Bahir Luciano Resende June 20th , 2018 1
  2. 2. About me - Luciano Resende 2 Data Science Platform Architect – IBM – CODAIT • Have been contributing to open source at ASF for over 10 years • Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, Apache Spark among other projects related to AI/ML platforms @lresende1975
  3. 3. 4 Open Source Leadership & Contributions IBM generated open source innovation • 137 Code Open (dWO) projects w/1000+ Github projects • 4 graduates: Node-Red, OpenWhisk, SystemML, Blockchain fabric to full open governance in the last year • Community • IBM focused on 18 strategic communities • Drive open governance in “Centers of Gravity” • IBM Leaders drive key technologies and assure freedom of action The IBM OS Way is now open sourced • Training, Recognition, Tooling • Organization, Consuming, Contributing 2018 / © 2018 IBM Corporation
  4. 4. 5 IBM’s history of strong AI leadership 1997: Deep Blue • Deep Blue became the first machine to beat a world chess champion in tournament play 2011: Jeopardy! • Watson beat two top Jeopardy! champions 1968, 2001: A Space Odyssey • IBM was a technical advisor • HAL is “the latest in machine intelligence” 2018: Open Tech, AI & emerging standards • New IBM centers of gravity for AI • OS projects increasing exponentially • Emerging global standards in AI
  5. 5. Center for Open Source Data and AI Technologies CODAIT codait (French) = coder/coded CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission 6
  6. 6. Agenda 7 Internet of Things IoT Use Cases IoT Design Patterns Apache Spark & Apache Bahir Live Demo – Anomaly Detection Summary References Q&A
  7. 7. Internet of Things - IoT 8
  8. 8. What is THE INTERNET OF THINGS (IoT) ? THE TERM IOT WAS FIRST TOUTED BY KEVIN ASHTON, IN 1999. It’s more than just machine to machine communications. It’s about ecosystems of devices that form relevant connections to people and other devices to exchange data. It’s a useful term to describe the world of connected and wearable devices that’s emerging. 9
  9. 9. IoT - INTERACTIONS BETWEEN multiple ENTITIES 10 control observe inform command actuate inform PEOPLE THINGS SOFTWARE
  10. 10. Some IoT EXAMPLES 11 SMART HOMES TRANSPORT WEARABLES INDUSTRY DISPLAYS HEALTH From thermostats to smart switches and remote controls to security systems Self-driving cars, drones for delivering goods , etc Smartwatches, and other devices enabling control and providing monitoring capabilities Robotics, sensors for predicting quality, failures, etc Not only VR, but many new displays enabling gesture controls, haptic interfaces, etc Connected health, in partnership with wearables for monitoring health metrics, among other examples
  11. 11. Some IoT PATTERNS 12 • Remote control • Security analysis • Edge analytics • Historical data analysis • Distributed Platforms • Real-time decisions
  12. 12. IoT Applications 13
  13. 13. The Weather Company The Weather Company data - Surface observations - precipitation - radar - satellite - personal weather stations - lightning sources - data collected from planes every day - etc.
  14. 14. Home Automation & Security - Multiple connected or standalone devices - Controlled by Voice - Amazon Echo (Alexa) - Google Home - Apple HomePod (Siri) 15
  15. 15. TESLA connected cars CONNECTED VEHICLES IS ONE EXAMPLE OF THE IoT. It’s not just about Google Maps in cars. When Tesla finds a software fault with their vehicle rather than issuing an expensive and damaging recall, they simply updated the car’s operating system over the air. [hcp:// /teslas- air-fix-best-example- yet-internet-things/] 16
  16. 16. AMAZON Go AMAZON GO – No lines, no checkout, just grab and go 17
  17. 17. Industrial Internet of Things 18 - Smart factory - Predictive and remote maintenance - Smart metering and smart grid - Industrial security - Industrial heating, ventilation and air conditioning - Asset tracking and smart logistics Reference:
  18. 18. IoT Design Patterns 19
  19. 19. LAMBDA Architecture Lambda architecture is a data- processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault- tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data 20 Images:
  20. 20. KAPA Architecture 21 Images: The Kappa architecture simplifies the Lambda architecture by removing the batch layer and replacing it with a streaming layer.
  21. 21. REALTIME Data Processing best practice 22 Pub/Sub Component Data Processor Data Storage One recommendation for processing streaming of massive quantities of data is to add a queue component to front-end the processing of the data. This enables a more fault-tolerant solution, where in conjunction with state management, the runtime application can have failures and subsequently continue processing from the same data point.
  22. 22. Building an IoT Application 23
  23. 23. MQTT – IoT Connectivity Protocol 24 • Constrained devices • Low bandwidth connection • Intermittent connections
  24. 24. MQTT – IoT Connectivity Protocol 25 Connect + Publish + Subscribe ~1990 IBM / Eurotech 2010 Published 2011 Eclipse M2M / Paho 2014 OASIS Open spec + 40 client implementations Minimal overhead Tiny Clients (Java 170KB) History Header 2-4 bytes (publish) 14 bytes (connect) V5 May 2018
  25. 25. MQTT – Quality of Service 26 MQTT Broker QoS0 QoS1 QoS2 At most once At least once Exactly once . No connection failover . Never duplicate . Has connection failover . Can duplicate . Has connection failover . Never duplicate
  26. 26. MQTT – World usage Smart Home Automation Messaging Notable Mentions: - IBM IoT Platform - AWS IoT - Microsoft IoT Hub - Facebook Messenger 27
  27. 27. Apache Spark 28
  28. 28. Apache Spark Introduction 29 Spark Core Spark SQL Spark Streaming Spark ML Spark GraphX executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework general compute engine, handles distributed task dispatching, scheduling and basic I/O functions large variety of data sources and formats can be supported, both on-premise or cloud BigInsights (HDFS) Cloudant dashDB SQL DB
  29. 29. Apache Spark Evolution 30
  30. 30. Apache Spark – Spark SQL 31 Spark SQL Unified data access APIS: Query structured data sets with SQL or Dataset/DataFrame APIs Fast, familiar query language across all of your enterprise data RDBMS Data Sources Structured Streaming Data Sources
  31. 31. Apache Spark – Spark SQL 32 You can run SQL statement with SparkSession.sql(…) interface: val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() spark.sql(“create table T1 (c1 int, c2 int) stored as parquet”) val ds = spark.sql(“select * from T1”) You can further transform the resultant dataset: val ds1 = ds.groupBy(“c1”).agg(“c2”-> “sum”) val ds2 = ds.orderBy(“c1”) The result is a DataFrame / Dataset[Row] displays the rows
  32. 32. Apache Spark – Spark SQL You can read from data sources using…) val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer) // loading csv data to a Dataset of Bank type val bankFromCSV =“hdfs://localhost:9000/data/bank.csv").as[Bank] // loading JSON data to a Dataset of Bank type val bankFromJSON =“hdfs://localhost:9000/data/bank.json").as[Bank] // select a column value from the Dataset‘age’).show() will return all rows of column “age” from this dataset. 33
  33. 33. Apache Spark – Spark SQL You can also configure a specific data source with specific options val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer) // loading csv data to a Dataset of Bank type val bankFromCSV = .option("header", ”true") // Use first line of all files as header .option("inferSchema", ”true") // Automatically infer data types .option("delimiter", " ") .csv("/users/lresende/data.csv”) .as[Bank]‘age).show() // will return all rows of column “age” from this dataset. 34
  34. 34. Apache Spark – Spark SQL Structured Streaming Unified programming model for streaming, interactive and batch queries 35Image source: Considers the data stream as an unbounded table
  35. 35. Apache Spark – Spark SQL Structured Streaming SQL regular APIs val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() val input = .schema(schema) .format(”csv") .load(”input-path") val result = input .select(”age”) .where(”age > 18”) result.write .format(”json”) . save(” dest-path”) 36 Structured Streaming APIs val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() val input = spark.readStream .schema(schema) .format(”csv") .load(”input-path") val result = input .select(”age”) .where(”age > 18”) result.write .format(”json”) . startStream(” dest-path”)
  36. 36. Apache Spark – Spark Streaming 37 Spark Streaming Micro-batch event processing for near-real time analytics e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc. No multi-threading or parallel process programming required
  37. 37. Apache Spark – Spark Streaming Also known as discretized stream or DStream Abstracts a continuous stream of data Based on micro-batching Based on RDDs 38
  38. 38. Apache Spark – Spark Streaming val sparkConf = new SparkConf() .setAppName("MQTTWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2) val words = lines.flatMap(x => x.split(" ")) val wordCounts = => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() 39
  39. 39. Apache Bahir 40
  40. 40. Apache Bahir Project Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. Currently, Bahir provides extensions for Apache Spark and Apache Flink.
  41. 41. Bahir extensions for Apache Spark MQTT – Enables reading data from MQTT Servers using Spark Streaming or Structured streaming. • • Couch DB/Cloudant– Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming. Twitter – Enables reading social data from twitter using Spark Streaming. • Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming. • ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming. • 42
  42. 42. Bahir extensions for Apache Spark Google Cloud Pub/Sub – Add spark streaming connector to Google Cloud Pub/Sub 43
  43. 43. Apache Spark extensions in Bahir Adding Bahir extensions into your application - Using SBT libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0” - Using Maven <dependency> <groupId>org.apache.bahir</groupId> <artifactId>spark-streaming-mqtt_2.11 </artifactId> <version>2.2.0</version> </dependency> 44
  44. 44. Apache Spark extensions in Bahir Submitting applications with Bahir extensions to Spark - Spark-shell bin/spark-shell --packages org.apache.bahir:spark-streaming_mqtt_2.11:2.2.0 ….. - Spark-submit bin/spark-submit --packages org.apache.bahir:spark-streaming_mqtt_2.11:2.2.0 ….. 45
  45. 45. Live Demo 46
  46. 46. IoT Analytics – Anomaly Detection The demo environment 47 Node.js Web app Simulates Elevator IoT devices Elevator simulator Metrics: • Weight • Speed • Power • Temperature • System MQTT Mosquitto MQTT JavaScript Client MQTT Streaming Connector
  47. 47. IoT Analytics – Anomaly Detection The Moving Z-Score model scores anomalies in a univariate sequential dataset, often a time series. The moving Z-score is a very simple model for measuring the anomalousness of each point in a sequential dataset like a time series. Given a window size w, the moving Z-score is the number of standard deviations each observation is away from the mean, where the mean and standard deviation are computed only over the previous w observations. 48 Reference: Reference: Z-Score Moving Z-Score Moving mean and moving standard deviation
  48. 48. Summary 49
  49. 49. Summary – Take away points IoT Design Patterns - Lambda and Kappa Architecture - Real Time data processing best practice Apache Spark - IoT Analytics Runtime with support for ”Continuous Applications” Apache Bahir - Bring access to IoT data via supported connectors (e.g. MQTT) IoT Applications - Using Spark and Bahir to start processing IoT data in near real time using Spark Streaming - Anomaly detection using Moving Z-Scores model 50
  50. 50. Join the Apache Bahir community 51
  51. 51. References Apache Bahir Documentation for Apache Spark extensions Source Repositories Demo Repository 52Image source:
  52. 52. 53May 17, 2018 / © 2018 IBM Corporation CALL FOR CODE KEYNOTE With Jeffrey Borek, 10:00 AM
  53. 53. 54March 30 2018 / © 2018 IBM Corporation