IoT Applications and Patterns using Apache Spark & Apache Bahir


The Internet of Things (IoT) is all about connected devices that produce and exchange data, and building applications that derive insights from these high volumes of data is challenging: it requires an understanding of multiple protocols, platforms, and other components. In this session, we will start with a quick introduction to IoT and some of the common analytic patterns used in IoT. We will also touch on the MQTT protocol, how it is used by IoT solutions, and some of the quality-of-service tradeoffs to consider when building an IoT application. We will then discuss the Apache Spark platform components that IoT applications use to process streaming data from devices.

We will also talk about Apache Bahir and some of its IoT connectors available for the Apache Spark platform, and go over the details of how to build, test, and deploy an IoT application for Apache Spark using the MQTT data source for the new Apache Spark Structured Streaming functionality.



  1. 1. IoT Applications and Patterns using Apache Spark & Apache Bahir – Luciano Resende, June 14th, 2018 © 2018 IBM Corporation
  2. 2. About me - Luciano Resende Data Science Platform Architect – IBM – CODAIT • Contributing to open source at the ASF for over 10 years • Currently contributing to: the Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, and Apache Spark, among other projects related to AI/ML platforms • lresende@apache.org • https://www.linkedin.com/in/lresende • @lresende1975 • https://github.com/lresende
  3. 3. Open Source Community Leadership – CODAIT • Founding Partner • 188+ project committers • 77+ projects • Key open source steering committee memberships • OSS Advisory Board
  4. 4. Center for Open Source Data and AI Technologies (CODAIT) codait.org – codait (French) = coder/coded (https://m.interglot.com/fr/en/codait) CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise. It is a relaunch of the Spark Technology Center (STC) to reflect an expanded mission.
  5. 5. Agenda • Introductions • Apache Spark • Apache Bahir • IoT Applications • Live Demo • Summary • References • Q&A
  6. 6. Apache Spark
  7. 7. Apache Spark Introduction • Spark Core: general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions • Spark SQL: executes SQL statements • Spark Streaming: performs streaming analytics using micro-batches • Spark ML: common machine learning and statistical algorithms • Spark GraphX: distributed graph processing framework • A large variety of data sources and formats can be supported, both on premises and in the cloud (e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB)
  8. 8. Apache Spark Evolution
  9. 9. Apache Spark – Spark SQL Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs. A fast, familiar query language across all of your enterprise data (RDBMS data sources, structured streaming data sources).
  10. 10. Apache Spark – Spark SQL
      You can run SQL statements with the SparkSession.sql(...) interface:

        val spark = SparkSession.builder()
          .appName("Demo")
          .getOrCreate()

        spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
        val ds = spark.sql("select * from T1")

      You can further transform the resulting Dataset:

        val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
        val ds2 = ds.orderBy("c1")

      The result is a DataFrame / Dataset[Row]; ds.show() displays the rows.
  11. 11. Apache Spark – Spark SQL
      You can read from data sources using SparkSession.read.format(...):

        val spark = SparkSession.builder()
          .appName("Demo")
          .getOrCreate()
        import spark.implicits._  // required for the .as[Bank] encoder

        case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

        // loading CSV data into a Dataset of Bank type
        // (see the next slide for header/schema options)
        val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]

        // loading JSON data into a Dataset of Bank type
        val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]

        // select a column value from the Dataset
        bankFromCSV.select("age").show()  // returns all rows of column "age" from this dataset
  12. 12. Apache Spark – Spark SQL
      You can also configure a specific data source with specific options:

        val spark = SparkSession.builder()
          .appName("Demo")
          .getOrCreate()
        import spark.implicits._

        case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

        // loading CSV data into a Dataset of Bank type
        val bankFromCSV = spark.read
          .option("header", "true")      // use the first line of each file as a header
          .option("inferSchema", "true") // automatically infer data types
          .option("delimiter", " ")
          .csv("/users/lresende/data.csv")
          .as[Bank]

        bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
  13. 13. Apache Spark – Spark SQL – Data Sources
      Data sources under the covers:
      - Data source registration (e.g. by passing the provider class to spark.read.format(...))
      - Provide a BaseRelation implementation
        • that implements support for table scans: TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
      - Detailed information available at
        • https://developer.ibm.com/code/2016/11/10/exploring-apache-spark-datasource-api/
      (A minimal sketch follows.)
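      To make the V1 contract concrete, here is a minimal sketch of a custom read-only data source; the provider and relation class names (DefaultSource, CounterRelation) and the "rows" option are hypothetical, not part of Spark or Bahir:

        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.{Row, SQLContext}
        import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
        import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

        // Instantiated by Spark when its package name is passed to spark.read.format(...)
        class DefaultSource extends RelationProvider {
          override def createRelation(
              sqlContext: SQLContext,
              parameters: Map[String, String]): BaseRelation =
            new CounterRelation(sqlContext, parameters.getOrElse("rows", "10").toInt)
        }

        // BaseRelation declares the schema; TableScan produces the data as RDD[Row]
        class CounterRelation(val sqlContext: SQLContext, rows: Int)
            extends BaseRelation with TableScan {

          override def schema: StructType =
            StructType(StructField("value", IntegerType, nullable = false) :: Nil)

          override def buildScan(): RDD[Row] =
            sqlContext.sparkContext.parallelize(1 to rows).map(Row(_))
        }

      With these classes on the classpath in a package such as com.example.counter, spark.read.format("com.example.counter").option("rows", "5").load() would return a five-row DataFrame.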
  14. 14. Apache Spark – Spark SQL – Data Sources Data Sources V1 limitations: - Leaks upper-level APIs (DataFrame/SQLContext) into the data source - Hard to extend the Data Sources API with more optimizations - No transaction guarantees in the write APIs - Limited extensibility
  15. 15. Apache Spark – Spark SQL – Data Sources Data Sources V2: - Support for row-based and columnar scans - Column pruning and filter push-down - Can report basic statistics and data partitioning - Transactional write API - Streaming source and sink support for micro-batch and continuous mode - Detailed information available at https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/ (A read-path sketch follows.)
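      For comparison, a minimal read-only sketch against the Spark 2.3-era V2 interfaces (these interfaces kept evolving and were reworked in later Spark releases); the class names and the fixed 1-to-5 range are hypothetical:

        import java.util.{ArrayList => JArrayList, List => JList}

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
        import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
        import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

        class CounterSourceV2 extends DataSourceV2 with ReadSupport {
          override def createReader(options: DataSourceOptions): DataSourceReader =
            new CounterReader
        }

        class CounterReader extends DataSourceReader {
          override def readSchema(): StructType =
            StructType(StructField("value", IntegerType, nullable = false) :: Nil)

          // One factory per partition; Spark creates the DataReader on executors
          override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
            val factories = new JArrayList[DataReaderFactory[Row]]()
            factories.add(new CounterReaderFactory(1, 5))
            factories
          }
        }

        class CounterReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
          override def createDataReader(): DataReader[Row] = new DataReader[Row] {
            private var current = start - 1
            override def next(): Boolean = { current += 1; current <= end }
            override def get(): Row = Row(current)
            override def close(): Unit = ()
          }
        }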
  16. 16. Apache Spark – Spark SQL Structured Streaming Unified programming model for streaming, interactive, and batch queries. Treats the data stream as an unbounded table. (Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
  17. 17. Apache Spark – Spark SQL Structured Streaming
      Regular SQL APIs:

        val spark = SparkSession.builder()
          .appName("Demo")
          .getOrCreate()

        val input = spark.read
          .schema(schema)
          .format("csv")
          .load("input-path")

        val result = input
          .select("age")
          .where("age > 18")

        result.write
          .format("json")
          .save("dest-path")

      Structured Streaming APIs:

        val spark = SparkSession.builder()
          .appName("Demo")
          .getOrCreate()

        val input = spark.readStream
          .schema(schema)
          .format("csv")
          .load("input-path")

        val result = input
          .select("age")
          .where("age > 18")

        result.writeStream
          .format("json")
          .option("checkpointLocation", "checkpoint-path") // file sinks require a checkpoint location
          .start("dest-path")
  18. 18. Apache Spark – Spark Streaming Micro-batch event processing for near-real-time analytics, e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hubs), etc. No multi-threading or parallel-processing programming required.
  19. 19. Apache Spark – Spark Streaming • Also known as discretized streams, or DStreams • Abstracts a continuous stream of data • Based on micro-batching • Built on RDDs
  20. 20. Apache Spark – Spark Streaming
      A word count over an MQTT stream (brokerUrl and topic are expected to be supplied, e.g. as command-line arguments):

        import org.apache.spark.SparkConf
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.mqtt.MQTTUtils

        val sparkConf = new SparkConf().setAppName("MQTTWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(2))

        // each DStream element is the payload of one MQTT message
        val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)

        val words = lines.flatMap(x => x.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
  21. 21. Apache Bahir
  22. 22. Origins of the Apache Bahir Project MAY/2016: Established as a top-level Apache project - PMC formed by Apache Spark committers/PMC members and ASF Members - Initial contributions imported from Apache Spark AUG/2016: The Apache Flink community joined Apache Bahir - Initial contributions of Flink extensions - In October 2016, Robert Metzger was elected as a committer
  23. 23. Origins of the Bahir name Naming an Apache project is a science! - We needed a name that wasn't already in use - It needed to be related to Spark We ended up with Bahir: a name of Arabic origin that means "sparkling", and is also associated with someone who succeeds at everything.
  24. 24. Why Apache Bahir It's an Apache project - and if you are here, you know what that means. Benefits of curating your extensions at Apache Bahir: - Apache governance - Apache license - Apache community - Apache brand
  25. 25. Why Apache Bahir Flexibility - Release flexibility • Not bound to the platform or component release cycle Shared infrastructure - Releases, CI, etc. Shared knowledge - Collaborate with experts on both the platform and component areas
  26. 26. Bahir extensions for Apache Spark
      - MQTT – enables reading data from MQTT servers using Spark Streaming or Structured Streaming (see the sketch below)
        • http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
        • http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
      - CouchDB/Cloudant – enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming
      - Twitter – enables reading social data from Twitter using Spark Streaming
        • http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
      - Akka – enables reading data from Akka actors using Spark Streaming or Structured Streaming
        • http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
      - ZeroMQ – enables reading data from ZeroMQ using Spark Streaming
        • http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
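      As an illustration of the MQTT connector, a minimal Structured Streaming sketch based on the Bahir documentation; the broker URL and topic name are assumptions for a local Mosquitto setup:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("MQTTStructuredStreaming")
          .getOrCreate()

        // the source produces a streaming DataFrame with `value` and `timestamp` columns
        val lines = spark.readStream
          .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
          .option("topic", "sensors/temperature")  // hypothetical topic name
          .load("tcp://localhost:1883")            // local MQTT broker assumed

        val query = lines.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/mqtt-checkpoint")
          .start()

        query.awaitTermination()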
  27. 27. Bahir extensions for Apache Spark Google Cloud Pub/Sub – adds a Spark Streaming connector for Google Cloud Pub/Sub
  28. 28. Apache Spark extensions in Bahir
      Adding Bahir extensions into your application:
      - Using SBT:

        libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"

      - Using Maven:

        <dependency>
          <groupId>org.apache.bahir</groupId>
          <artifactId>spark-streaming-mqtt_2.11</artifactId>
          <version>2.2.0</version>
        </dependency>
  29. 29. Apache Spark extensions in Bahir
      Submitting applications with Bahir extensions to Spark:
      - spark-shell:

        bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 ...

      - spark-submit:

        bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 ...
  30. 30. Internet of Things - IoT
  31. 31. IoT – Definition by Wikipedia The Internet of things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity which enable these objects to connect and exchange data.
  32. 32. IoT – Interaction between multiple entities [Diagram: Things, Software, and People interacting through observe, control, inform, command, and actuate relationships]
  33. 33. IoT Patterns – Some of them… • Remote control • Security analysis • Edge analytics • Historical data analysis • Distributed platforms • Real-time decisions
  34. 34. MQTT – M2M / IoT Connectivity Protocol Connect + Publish + Subscribe History: ~1999 created at IBM / Eurotech; 2010 specification published; 2011 Eclipse M2M / Paho; 2014 OASIS open spec, 40+ client implementations; May 2018 MQTT v5. Minimal overhead: the header is 2-4 bytes for publish and 14 bytes for connect; tiny clients (Java ~170KB).
  35. 35. MQTT – Quality of Service • QoS 0 – at most once: no redelivery after a connection failure, never duplicates • QoS 1 – at least once: redelivers after a connection failover, can duplicate • QoS 2 – exactly once: redelivers after a connection failover, never duplicates (see the sketch below)
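      To make the QoS levels concrete, a small sketch using the Eclipse Paho Java client (a common MQTT client library, not shown on the slide); the broker URL, topic, and payload are assumptions:

        import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
        import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

        val client = new MqttClient("tcp://localhost:1883",
          MqttClient.generateClientId(), new MemoryPersistence())
        client.connect()

        val payload = """{"temperature": 21.5}""".getBytes("UTF-8")

        // QoS 0: fire and forget; QoS 1: broker acknowledges, redelivery may
        // duplicate; QoS 2: four-way handshake, delivered exactly once
        for (qos <- Seq(0, 1, 2)) {
          val message = new MqttMessage(payload)
          message.setQos(qos)
          client.publish("sensors/room1", message)
        }

        client.disconnect()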
  36. 36. MQTT – World usage Smart home automation, messaging. Notable mentions: IBM IoT Platform, AWS IoT, Microsoft IoT Hub, Facebook Messenger
  37. 37. Live Demo
  38. 38. IoT Simulator using MQTT The demo environment: https://github.com/lresende/bahir-iot-demo A Node.js web app that simulates elevator IoT devices, publishing metrics (weight, speed, power, temperature, system) to a Mosquitto MQTT broker. (A consuming sketch follows.)
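      A hedged sketch of how a Spark Structured Streaming job might consume the simulator's metrics; the topic name and the JSON field layout are assumptions based on the metrics listed above, not taken from the demo repository:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.from_json
        import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

        val spark = SparkSession.builder().appName("ElevatorDemo").getOrCreate()
        import spark.implicits._

        // assumed shape of each elevator metrics message
        val metricsSchema = new StructType()
          .add("weight", DoubleType)
          .add("speed", DoubleType)
          .add("power", DoubleType)
          .add("temperature", DoubleType)
          .add("system", StringType)

        val raw = spark.readStream
          .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
          .option("topic", "elevator/device-1")  // hypothetical topic
          .load("tcp://localhost:1883")          // local Mosquitto broker assumed

        // parse the JSON payload into typed columns
        val metrics = raw
          .select(from_json($"value".cast("string"), metricsSchema).as("m"))
          .select("m.*")

        metrics.writeStream.format("console").start().awaitTermination()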
  39. 39. Summary
  40. 40. Summary – Take-away points • Apache Spark – an IoT analytics runtime with support for "continuous applications" • Apache Bahir – brings access to IoT data via its supported connectors (e.g. MQTT) • IoT applications – use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming
  41. 41. Join the Apache Bahir community
  42. 42. References • Apache Bahir: http://bahir.apache.org • Documentation for Apache Spark extensions: http://bahir.apache.org/docs/spark/current/documentation/ • Source repositories: https://github.com/apache/bahir, https://github.com/apache/bahir-website • Demo repository: https://github.com/lresende/bahir-iot-demo
  43. 43. March 30, 2018 / © 2018 IBM Corporation
