Ultralight Data Movement
for IoT with SDC Edge
Presented by
Guglielmo Iozzia
Berlin, April 18th
2018
Something About Me
 Big Data Delivery Lead
at Optum (UHG)
 Previously at IBM
and FAO of the UN
 Current fields of expertise are Big Data,
ML/DL and DevOps
 Past experience in JVM languages
development (Java, Groovy, Scala), test
automation, CI/CD
Something About Me
 Author of the upcoming book
“Hands-on Deep Learning
with Apache Spark”
 I love preparing
home-made pizza
I love the Dublin Tech Hub
Agenda
 Challenges of Data Ingestion from the Edge
 Streamsets Data Collector
 Features
 Core concepts
 Streamsets Data Collector Edge
 Overview
 Demos
 SDC (RDBMS + Kafka)
 SDC Edge (Android + InfluxDB + Grafana)
 Q & A
Challenges of Data Ingestion
from the Edge
 An ever-increasing amount of data is generated every day
outside the data center or cloud.
 New scenarios (Industry 4.0, IoT, smartphones).
 It isn’t always easy to get data out of source systems or
perform analytics right where it’s generated.
 Getting data into central big data systems is an arduous
task involving a large number of disjointed, poorly
instrumented and often hand-coded technologies.
 Limited resources (memory, CPU, connectivity).
 Unexpected changes (Data Drift).
 Live management of thousands of edge pipelines:
difficult to operate at scale.
A New Way for Data Ingestion
 Reduction/elimination of redundancy:
 Data redundancy brings significant costs in terms of storage.
 It has an impact on query performance.
 It makes data scientists work harder (lots of useless info to
skip while building a model).
 Data synchronicity across different consumers.
 Control over data quality.
 The three phases of traditional ETL could happen on a
single system.
 Data Streaming
 Simpler architecture design and maintenance
What’s Streamsets Data
Collector (SDC)?
 It is a tool to design complex data flows with minimal
coding and maximum flexibility.
 It provides real-time data flow statistics and metrics for
each flow stage.
 It provides automated error handling and alerting.
 It is easy to use (drag-and-drop from a web UI).
 It ensures zero downtime when upgrading the
underlying infrastructure.
 It handles data serialization.
 It is Open Source.
SDC Use Cases
 Apache Kafka Enablement
 Connecting applications to Kafka without writing a single line
of code.
 Hadoop Ingestion
 Easy, continuous data ingestion into Hadoop and its
surrounding ecosystem.
 Cloud Migration
 Migrate data onto or across cloud providers.
 Search Enablement
 Easy population of your search solution of choice with data
from any source.
SDC Core Concepts
 Origin
 Represents the source for the pipeline.
 Processor
 It's a stage that represents a type of data processing that you
want to perform.
 Destination
 Represents the target for a pipeline.
 Executor
 It’s a stage that triggers a task when it receives an event.
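The four stage types can be pictured with a toy model. The sketch below is plain Python, not the SDC API: every name in it (run_pipeline, origin, mask_email, ListDestination, LogExecutor) is hypothetical and only illustrates how records flow from an origin through processors to a destination, and how an executor reacts to an event.

```python
# Toy model of the four SDC stage types. This is NOT the StreamSets API;
# all names are hypothetical and exist only for illustration.
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def origin() -> Iterable[Record]:
    """Origin: the source of the pipeline (here, a hard-coded batch)."""
    yield {"id": 1, "email": "ann@example.com"}
    yield {"id": 2, "email": "bob@example.com"}


def mask_email(record: Record) -> Record:
    """Processor: one type of data processing applied to each record."""
    masked = dict(record)
    masked["email"] = "***@" + str(record["email"]).split("@")[1]
    return masked


class ListDestination:
    """Destination: the target of the pipeline (here, an in-memory list)."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, record: Record) -> None:
        self.rows.append(record)


class LogExecutor:
    """Executor: triggers a task when it receives an event."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def on_event(self, event: str) -> None:
        self.log.append(f"task triggered by event: {event}")


def run_pipeline(source: Iterable[Record],
                 processors: List[Callable[[Record], Record]],
                 destination: ListDestination,
                 executor: LogExecutor) -> None:
    for record in source:
        for process in processors:
            record = process(record)
        destination.write(record)
    # Emit an event once the batch is done; the executor reacts to it.
    executor.on_event("batch-finished")


destination = ListDestination()
executor = LogExecutor()
run_pipeline(origin(), [mask_email], destination, executor)
print(destination.rows)
print(executor.log)
```

In SDC itself these stages are wired together in the web UI rather than in code; the model above only mirrors the roles described on this slide.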
SDC Origins
 Cloud platforms
 Local and remote file
systems
 HTTP and REST API
 Kafka
 Hadoop
 Relational
Databases
 MQTT
SDC Destinations
 Cloud platforms
 Local and remote FS
 Cassandra
 Kafka
 Hadoop
ecosystem
 Relational
Databases
 MQTT
 Search Engines
SDC Processors
 Field manipulators
 Lookup
 Expression evaluator
 Parsers
 Stream selector
 Script (JavaScript, Python, Jython, Groovy)
evaluators
 Spark evaluator
 Schema generator
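To make two of these processor types more concrete, here is a minimal sketch of what an expression evaluator and a stream selector do conceptually. It is plain Python, not SDC code: the function names, the record layout, and the choice to send a record to every stream whose condition matches are assumptions made for this sketch.

```python
# Toy versions of two processor types. NOT the StreamSets API; the
# function names and the record layout are assumptions for illustration.
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Condition = Callable[[Record], bool]


def expression_evaluator(record: Record) -> Record:
    """Derive a new field from an existing one (Celsius to Fahrenheit)."""
    out = dict(record)
    out["temp_f"] = record["temp_c"] * 9 / 5 + 32
    return out


def stream_selector(records: List[Record],
                    conditions: List[Tuple[str, Condition]]
                    ) -> Dict[str, List[Record]]:
    """Route each record to every stream whose condition it matches.

    Records that match no condition go to the "default" stream.
    """
    streams: Dict[str, List[Record]] = {name: [] for name, _ in conditions}
    streams["default"] = []
    for record in records:
        matched = False
        for name, condition in conditions:
            if condition(record):
                streams[name].append(record)
                matched = True
        if not matched:
            streams["default"].append(record)
    return streams


readings = [{"temp_c": 20.0}, {"temp_c": 100.0}]
enriched = [expression_evaluator(r) for r in readings]
routed = stream_selector(enriched, [("hot", lambda r: r["temp_f"] > 150)])
print(routed)
```

The same routing idea applies on the edge: records that fail a condition can be dropped or sent elsewhere before they consume bandwidth.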
SDC: other topics
 Performance
 Security
 CI/CD
 SDK
 REST API
SDC Demo
What’s SDC Edge?
 It is an ultra lightweight agent that can run pipelines
designed in SDC to ship data in and out of systems.
 It is written in Go and compiles down to a <5MB
executable that has no dependencies.
 It is Open Source.
 No dependency on external IoT Gateways.
 Can perform routing and filtering logic on edge
pipelines (architected for Edge Analytics).
 It runs natively on different platforms:
What’s SDC Edge?
 It supports leading messaging protocols including
HTTP, MQTT, CoAP, and WebSockets.
 It can detect and handle data drift.
 Multiple pipelines can run at the same time per agent.
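Data drift, in the sense of unexpected changes in incoming records, can be illustrated with a simple field-set comparison. The sketch below is an assumption-laden toy in plain Python, not how SDC Edge implements drift handling: detect_drift and the sample sensor records are hypothetical.

```python
# Toy data-drift check. NOT how SDC Edge implements it; the function
# name and the sample records are assumptions for illustration.
from typing import Dict, List, Set


def detect_drift(expected: Set[str],
                 record: Dict[str, object]) -> Dict[str, List[str]]:
    """Report fields that appeared or disappeared relative to `expected`."""
    actual = set(record)
    return {
        "added": sorted(actual - expected),
        "missing": sorted(expected - actual),
    }


expected_fields = {"device_id", "temp_c"}
ok = detect_drift(expected_fields, {"device_id": "a1", "temp_c": 21.5})
drifted = detect_drift(expected_fields,
                       {"device_id": "a1", "temperature": 21.5})
print(ok)
print(drifted)
```

A pipeline could use a report like this to route drifted records to an error stream instead of failing, which is one plausible way to "handle" drift.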
SDC Edge Use Cases
 Internet of Things (IoT)
 Reliably ingest and apply machine learning and other analytic
techniques to data aggregated from huge populations of IoT
sensors and devices.
 Cybersecurity
 Ingest and apply advanced analytics to the vast quantities of
data collected across a corporate network in order to detect
imminent threats or attacks in progress.
Simplified IoT Architecture
From
To
SDC Edge Demo
Useful Links
Streamsets Data Collector docs:
https://streamsets.com/documentation/datacollector/latest/help/#datacollecto
Streamsets Data Collector on GitHub:
https://github.com/streamsets/datacollector
Streamsets Data Collector Edge docs:
https://streamsets.com/products/sdc-edge
Streamsets Data Collector Edge on GitHub:
https://github.com/streamsets/datacollector-edge
Sdc-user Google group:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Ask Streamsets: https://ask.streamsets.com/questions/
Q & A
Wrap Up
LinkedIn: https://ie.linkedin.com/in/giozzia
Twitter: @GuglielmoIozzia
Blog: googlielmo.blogspot.com
DZone: https://dzone.com/users/2532948/virtualramblas.html
Hands-On Deep Learning with Apache Spark:
https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-apache-spark
