Ultralight Data Movement for IoT with SDC Edge

Presented by Guglielmo Iozzia
Berlin, April 18th, 2018

Something About Me
• Big Data Delivery Lead at Optum (UHG)
• Previously at IBM and FAO of the UN
• Current fields of expertise are Big Data, ML/DL and DevOps
• Past experience in JVM language development (Java, Groovy, Scala), test automation, CI/CD

Something About Me
• Author of the upcoming book “Hands-on Deep Learning with Apache Spark”
• I love preparing home-made pizza

Agenda
• Challenges of Data Ingestion from the Edge
• StreamSets Data Collector
  - Features
  - Core concepts
• StreamSets Data Collector Edge
  - Overview
• Demos
  - SDC (RDBMS + Kafka)
  - SDC Edge (Android + InfluxDB + Grafana)
• Q & A

Challenges of Data Ingestion from the Edge
• An ever-increasing amount of data is generated outside the data center or cloud.
• New scenarios (Industry 4.0, IoT, smartphones).
• It isn’t always easy to get data out of source systems or to perform analytics right where the data is generated.
• Getting data into central big data systems is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.
• Limited resources (memory, CPU, connectivity).
• Unexpected changes (Data Drift).
• Live management of thousands of edge pipelines is difficult to operate at scale.

A New Way for Data Ingestion
• Reduction/elimination of redundancy:
  - Data redundancy brings significant storage costs.
  - It has an impact on query performance.
  - It makes data scientists' work harder (lots of useless information to skip while building a model).
• Data synchronicity across different consumers.
• Control over data quality.
• The three phases of traditional ETL could happen on a single system.
• Data streaming.
• Simpler architecture design and maintenance.

What’s StreamSets Data Collector (SDC)?
• It is a tool to design complex data flows with minimal coding and maximum flexibility.
• It provides real-time data flow statistics and metrics for each flow stage.
• It provides automated error handling and alerting.
• It is easy to use (drag-and-drop from a web UI).
• It ensures zero downtime when upgrading the underlying infrastructure.
• It handles data serialization.
• It is Open Source.

SDC Use Cases
• Apache Kafka Enablement
  - Connect applications to Kafka without writing a single line of code.
• Hadoop Ingestion
  - Easy, continuous data ingestion into Hadoop and its surrounding ecosystem.
• Cloud Migration
  - Migrate data onto or across cloud providers.
• Search Enablement
  - Easily populate your search solution of choice with data from any source.

SDC Core Concepts
• Origin
  - A stage that represents the source for a pipeline.
• Processor
  - A stage that represents a type of data processing that you want to perform.
• Destination
  - A stage that represents the target for a pipeline.
• Executor
  - A stage that triggers a task when it receives an event.
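
To make the four stage types concrete, here is a minimal Python sketch of how they could relate to each other. It is not SDC code and does not use any SDC API: the names `origin`, `processor`, `Destination`, `Executor` and `run_pipeline`, the temperature fields and the `batch-written` event are all hypothetical, chosen only to mirror the definitions above.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, float]
Event = Dict[str, object]


def origin(rows: Iterable[Record]) -> Iterator[Record]:
    """Origin: the source for the pipeline (here, an in-memory list)."""
    for row in rows:
        yield dict(row)  # copy so the input rows are not mutated


def processor(records: Iterable[Record]) -> Iterator[Record]:
    """Processor: one type of data processing (here, a unit conversion)."""
    for record in records:
        record["temp_f"] = record["temp_c"] * 9 / 5 + 32
        yield record


class Destination:
    """Destination: the target for the pipeline (here, a Python list).

    After writing a batch it emits a hypothetical 'batch-written' event.
    """

    def __init__(self, on_event: Callable[[Event], None]) -> None:
        self.written: List[Record] = []
        self.on_event = on_event

    def write(self, records: Iterable[Record]) -> None:
        batch = list(records)
        self.written.extend(batch)
        self.on_event({"type": "batch-written", "count": len(batch)})


class Executor:
    """Executor: triggers a task when it receives an event."""

    def __init__(self) -> None:
        self.tasks_run: List[str] = []

    def handle(self, event: Event) -> None:
        self.tasks_run.append(f"notify: {event['count']} records written")


def run_pipeline(rows: Iterable[Record]):
    """Wire origin -> processor -> destination, with the executor on events."""
    executor = Executor()
    destination = Destination(on_event=executor.handle)
    destination.write(processor(origin(rows)))
    return destination.written, executor.tasks_run
```

Calling `run_pipeline([{"temp_c": 0.0}, {"temp_c": 100.0}])` returns the converted records together with the single task the executor ran for the batch. In SDC itself these stages are configured in the web UI rather than written by hand.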

SDC Origins
• Cloud platforms
• Local and remote file systems
• HTTP and REST API
• Kafka
• Hadoop
• Relational databases
• MQTT

SDC Destinations
• Cloud platforms
• Local and remote file systems
• Cassandra
• Kafka
• Hadoop ecosystem
• Relational databases
• MQTT
• Search engines

SDC Processors
• Field manipulators
• Lookup
• Expression evaluator
• Parsers
• Stream selector
• Script (JavaScript, Python, Jython, Groovy) evaluators
• Spark evaluator
• Schema generator
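
As an illustration of what a stream selector does conceptually, the following Python sketch routes each record to every named stream whose condition matches, and to a default stream when none match. This is an assumption made for illustration, not the SDC implementation: the function `stream_select`, the stream names and the sample sensor records are hypothetical, and the real stage is configured with conditions in the UI rather than Python callables.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def stream_select(
    records: Iterable[Record],
    streams: List[Tuple[str, Condition]],
    default_stream: str = "default",
) -> Dict[str, List[Record]]:
    """Route records to named streams based on per-stream conditions."""
    routed: Dict[str, List[Record]] = {name: [] for name, _ in streams}
    routed[default_stream] = []
    for record in records:
        matched = False
        for name, condition in streams:
            if condition(record):
                routed[name].append(record)
                matched = True
        if not matched:
            # No condition matched: the record falls through to the default.
            routed[default_stream].append(record)
    return routed


# Hypothetical sensor readings and routing rules.
readings = [
    {"sensor": "a", "temp": 85},
    {"sensor": "b", "temp": 20},
    {"sensor": "c", "temp": None},
]
rules = [
    ("hot", lambda r: r["temp"] is not None and r["temp"] > 80),
    ("missing", lambda r: r["temp"] is None),
]
routed = stream_select(readings, rules)
```

Here sensor `a` lands in the `hot` stream, sensor `c` in `missing`, and sensor `b` in the default stream. Each stream could then feed a different downstream processor or destination.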

SDC: Other Topics
• Performance
• Security
• CI/CD
• SDK
• REST API

What’s SDC Edge?
• It is an ultra-lightweight agent that can run pipelines designed in SDC to ship data in and out of systems.
• It is written in Go and compiles down to a < 5 MB executable that has no dependencies.
• It is Open Source.
• It has no dependency on external IoT gateways.
• It can perform routing and filtering logic on edge pipelines (architected for Edge Analytics).
• It runs natively on different platforms.

What’s SDC Edge?
• It supports leading messaging protocols, including HTTP, MQTT, CoAP, and WebSockets.
• It can detect and handle data drift.
• Multiple pipelines can run at the same time per agent.
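
Data drift, in the sense used above, means that the structure of incoming records changes unexpectedly. The slides do not describe how SDC Edge detects it, so the sketch below only illustrates the general idea under simple assumptions: a record is a flat dictionary, its "schema" is the set of field names and Python type names, and drift is any difference from a baseline. The helpers `infer_schema` and `detect_drift` are hypothetical names.

```python
from typing import Dict, List

Schema = Dict[str, str]


def infer_schema(record: Dict[str, object]) -> Schema:
    """Map each field name to the name of its Python type."""
    return {name: type(value).__name__ for name, value in record.items()}


def detect_drift(baseline: Schema, record: Dict[str, object]) -> List[str]:
    """Describe how a record's structure differs from the baseline schema."""
    current = infer_schema(record)
    changes: List[str] = []
    for name in sorted(current.keys() - baseline.keys()):
        changes.append(f"added field '{name}' ({current[name]})")
    for name in sorted(baseline.keys() - current.keys()):
        changes.append(f"removed field '{name}'")
    for name in sorted(current.keys() & baseline.keys()):
        if current[name] != baseline[name]:
            changes.append(
                f"field '{name}' changed type: {baseline[name]} -> {current[name]}"
            )
    return changes


# Baseline taken from the first record seen from a hypothetical device.
baseline = infer_schema({"device": "d1", "temp": 21.5})
drift = detect_drift(baseline, {"device": "d1", "temp": "21.5", "humidity": 40})
```

With this input, `drift` reports one added field (`humidity`) and one type change (`temp` from `float` to `str`). A pipeline could use such a report to raise an alert or route the record aside instead of failing.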

SDC Edge Use Cases
• Internet of Things (IoT)
  - Reliably ingest and apply machine learning and other analytic techniques to data aggregated from huge populations of IoT sensors and devices.
• Cybersecurity
  - Ingest and apply advanced analytics to the vast quantities of data collected across a corporate network in order to detect imminent threats or attacks in progress.

Simplified IoT Architecture
[Diagrams not preserved: a "From" architecture and a "To" architecture.]

Useful Links
StreamSets Data Collector docs:
https://streamsets.com/documentation/datacollector/latest/help/#datacollecto
StreamSets Data Collector on GitHub:
https://github.com/streamsets/datacollector
StreamSets Data Collector Edge docs:
https://streamsets.com/products/sdc-edge
StreamSets Data Collector Edge on GitHub:
https://github.com/streamsets/datacollector-edge
sdc-user Google group:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Ask StreamSets: https://ask.streamsets.com/questions/

Wrap Up
LinkedIn: https://ie.linkedin.com/in/giozzia
Twitter: @GuglielmoIozzia
Blog: googlielmo.blogspot.com
DZone: https://dzone.com/users/2532948/virtualramblas.html
Hands-On Deep Learning with Apache Spark:
https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-apache-spark
