Edge computing and the Internet of Things hold great promise, but often just getting data from the edge requires moving mountains. Let's learn how to make edge data ingestion and analytics easier using StreamSets Data Collector Edge (SDC Edge), an ultralight, platform-independent, small-footprint open source solution written in Go for streaming data from resource-constrained sensors and personal devices (such as medical equipment or smartphones) to Apache Kafka, Amazon Kinesis and many other destinations. This talk includes an overview of SDC Edge's main features, supported protocols and available processors for data transformation; insights into how it solves some challenges of traditional approaches to data ingestion; pipeline design basics; a walk-through of some practical applications (Android devices and Raspberry Pi); and its integration with other technologies such as StreamSets Data Collector, Apache Kafka, Apache Hadoop, InfluxDB and Grafana. The goal is to get attendees ready to quickly become IoT data intake and SDC Edge ninjas.
Speaker
Guglielmo Iozzia, Big Data Delivery Manager, Optum (United Health)
2. Something About Me
Big Data Delivery Lead at Optum (UHG)
Previously at IBM and FAO of the UN
Current fields of expertise are Big Data, ML/DL and DevOps
Past experience in JVM languages development (Java, Groovy, Scala), test automation, CI/CD
3. Something About Me
Author of the upcoming book “Hands-on Deep Learning with Apache Spark”
I love preparing home-made pizza
5. Agenda
Challenges of Data Ingestion from the Edge
StreamSets Data Collector
Features
Core concepts
StreamSets Data Collector Edge
Overview
Demos
SDC (RDBMS + Kafka)
SDC Edge (Android + InfluxDB + Grafana)
Q & A
6. Challenges of Data Ingestion from the Edge
An ever-increasing amount of data is generated outside the data center or cloud.
New scenarios (Industry 4.0, IoT, smartphones).
It isn’t always easy to get data out of source systems or
perform analytics right where it’s generated.
Getting data into central big data systems is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.
Limited resources (memory, CPU, connectivity).
Unexpected changes (Data Drift).
Live management of thousands of edge pipelines:
difficult to operate at scale.
7. A New Way for Data Ingestion
Reduction/elimination of redundancy:
Data redundancy brings significant costs in terms of storage.
It has an impact on query performance.
It makes the data scientists' work harder (lots of useless info to skip while building a model).
Data synchronicity across different consumers.
Control over data quality.
The three phases of traditional ETL could happen on a
single system.
Data Streaming
Simpler architecture design and maintenance
8. What’s StreamSets Data Collector (SDC)?
It is a tool to design complex data flows with minimal coding and maximum flexibility.
It provides real-time data flow statistics and metrics for
each flow stage.
It provides automated error handling and alerting.
It is easy to use (drag-and-drop from a web UI).
It ensures zero downtime when upgrading the underlying infrastructure.
It handles data serialization.
It is Open Source.
9. SDC Use Cases
Apache Kafka Enablement
Connecting applications to Kafka without writing a single line
of code.
Hadoop Ingestion
Easy, continuous data ingestion into Hadoop and its surrounding ecosystem.
Cloud Migration
Migrate data onto or across cloud providers.
Search Enablement
Easy population of your search solution of choice with data
from any source.
10. SDC Core Concepts
Origin
Represents the source for the pipeline.
Processor
It's a stage that represents a type of data processing that you
want to perform.
Destination
Represents the target for a pipeline.
Executor
It’s a stage that triggers a task when it receives an event.
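
To make the four stage types concrete, here is a minimal sketch in plain Python. It is not SDC code and does not use any StreamSets API: the function names, the record fields (`device`, `temp_c`, `temp_f`) and the event shape are all hypothetical, chosen only to show how an origin, a processor, a destination and an executor relate to each other.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def origin(readings: Iterable[Record]) -> Iterator[Record]:
    """Origin: the source of the pipeline (here, an in-memory list)."""
    for record in readings:
        yield record


def processor(records: Iterable[Record]) -> Iterator[Record]:
    """Processor: drop incomplete records and add a derived field."""
    for record in records:
        if record.get("temp_c") is None:
            continue  # filter out records without a temperature reading
        enriched = dict(record)
        enriched["temp_f"] = record["temp_c"] * 9 / 5 + 32
        yield enriched


def destination(
    records: Iterable[Record],
    sink: List[Record],
    on_event: Callable[[Record], None],
) -> None:
    """Destination: write records to the target, then emit one event."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    # The event is what an executor stage would react to.
    on_event({"type": "batch-written", "count": count})


def run_pipeline(
    readings: Iterable[Record],
    sink: List[Record],
    executor: Callable[[Record], None],
) -> None:
    """Wire the stages together: origin -> processor -> destination."""
    destination(processor(origin(readings)), sink, executor)


if __name__ == "__main__":
    sink: List[Record] = []
    events: List[Record] = []
    run_pipeline(
        [{"device": "pi-1", "temp_c": 20.0}, {"device": "pi-2", "temp_c": None}],
        sink,
        events.append,  # hypothetical executor: just records the event
    )
    print(sink)
    print(events)
```

In SDC itself these stages are configured in the web UI rather than coded by hand; the sketch only mirrors the roles described on this slide.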
11. SDC Origins
Cloud platforms
Local and remote file
systems
HTTP and REST API
Kafka
Hadoop
Relational
Databases
MQTT
16. What’s SDC Edge?
It is an ultra lightweight agent that can run pipelines
designed in SDC to ship data in and out of systems.
It is written in Go and compiles down to a <5MB
executable that has no dependencies.
It is Open Source.
No dependency on external IoT Gateways.
Can perform routing and filtering logic on edge
pipelines (architected for Edge Analytics).
It runs natively on different platforms:
17. What’s SDC Edge?
It supports leading messaging protocols including
HTTP, MQTT, CoAP, and WebSockets.
It can detect and handle data drift.
Multiple pipelines can run at the same time per agent.
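
As a rough illustration of what detecting data drift can mean in practice, the sketch below compares the field names of each incoming record with those of the previous one and reports what was added or removed. This is an assumption-laden toy, not how SDC Edge implements drift handling: the `DriftDetector` class and the sample field names are hypothetical.

```python
from typing import Any, Dict, Optional, Set, Tuple

Record = Dict[str, Any]


class DriftDetector:
    """Tracks the set of field names seen in the most recent record."""

    def __init__(self) -> None:
        self.known_fields: Optional[Set[str]] = None

    def check(self, record: Record) -> Tuple[Set[str], Set[str]]:
        """Return (added, removed) field names relative to the previous record."""
        fields = set(record)
        if self.known_fields is None:
            # First record: nothing to compare against yet.
            self.known_fields = fields
            return set(), set()
        added = fields - self.known_fields
        removed = self.known_fields - fields
        self.known_fields = fields
        return added, removed


if __name__ == "__main__":
    detector = DriftDetector()
    stream = [
        {"id": 1, "temp": 20},
        {"id": 2, "temp": 21, "humidity": 40},  # a new field appears
        {"id": 3, "humidity": 41},              # an existing field disappears
    ]
    for record in stream:
        added, removed = detector.check(record)
        if added or removed:
            print(f"drift on record {record['id']}: "
                  f"added={sorted(added)} removed={sorted(removed)}")
```

A pipeline that tolerates drift would typically keep processing such records and raise an alert, rather than fail on the first unexpected field.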
18. SDC Edge Use Cases
Internet of Things (IoT)
Reliably ingest and apply machine learning and other analytic
techniques to data aggregated from huge populations of IoT
sensors and devices.
Cybersecurity
Ingest and apply advanced analytics to the vast quantities of
data collected across a corporate network in order to detect
imminent threats or attacks in progress.
21. Useful Links
StreamSets Data Collector docs:
https://streamsets.com/documentation/datacollector/latest/help/#datacollecto
StreamSets Data Collector on GitHub:
https://github.com/streamsets/datacollector
StreamSets Data Collector Edge docs:
https://streamsets.com/products/sdc-edge
StreamSets Data Collector Edge on GitHub:
https://github.com/streamsets/datacollector-edge
sdc-user Google group:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Ask StreamSets: https://ask.streamsets.com/questions/