SlideShare a Scribd company logo
1 of 23
Ultralight Data Movement
for IoT with SDC Edge
Presented by
Guglielmo Iozzia
Berlin, April 18th
2018
Something About Me
 Big Data Delivery Lead
at Optum (UHG)
 Previously at IBM
and FAO of the UN
 Current fields of expertise are Big Data,
ML/DL and DevOps
 Past experience in JVM languages
development (Java, Groovy, Scala), test
automation, CI/CD
Something About Me
 Author of the upcoming book
“Hands-on Deep Learning
with Apache Spark”
 I love preparing
home-made pizza
I the Dublin Tech Hub
Agenda
 Challenges of Data Ingestion from the Edge
 Streamsets Data Collector
 Features
 Core concepts
 Streamsets Data Collector Edge
 Overview
 Demos
 SDC (RDBMS + Kafka)
 SDC Edge (Android + InfluxDb + Grafana)
 Q & A
Challenges of Data Ingestion
from the Edge
 Every day increasing amount of data being generated
from outside the data center or cloud.
 New scenarios (Industry 4.0, IoT, smartphones).
 It isn’t always easy to get data out of source systems or
perform analytics right where it’s generated.
 Getting data into central big data systems is an arduous
task involving a large number of disjointed, poorly
instrumented and often hand coded technologies.
 Limited resources (memory, CPU, connectivity).
 Unexpected changes (Data Drift).
 Live management of thousands of edge pipelines:
difficult to operate at scale.
A New Way for Data Ingestion
 Reduction/elimination of redundancy:
 Data redundancy brings significant costs in terms of storage.
 It has impact on query performance.
 It makes the data scientists work harder (lots of useless info to
skip while building a model).
 Data synchronicity across different consumers.
 Control over data quality.
 The three phases of traditional ETL could happen on a
single system.
 Data Streaming
 Simpler architecture design and maintenance
What’s Streamsets Data
Collector (SDC)?
 It is a tool to design complex data flows with minimal
coding and the maximum flexibility.
 It provides real-time data flow statistics and metrics for
each flow stage.
 It provides automated error handling and alerting.
 It is easy to use (drag-and-drop from a web UI).
 It ensures zero-downtime when upgrading the
underlying infrastructure.
 It handles data serialization.
 It is Open Source.
SDC Use Cases
 Apache Kafka Enablement
 Connecting applications to Kafka without writing a single line
of code.
 Hadoop Ingestion
 Easy continuously data ingestion into Hadoop and its
surrounding ecosystem.
 Cloud Migration
 Data migrate onto or across cloud providers.
 Search Enablement
 Easy population of your search solution of choice with data
from any source.
SDC Core Concepts
 Origin
 Represents the source for the pipeline.
 Processor
 It's a stage that represents a type of data processing that you
want to perform.
 Destination
 Represents the target for a pipeline.
 Executor
 It’s a stage that triggers a task when it receives an event.
SDC Origins
 Cloud platforms
 Local and remote file
systems
 HTTP and REST API
 Kafka
 Hadoop
 Relational
Databases
 MQTT
SDC Destinations
 Cloud platforms
 Local and remote FS
 Cassandra
 Kafka
 Hadoop
eco-system
 Relational
Databases
 MQTT
 Search Engines
SDC Processors
 Field manipulators
 Lookup
 Expression evaluator
 Parsers
 Stream selector
 Script (JavaScript, Python, Jython, Groovy)
evaluators
 Spark evaluator
 Schema generator
SDC: other topics
 Performance
 Security
 CI/CD
 SDK
 REST API
SDC Demo
What’s SDC Edge?
 It is an ultra lightweight agent that can run pipelines
designed in SDC to ship data in and out of systems.
 It is written in Go and compiles down to a <5MB
executable that has no dependencies.
 It is Open Source.
 No dependency on external IoT Gateways.
 Can perform routing and filtering logic on edge
pipelines (architected for Edge Analytics).
 It runs natively on different platforms:
What’s SDC Edge?
 It supports leading messaging protocols including
HTTP, MQTT, CoAP, and WebSockets.
 It can Detect and handle data drift.
 Multiple pipelines can run at the same time per agent.
SDC Edge Use Cases
 Internet of Things (IoT)
 Reliably ingest and apply machine learning and other analytic
techniques to data aggregated from huge populations of IoT
sensors and devices.
 Cybersecurity
 Ingest and apply advanced analytics to the vast quantities of
data collected across a corporate network in order to detect
imminent threats or attacks in progress.
Simplified IoT Architecture
From
To
SDC Edge Demo
Useful Links
Streamsets Data Collector docs:
https://streamsets.com/documentation/datacollector/latest/help/#datacollecto
Streamsets Data Collector on GitHub:
https://github.com/streamsets/datacollector
Streamsets Data Collector Edge docs:
https://streamsets.com/products/sdc-edge
Streamsets Data Collector Edge on GitHub:
https://github.com/streamsets/datacollector-edge
Sdc-user Google group:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Ask Streamsets: https://ask.streamsets.com/questions/
Q & A
Wrap Up
Linkedin: https://ie.linkedin.com/in/giozzia
Twitter: @GuglielmoIozzia
Blog: googlielmo.blogspot.com
DZone: https://dzone.com/users/2532948/virtualramblas.html
Hands-On Deep Learning with Apache Spark:
https://www.packtpub.com/big-data-and-business-intelligence/hands-
deep-learning-apache-spark

More Related Content

What's hot

Breaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIBreaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIDataWorks Summit
 
How big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the doorHow big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the doorDataWorks Summit
 
Securing and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industrySecuring and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industryDataWorks Summit
 
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...DataWorks Summit
 
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...Promote the Good of the People of the United Kingdom by Maintaining Monetary ...
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...DataWorks Summit
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...DataWorks Summit
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSDataWorks Summit
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...DataWorks Summit
 
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...DataWorks Summit
 
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsVerizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsDataWorks Summit
 
Synchronicity of a distributed financial system
Synchronicity of a distributed financial systemSynchronicity of a distributed financial system
Synchronicity of a distributed financial systemDataWorks Summit
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationDataWorks Summit
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...DataWorks Summit
 
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...Data Con LA
 

What's hot (20)

Breaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIBreaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AI
 
How big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the doorHow big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the door
 
Securing and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industrySecuring and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industry
 
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...
Data Acquisition Automation for NiFi in a Hybrid Cloud environment – the Path...
 
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...Promote the Good of the People of the United Kingdom by Maintaining Monetary ...
Promote the Good of the People of the United Kingdom by Maintaining Monetary ...
 
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
Running Enterprise Workloads with an open source Hybrid Cloud Data Architectu...
 
HDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFSHDFS tiered storage: mounting object stores in HDFS
HDFS tiered storage: mounting object stores in HDFS
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
 
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsVerizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
 
Synchronicity of a distributed financial system
Synchronicity of a distributed financial systemSynchronicity of a distributed financial system
Synchronicity of a distributed financial system
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 
The Power of Data
The Power of DataThe Power of Data
The Power of Data
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...
Big Data Day LA 2016/ Use Case Driven track - Reliable Media Reporting in an ...
 

Similar to Ultralight Data Movement for IoT with SDC Edge

Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumUltralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumData Driven Innovation
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Mark Goldstein
 
Role of cloud and analytics in IoT
Role of cloud and analytics in IoTRole of cloud and analytics in IoT
Role of cloud and analytics in IoTSelvaraj Kesavan
 
IT TRENDS AND PERSPECTIVES 2016
IT TRENDS AND PERSPECTIVES 2016IT TRENDS AND PERSPECTIVES 2016
IT TRENDS AND PERSPECTIVES 2016Vaidheswaran CS
 
Course Notes-Unit 5.ppt
Course Notes-Unit 5.pptCourse Notes-Unit 5.ppt
Course Notes-Unit 5.pptSafaM3
 
Platform Deep Dive
Platform Deep DivePlatform Deep Dive
Platform Deep DiveConrad23
 
Dell Digital Transformation Through AI and Data Analytics Webinar
Dell Digital Transformation Through AI and  Data Analytics WebinarDell Digital Transformation Through AI and  Data Analytics Webinar
Dell Digital Transformation Through AI and Data Analytics WebinarBill Wong
 
Open Source Edge Computing Platforms - Overview
Open Source Edge Computing Platforms - OverviewOpen Source Edge Computing Platforms - Overview
Open Source Edge Computing Platforms - OverviewKrishna-Kumar
 
SolidSource Portfolio
SolidSource PortfolioSolidSource Portfolio
SolidSource PortfolioLucian Voinea
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 
Confluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & LearnConfluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & Learnconfluent
 
Building a reliable and scalable IoT platform with MongoDB and HiveMQ
Building a reliable and scalable IoT platform with MongoDB and HiveMQBuilding a reliable and scalable IoT platform with MongoDB and HiveMQ
Building a reliable and scalable IoT platform with MongoDB and HiveMQDominik Obermaier
 
Azure IoT services - overview, SenZations 2015
Azure IoT services - overview, SenZations 2015Azure IoT services - overview, SenZations 2015
Azure IoT services - overview, SenZations 2015SenZations Summer School
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresIvo Andreev
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motionconfluent
 
Amsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & TomorrowAmsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & TomorrowNeo4j
 

Similar to Ultralight Data Movement for IoT with SDC Edge (20)

Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumUltralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
 
Designing Internet of things
Designing Internet of thingsDesigning Internet of things
Designing Internet of things
 
Role of cloud and analytics in IoT
Role of cloud and analytics in IoTRole of cloud and analytics in IoT
Role of cloud and analytics in IoT
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
IT TRENDS AND PERSPECTIVES 2016
IT TRENDS AND PERSPECTIVES 2016IT TRENDS AND PERSPECTIVES 2016
IT TRENDS AND PERSPECTIVES 2016
 
Course Notes-Unit 5.ppt
Course Notes-Unit 5.pptCourse Notes-Unit 5.ppt
Course Notes-Unit 5.ppt
 
Platform Deep Dive
Platform Deep DivePlatform Deep Dive
Platform Deep Dive
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Dell Digital Transformation Through AI and Data Analytics Webinar
Dell Digital Transformation Through AI and  Data Analytics WebinarDell Digital Transformation Through AI and  Data Analytics Webinar
Dell Digital Transformation Through AI and Data Analytics Webinar
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Open Source Edge Computing Platforms - Overview
Open Source Edge Computing Platforms - OverviewOpen Source Edge Computing Platforms - Overview
Open Source Edge Computing Platforms - Overview
 
SolidSource Portfolio
SolidSource PortfolioSolidSource Portfolio
SolidSource Portfolio
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 
Confluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & LearnConfluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & Learn
 
Building a reliable and scalable IoT platform with MongoDB and HiveMQ
Building a reliable and scalable IoT platform with MongoDB and HiveMQBuilding a reliable and scalable IoT platform with MongoDB and HiveMQ
Building a reliable and scalable IoT platform with MongoDB and HiveMQ
 
Azure IoT services - overview, SenZations 2015
Azure IoT services - overview, SenZations 2015Azure IoT services - overview, SenZations 2015
Azure IoT services - overview, SenZations 2015
 
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest FiresAdvanced Open IoT Platform for Prevention and Early Detection of Forest Fires
Advanced Open IoT Platform for Prevention and Early Detection of Forest Fires
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 
Amsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & TomorrowAmsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & Tomorrow
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 

Recently uploaded

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Ultralight Data Movement for IoT with SDC Edge

  • 1. Ultralight Data Movement for IoT with SDC Edge Presented by Guglielmo Iozzia Berlin, April 18th 2018
  • 2. Something About Me  Big Data Delivery Lead at Optum (UHG)  Previously at IBM and FAO of the UN  Current fields of expertise are Big Data, ML/DL and DevOps  Past experience in JVM languages development (Java, Groovy, Scala), test automation, CI/CD
  • 3. Something About Me  Author of the upcoming book “Hands-on Deep Learning with Apache Spark”  I love preparing home-made pizza
  • 4. I the Dublin Tech Hub
  • 5. Agenda  Challenges of Data Ingestion from the Edge  Streamsets Data Collector  Features  Core concepts  Streamsets Data Collector Edge  Overview  Demos  SDC (RDBMS + Kafka)  SDC Edge (Android + InfluxDb + Grafana)  Q & A
  • 6. Challenges of Data Ingestion from the Edge  Every day increasing amount of data being generated from outside the data center or cloud.  New scenarios (Industry 4.0, IoT, smartphones).  It isn’t always easy to get data out of source systems or perform analytics right where it’s generated.  Getting data into central big data systems is an arduous task involving a large number of disjointed, poorly instrumented and often hand coded technologies.  Limited resources (memory, CPU, connectivity).  Unexpected changes (Data Drift).  Live management of thousands of edge pipelines: difficult to operate at scale.
  • 7. A New Way for Data Ingestion  Reduction/elimination of redundancy:  Data redundancy brings significant costs in terms of storage.  It has impact on query performance.  It makes the data scientists work harder (lots of useless info to skip while building a model).  Data synchronicity across different consumers.  Control over data quality.  The three phases of traditional ETL could happen on a single system.  Data Streaming  Simpler architecture design and maintenance
  • 8. What’s Streamsets Data Collector (SDC)?  It is a tool to design complex data flows with minimal coding and the maximum flexibility.  It provides real-time data flow statistics and metrics for each flow stage.  It provides automated error handling and alerting.  It is easy to use (drag-and-drop from a web UI).  It ensures zero-downtime when upgrading the underlying infrastructure.  It handles data serialization.  It is Open Source.
  • 9. SDC Use Cases  Apache Kafka Enablement  Connecting applications to Kafka without writing a single line of code.  Hadoop Ingestion  Easy continuously data ingestion into Hadoop and its surrounding ecosystem.  Cloud Migration  Data migrate onto or across cloud providers.  Search Enablement  Easy population of your search solution of choice with data from any source.
  • 10. SDC Core Concepts  Origin  Represents the source for the pipeline.  Processor  It's a stage that represents a type of data processing that you want to perform.  Destination  Represents the target for a pipeline.  Executor  It’s a stage that triggers a task when it receives an event.
  • 11. SDC Origins  Cloud platforms  Local and remote file systems  HTTP and REST API  Kafka  Hadoop  Relational Databases  MQTT
  • 12. SDC Destinations  Cloud platforms  Local and remote FS  Cassandra  Kafka  Hadoop eco-system  Relational Databases  MQTT  Search Engines
  • 13. SDC Processors  Field manipulators  Lookup  Expression evaluator  Parsers  Stream selector  Script (JavaScript, Python, Jython, Groovy) evaluators  Spark evaluator  Schema generator
  • 14. SDC: other topics  Performance  Security  CI/CD  SDK  REST API
  • 16. What’s SDC Edge?  It is an ultra lightweight agent that can run pipelines designed in SDC to ship data in and out of systems.  It is written in Go and compiles down to a <5MB executable that has no dependencies.  It is Open Source.  No dependency on external IoT Gateways.  Can perform routing and filtering logic on edge pipelines (architected for Edge Analytics).  It runs natively on different platforms:
  • 17. What’s SDC Edge?  It supports leading messaging protocols including HTTP, MQTT, CoAP, and WebSockets.  It can Detect and handle data drift.  Multiple pipelines can run at the same time per agent.
  • 18. SDC Edge Use Cases  Internet of Things (IoT)  Reliably ingest and apply machine learning and other analytic techniques to data aggregated from huge populations of IoT sensors and devices.  Cybersecurity  Ingest and apply advanced analytics to the vast quantities of data collected across a corporate network in order to detect imminent threats or attacks in progress.
  • 21. Useful Links Streamsets Data Collector docs: https://streamsets.com/documentation/datacollector/latest/help/#datacollecto Streamsets Data Collector on GitHub: https://github.com/streamsets/datacollector Streamsets Data Collector Edge docs: https://streamsets.com/products/sdc-edge Streamsets Data Collector Edge on GitHub: https://github.com/streamsets/datacollector-edge Sdc-user Google group: https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user Ask Streamsets: https://ask.streamsets.com/questions/
  • 22. Q & A
  • 23. Wrap Up Linkedin: https://ie.linkedin.com/in/giozzia Twitter: @GuglielmoIozzia Blog: googlielmo.blogspot.com DZone: https://dzone.com/users/2532948/virtualramblas.html Hands-On Deep Learning with Apache Spark: https://www.packtpub.com/big-data-and-business-intelligence/hands- deep-learning-apache-spark