BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Reliable Data Ingestion in Big Data/IoT
Guido Schmutz
@gschmutz
Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-author of several books
Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and
technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
O P E R A T I O N – Trivadis Services takes over the operation of your IT systems.
[Map: 14 Trivadis branches – Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich – with over 600 specialists and IT experts in your region.]
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers
Technology on its own won't help you.
You need to know how to use it properly.
Introduction
Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? – Big Data + Real-Time = Stream Processing
Ever-increasing volume and velocity – the Internet of Things (IoT) wave
Internet of Things (IoT): Enabling
communication between devices,
people & processes to exchange
useful information & knowledge
that create value for humans
The term was first proposed by Kevin Ashton in 1999
Source: The Economist
Source: Ericsson, June 2016
What is Data Ingestion?
Acquiring data as it is produced from Data Source(s)
Transforming into a consumable form
Delivering the transformed data to the consuming system(s)
The challenge: Doing this continuously and at scale across a wide variety of
sources and consuming systems
Ingress and egress are two other terms referring to data movement into and out
of a system
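The three steps above can be sketched as a tiny pipeline; the CSV-like source, field names and in-memory sink are illustrative stand-ins, not part of any specific product:

```python
# Minimal sketch of the three ingestion steps: acquire, transform, deliver.
import json
from typing import Iterable, Iterator

def acquire(source: Iterable[str]) -> Iterator[dict]:
    """Acquire raw records as they are produced (here: CSV-like lines)."""
    for line in source:
        device, value = line.strip().split(",")
        yield {"device": device, "value": value}

def transform(records: Iterator[dict]) -> Iterator[dict]:
    """Transform into a consumable form (typed values)."""
    for rec in records:
        yield {"device": rec["device"], "value": float(rec["value"])}

def deliver(records: Iterator[dict], sink: list) -> None:
    """Deliver to the consuming system (here: an in-memory sink)."""
    for rec in records:
        sink.append(json.dumps(rec))

raw = ["sensor-1,21.5", "sensor-2,19.0"]
sink: list = []
deliver(transform(acquire(raw)), sink)
print(sink[0])  # {"device": "sensor-1", "value": 21.5}
```

Doing this once is easy; the hard part, as stated above, is doing it continuously, at scale, across many sources and consumers.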
Lambda Architecture for Big Data

[Diagram: event sources (Location, Social, Clickstream, Sensor Data, Weather Data, Mobile Apps) and enterprise sources (Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, via SQL Import) flow through Event Hubs into a Hadoop cluster, where Batch Analytics (Distributed Filesystem, Parallel Processing, NoSQL) and Streaming Analytics (Stream Analytics, NoSQL, Reference / Models) run side by side; results are exposed via SQL and Search to Dashboards, BI Tools, the Enterprise Data Warehouse and Online & Mobile Apps.]
[The same Lambda Architecture diagram, overlaid with the three ingestion stages: Integrate, Sanitize / Normalize, Deliver.]
Continuous Ingestion – DataFlow Pipelines

[Diagram: IoT sensors reach the Event Hub through IoT gateways (MQTT broker), Dataflow gateways (REST) or natively; database sources through CDC gateways (log-based CDC) and Connect; file sources through log tailing via a Dataflow gateway; social feeds natively or via a Messaging gateway (queue). All routes land in Event Hub topics, from which Stream Processing and Big Data systems consume.]
DataFlow Pipeline
• Flow-based "programming"
• Ingest Data from various sources
• Extract – Transform – Load
• High-Throughput, straight-through
data flows
• Data Lineage
• Batch- or Stream-Processing
• Visual coding with flow editor
• Event Stream Processing (ESP) but
not Complex Event Processing (CEP)
Source: Confluent
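The flow-based idea, records streamed straight through a chain of small processors, can be illustrated with a toy pipeline; stage names and fields are made up for the example and do not follow any product's API:

```python
# Toy dataflow: records stream through a chain of small processors;
# each processor sees one record at a time and passes it on.
from functools import reduce
from typing import Callable, Iterator

Processor = Callable[[Iterator[dict]], Iterator[dict]]

def parse(records):
    for r in records:
        yield {**r, "value": float(r["value"])}   # type conversion

def filter_valid(records):
    for r in records:
        if r["value"] >= 0:                        # drop bad records
            yield r

def tag(records):
    for r in records:
        yield {**r, "source": "gateway-1"}         # enrichment

def pipeline(*stages: Processor) -> Processor:
    """Compose processors into one straight-through flow."""
    return lambda records: reduce(lambda acc, s: s(acc), stages, records)

flow = pipeline(parse, filter_valid, tag)
out = list(flow(iter([{"value": "3.2"}, {"value": "-1"}])))
print(out)  # [{'value': 3.2, 'source': 'gateway-1'}]
```

Visual flow editors like the ones discussed later essentially let you wire such stages together graphically instead of in code.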
Continuous Ingestion – Integrating data sources
SQL Polling
Change Data Capture (CDC)
File Stream (File Tailing)
File Stream (Appender)
Sensor Stream
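Of these patterns, file tailing is the easiest to sketch: remember the byte offset already consumed and emit only what was appended since the last poll. A minimal illustration, not a production tailer:

```python
# Sketch of the "File Stream (File Tailing)" pattern.
import os
import tempfile

def tail_new_lines(path: str, offset: int):
    """Return (new_lines, new_offset) for content appended after `offset`."""
    with open(path, "r") as f:
        f.seek(offset)
        lines = f.read().splitlines()
        return lines, f.tell()

path = os.path.join(tempfile.mkdtemp(), "app.log")
with open(path, "w") as f:
    f.write("line-1\n")
lines, pos = tail_new_lines(path, 0)      # first poll reads line-1
with open(path, "a") as f:
    f.write("line-2\n")
lines2, pos = tail_new_lines(path, pos)   # second poll sees only line-2
print(lines, lines2)
```

SQL polling and CDC follow the same principle with a different bookmark: a timestamp or incrementing column for polling, a transaction-log position for CDC.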
Ingestion with/without Transformation?
Zero Transformation
• No transformation, plain ingest, no
schema validation
• Keep the original format – Text,
CSV, …
• Allows storing data that may have
schema errors
Format Transformation
• Better named "Format Translation"
• Simply changes the format, e.g. from Text to Avro
• Does schema validation
Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to
another and add it to the message
Value Transformation
• Replaces values in the message
• Convert a value from one system to
another and change the value in-place
• Destroys the raw data!
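The difference between enrichment and value transformation can be shown on a toy message; the field names are invented for the example:

```python
# Enrichment vs. value transformation on a toy sensor message.

def enrich(msg: dict) -> dict:
    """Enrichment: add the converted value, keep the raw field untouched."""
    return {**msg, "temp_f": msg["temp_c"] * 9 / 5 + 32}

def value_transform(msg: dict) -> dict:
    """Value transformation: convert in place - the raw Celsius value is lost."""
    out = dict(msg)
    out["temp_c"] = msg["temp_c"] * 9 / 5 + 32   # field now holds Fahrenheit
    return out

msg = {"device": "s1", "temp_c": 20.0}
e = enrich(msg)
v = value_transform(msg)
print(e)  # {'device': 's1', 'temp_c': 20.0, 'temp_f': 68.0}
print(v)  # {'device': 's1', 'temp_c': 68.0} - raw data destroyed
```

This is why value transformation should be used with care during ingestion: once the raw value is overwritten, reprocessing from the original data is no longer possible.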
Challenges
Why is Data Ingestion Difficult?
Physical and Logical
Infrastructure changes
rapidly
Key Challenges:
Infrastructure Automation
Edge Deployment
Infrastructure Drift
Data Structures and
formats evolve and change
unexpectedly
Key Challenges:
Consumption Readiness
Corruption and Loss
Structure Drift
Data semantics change
with evolving applications
Key Challenges:
Timely Intervention
System Consistency
Semantic Drift
Source: Streamsets
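One common defensive tactic against structure drift (an illustration, not from the talk) is to validate only the fields a consumer actually needs and to quarantine drifted records instead of failing the whole pipeline:

```python
# Tolerant routing: required fields are checked, unknown extra fields are
# kept, and records missing required fields go to a quarantine for review.
# Field names are made up for the example.

REQUIRED = {"device", "value"}

def route(record: dict, good: list, quarantine: list) -> None:
    if REQUIRED <= record.keys():
        good.append(record)          # consumable as-is, extra fields kept
    else:
        quarantine.append(record)    # structure drifted; inspect later

good, quarantine = [], []
route({"device": "s1", "value": 1.0, "fw": "2.1"}, good, quarantine)  # new field: ok
route({"device": "s2"}, good, quarantine)                             # missing value
print(len(good), len(quarantine))  # 1 1
```

The quarantine stream gives operators the "timely intervention" point mentioned above without stopping ingestion for well-formed records.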
Challenges for Ingesting Sensor Data
Multitude of sensors
Real-Time Streaming
Multiple Firmware versions
Bad Data from damaged sensors
Regulatory Constraints
Data Quality
Source: Cloudera
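Bad data from damaged sensors is often screened with a plausibility check before ingestion; the valid range below is an assumed, sensor-specific configuration, not a universal constant:

```python
# Screening implausible readings from damaged sensors before ingestion.

PLAUSIBLE = {"temperature": (-40.0, 85.0)}   # assumed per-sensor spec

def is_plausible(kind: str, value: float) -> bool:
    lo, hi = PLAUSIBLE[kind]
    return lo <= value <= hi

readings = [("temperature", 21.3), ("temperature", 999.0)]  # 999.0: stuck sensor
clean = [r for r in readings if is_plausible(*r)]
print(clean)  # [('temperature', 21.3)]
```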
Key Elements of Data Ingestion
Idempotence
Batching (Bulk)
Data Transformation
Compression
Availability and Recoverability
Reliable Data Transfer and Data
Validation
Resource Consumption
Performance
Monitoring
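Two of these elements, idempotence and batching, can be sketched together: a sink that deduplicates by message id, so redelivery after a retry is harmless, and flushes records in bulk. All names are illustrative:

```python
# Idempotent, batching sink: duplicates are ignored, writes happen in bulk.

class IdempotentBatchSink:
    def __init__(self, batch_size: int = 2):
        self.batch_size = batch_size
        self.seen: set = set()
        self.buffer: list = []
        self.stored: list = []

    def write(self, msg_id: str, payload: str) -> None:
        if msg_id in self.seen:          # duplicate delivery: ignore
            return
        self.seen.add(msg_id)
        self.buffer.append((msg_id, payload))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:             # one bulk write per batch
        self.stored.extend(self.buffer)
        self.buffer.clear()

sink = IdempotentBatchSink()
for mid, p in [("m1", "a"), ("m1", "a"), ("m2", "b")]:   # m1 redelivered
    sink.write(mid, p)
print(sink.stored)  # [('m1', 'a'), ('m2', 'b')]
```

In a real system the `seen` set would have to be bounded (e.g. a time window) and the batch flushed on a timer as well, but the principle is the same.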
Implementing Event Hub – Apache
Kafka
How to implement an Event Hub?
Apache Kafka to the rescue
• Distributed publish-subscribe messaging system
• Designed for processing of high-volume, real-time
activity stream data (logs, metrics, social media, …)
• Stateless (passive) architecture, offset-based
consumption
• Provides Topics, but does not implement JMS
standard
• Initially developed at LinkedIn, now part of Apache
• Peak load on a single cluster: 2 million messages/sec, 4.7
gigabits/sec inbound, 15 gigabits/sec outbound
[Diagram: producers publish to the Kafka cluster; consumers read from it.]
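Kafka's stateless, offset-based consumption model can be illustrated with a toy in-memory log (this is not a real Kafka client): the broker keeps an append-only log per topic, stays passive, and each consumer tracks its own read position:

```python
# Toy model of offset-based consumption over an append-only log.

class ToyTopic:
    def __init__(self):
        self.log: list = []              # append-only, like a Kafka partition

    def produce(self, msg: str) -> None:
        self.log.append(msg)

    def consume(self, offset: int):
        """Return (messages_from_offset, next_offset); broker keeps no
        per-consumer state - the consumer owns its offset."""
        return self.log[offset:], len(self.log)

topic = ToyTopic()
topic.produce("evt-1")
topic.produce("evt-2")
msgs, offset = topic.consume(0)           # consumer reads from the start
topic.produce("evt-3")
new_msgs, offset = topic.consume(offset)  # resumes from its own offset
print(msgs, new_msgs)  # ['evt-1', 'evt-2'] ['evt-3']
```

Because the broker keeps no delivery state, many independent consumers can read the same topic at their own pace, which is exactly what an event hub needs.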
Implementing Data Flow
Apache Flume
Distributed data collection service: gets flows of data (like logs) from their
sources and aggregates them to where they have to be processed
Sources: files, syslog, avro, …
Sinks: HDFS files, HBase, …
Source: Flume Documentation
Apache Sqoop
• Sqoop exchanges data between an RDBMS and
Hadoop
• It can import all tables, a single table, or a portion of a
table into HDFS
• Does this very efficiently via a map-only MapReduce job
• The result is a directory in HDFS containing comma-delimited text
• Sqoop can also export data from HDFS back to the
database
$ sqoop import --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers
Oracle GoldenGate
• Provides a low-impact change data
capture solution for Oracle and
non-Oracle RDBMS
• Non-intrusive
• Low-Latency
• Open, modular Architecture
• Supports heterogeneous systems
• Oracle GoldenGate for Big Data
provides Hadoop and Kafka Support
Apache Kafka Connect
• A tool for scalably and reliably streaming
data between Apache Kafka and other
data systems
• Is not an ETL framework
• Pre-built connectors available for data
sources and data sinks
• JDBC (Source)
• Oracle GoldenGate (Source)
• MQTT (Source)
• HDFS (Sink)
• Elasticsearch (Sink)
• MongoDB (Sink)
• Cassandra (Source & Sink)
Source: Confluent
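Connectors like these are registered declaratively. The sketch below builds a JDBC source configuration as it would be POSTed to Kafka Connect's REST API; the connection URL, column name and topic prefix are placeholders, and the config keys follow the Confluent JDBC source connector:

```python
# Hedged sketch of a Kafka Connect JDBC source connector configuration.
import json

connector = {
    "name": "mysql-customers-source",        # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost/company",
        "mode": "incrementing",              # pull only newly inserted rows
        "incrementing.column.name": "id",    # placeholder key column
        "topic.prefix": "company-",          # table X -> topic company-X
    },
}
payload = json.dumps(connector)
# Would be submitted with: POST http://localhost:8083/connectors
print(payload[:40])
```

Compared with the Sqoop job shown earlier, this runs continuously: Connect keeps polling the table and streams each new row into Kafka.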
Apache NiFi & MiNiFi
• Originated at NSA as Niagarafiles
• Open sourced December 2014, Apache
TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with
centralized control
• Based on flow-based programming
concepts
• Data Provenance
• Web-based user interface
• Apache MiNiFi focuses on the collection of
data at the source of its creation
StreamSets Data Collector
Founded by ex-Cloudera and ex-Informatica
employees
Continuous open source, intent-driven, big
data ingest
Visible, record-oriented approach fixes
combinatorial explosion
Batch or stream processing
• Standalone, Spark cluster, MapReduce
cluster
IDE for pipeline development by ‘civilians’
Relatively new - first public release
September 2015
So far, vast majority of commits are from
StreamSets staff
Other Alternatives
• Spring Cloud Data Flow
• Node-RED
• Project Flogo
• Oracle Streaming Analytics
• Spark Streaming
• …
What about existing Integration
Platforms?
Oracle’s Service Bus as a consumer of Kafka
[Diagram: Service Bus 12c proxy services consume from Kafka topics fed by Sensor / IoT devices, Web Apps, a Database (via CDC) and Stream Processing; pipelines route the messages to business services that expose Cloud, REST and WSDL endpoints toward Cloud Apps, Cloud APIs, Mobile Apps and Backend Apps.]
Oracle’s Service Bus as a producer to Kafka
[Diagram: Service Bus 12c proxy services accept REST and SOAP calls from Cloud Apps, Mobile Apps, Web Apps, Sensor / IoT devices and Backend Apps; pipelines route them to business services that produce to Kafka topics, from which Backend Apps and SOA / BPM platforms consume.]
Hybrid Integration Platforms (HIP) needed
Source: Gartner
Trivadis @ DOAG 2016
Booth: 3rd Floor – next to the escalator
Know-how, T-shirts, a contest and Trivadis Power to go
We look forward to your visit
Because with Trivadis you always win!