SlideShare a Scribd company logo
1 of 23
Download to read offline
Building a data pipeline to ingest
data into Hadoop in minutes
using Streamsets Data Collector
Guglielmo Iozzia,
Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area (cloud applications) to which my team belongs there were so
many questions to be answered. They were related to:
● Defect analysis
● Outage analysis
● Cyber-Security
“Data is the second
most important
thing in analytics”
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
… and so many tools available to get the data
What are we going to do with all those data?
Issues
● The need to collect data from multiple sources introduces redundancy, which
costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in
general) in managing Big Data tools.
● Low budget.
Alternatives
#1 Panic
Alternatives
#2 Cloning team members
Alternatives
#3 Find a smart way to simplify the data ingestion
process
A single tool needed...
● Design complex data flows with minimal coding and the maximum flexibility.
● Provide real-time data flow statistics, metrics for each flow stage.
● Automated error handling and alerting.
● Easy to use by everyone.
● Zero-downtime when upgrading the infrastructure due to logical isolation of
each flow stage.
● Open Source
… something like this
Streamsets Data Collector
Streamsets Data Collector
Streamsets Data Collector: supported origins
Streamsets Data Collector: available destinations
Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
Streamsets Data Collector
Demo
Streamsets DC: performance and reliability
● Two available execution modes: standalone or cluster
● Implemented in Java: so any performance best practice/recommendation for
Java applications applies here
● REST services for performance monitoring available
● Rules and alerts (metric and data both)
Streamsets Data Collector: security
● You can authenticate user accounts based on LDAP
● Authorization: the Data Collector provides several roles (admin, manager,
creator, guest)
● You can use Kerberos authentication to connect to origin and destination
systems
● Follow the usual security best practices in terms of iptables, networking, etc.
for Java web applications running on Linux machines.
Useful Links
Streamsets Data Collector:
https://streamsets.com/product/
Thanks!
My contacts:
Linkedin: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia
Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector

More Related Content

What's hot

Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsKinetica
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data IntegrationsPat Patterson
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakePat Patterson
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfHUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfJohn Mulhall
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scaledatamantra
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics toolsNascenia IT
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on EverythingDavid Phillips
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexDataWorks Summit
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSpark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisJen Aman
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Flink Forward
 

What's hot (20)

Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdfHUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
HUGIreland_CronanMcNamara_DataScience_ExpertModels.pdf
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 

Viewers also liked

A REPORT OF PROFESSIONAL TRAINING WORKSHOP
A REPORT OF PROFESSIONAL TRAINING WORKSHOPA REPORT OF PROFESSIONAL TRAINING WORKSHOP
A REPORT OF PROFESSIONAL TRAINING WORKSHOPEzekiel Tunde ADEBAMIWI
 
Propiedad intelectual y Proteccion Juridica del Software
Propiedad intelectual y Proteccion Juridica del SoftwarePropiedad intelectual y Proteccion Juridica del Software
Propiedad intelectual y Proteccion Juridica del Softwarejorge quispe
 
Informatica
InformaticaInformatica
Informaticaxgmikeh
 
JSIL Print Media Presentation_PR Lipper Award 2016
JSIL Print Media Presentation_PR Lipper Award 2016JSIL Print Media Presentation_PR Lipper Award 2016
JSIL Print Media Presentation_PR Lipper Award 2016Ahmad butt
 
Presentación Rol del estudiante y de los tutores en la educación a distancia
Presentación  Rol del estudiante y de los tutores en la educación a distanciaPresentación  Rol del estudiante y de los tutores en la educación a distancia
Presentación Rol del estudiante y de los tutores en la educación a distanciaMONICA CALDERON
 
Derecho informatico
Derecho informaticoDerecho informatico
Derecho informaticojorge quispe
 
Contratacion Electronica & Contratacion Informatica
Contratacion Electronica & Contratacion InformaticaContratacion Electronica & Contratacion Informatica
Contratacion Electronica & Contratacion Informaticajorge quispe
 
Teletrabajo en la administración pública
Teletrabajo en la administración públicaTeletrabajo en la administración pública
Teletrabajo en la administración públicajorge quispe
 
Delitos informaticos
Delitos informaticosDelitos informaticos
Delitos informaticosjorge quispe
 
Analisis economico del derecho
Analisis economico del derechoAnalisis economico del derecho
Analisis economico del derechojorge quispe
 
Comercio electrónico
Comercio electrónicoComercio electrónico
Comercio electrónicojorge quispe
 
CISSP Prep: Ch 1: Security Governance Through Principles and Policies
CISSP Prep: Ch 1: Security Governance Through Principles and PoliciesCISSP Prep: Ch 1: Security Governance Through Principles and Policies
CISSP Prep: Ch 1: Security Governance Through Principles and PoliciesSam Bowne
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Practical Malware Analysis: Ch 11: Malware Behavior
Practical Malware Analysis: Ch 11: Malware BehaviorPractical Malware Analysis: Ch 11: Malware Behavior
Practical Malware Analysis: Ch 11: Malware BehaviorSam Bowne
 
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgPractical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgSam Bowne
 

Viewers also liked (20)

Base de datos
Base de datosBase de datos
Base de datos
 
A REPORT OF PROFESSIONAL TRAINING WORKSHOP
A REPORT OF PROFESSIONAL TRAINING WORKSHOPA REPORT OF PROFESSIONAL TRAINING WORKSHOP
A REPORT OF PROFESSIONAL TRAINING WORKSHOP
 
Propiedad intelectual y Proteccion Juridica del Software
Propiedad intelectual y Proteccion Juridica del SoftwarePropiedad intelectual y Proteccion Juridica del Software
Propiedad intelectual y Proteccion Juridica del Software
 
Informatica
InformaticaInformatica
Informatica
 
JSIL Print Media Presentation_PR Lipper Award 2016
JSIL Print Media Presentation_PR Lipper Award 2016JSIL Print Media Presentation_PR Lipper Award 2016
JSIL Print Media Presentation_PR Lipper Award 2016
 
Presentación Rol del estudiante y de los tutores en la educación a distancia
Presentación  Rol del estudiante y de los tutores en la educación a distanciaPresentación  Rol del estudiante y de los tutores en la educación a distancia
Presentación Rol del estudiante y de los tutores en la educación a distancia
 
Housing Academy 2
Housing Academy 2Housing Academy 2
Housing Academy 2
 
Derecho informatico
Derecho informaticoDerecho informatico
Derecho informatico
 
Contratacion Electronica & Contratacion Informatica
Contratacion Electronica & Contratacion InformaticaContratacion Electronica & Contratacion Informatica
Contratacion Electronica & Contratacion Informatica
 
Chan
ChanChan
Chan
 
Teletrabajo en la administración pública
Teletrabajo en la administración públicaTeletrabajo en la administración pública
Teletrabajo en la administración pública
 
tieguy brochure
tieguy brochuretieguy brochure
tieguy brochure
 
Delitos informaticos
Delitos informaticosDelitos informaticos
Delitos informaticos
 
Tic
TicTic
Tic
 
Analisis economico del derecho
Analisis economico del derechoAnalisis economico del derecho
Analisis economico del derecho
 
Comercio electrónico
Comercio electrónicoComercio electrónico
Comercio electrónico
 
CISSP Prep: Ch 1: Security Governance Through Principles and Policies
CISSP Prep: Ch 1: Security Governance Through Principles and PoliciesCISSP Prep: Ch 1: Security Governance Through Principles and Policies
CISSP Prep: Ch 1: Security Governance Through Principles and Policies
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Practical Malware Analysis: Ch 11: Malware Behavior
Practical Malware Analysis: Ch 11: Malware BehaviorPractical Malware Analysis: Ch 11: Malware Behavior
Practical Malware Analysis: Ch 11: Malware Behavior
 
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbgPractical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
Practical Malware Analysis: Ch 10: Kernel Debugging with WinDbg
 

Similar to Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...PerformanceVision (previously SecurActive)
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupJelena Zanko
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Challenges of monitoring distributed systems
Challenges of monitoring distributed systemsChallenges of monitoring distributed systems
Challenges of monitoring distributed systemsNenad Bozic
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...In-Memory Computing Summit
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysisDhaval Mehta
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013ScaleOut Software
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Deepu K Sasidharan
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterJulien Dubois
 
IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud Pradeep Natarajan
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16AppDynamics
 
Splunk App for Stream
Splunk App for StreamSplunk App for Stream
Splunk App for StreamSplunk
 

Similar to Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector (20)

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...How to create custom dashboards in Elastic Search / Kibana with Performance V...
How to create custom dashboards in Elastic Search / Kibana with Performance V...
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid Meetup
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Challenges of monitoring distributed systems
Challenges of monitoring distributed systemsChallenges of monitoring distributed systems
Challenges of monitoring distributed systems
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysis
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipster
 
IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud IBM IoT Architecture and Capabilities at the Edge and Cloud
IBM IoT Architecture and Capabilities at the Edge and Cloud
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
 
Splunk App for Stream
Splunk App for StreamSplunk App for Stream
Splunk App for Stream
 

Recently uploaded

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 

Recently uploaded (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 

Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector

  • 1. Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector Guglielmo Iozzia, Big Data Infrastructure Engineer @ IBM Ireland
  • 2. Data Ingestion for Analytics: a real scenario In the business area (cloud applications) to which my team belongs there were so many questions to be answered. They were related to: ● Defect analysis ● Outage analysis ● Cyber-Security
  • 3. “Data is the second most important thing in analytics”
  • 4. Data Ingestion: multiple sources... ● Legacy systems ● DB2 ● Lotus Domino ● MongoDB ● Application logs ● System logs ● New Relic ● Jenkins pipelines ● Testing tools output ● RESTful Services
  • 5. … and so many tools available to get the data
  • 6. What are we going to do with all those data?
  • 7. Issues ● The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times. ● A small team. ● Lack of skills and experience across the team (and the business area in general) in managing Big Data tools. ● Low budget.
  • 10. Alternatives #3 Find a smart way to simplify the data ingestion process
  • 11. A single tool needed... ● Design complex data flows with minimal coding and the maximum flexibility. ● Provide real-time data flow statistics, metrics for each flow stage. ● Automated error handling and alerting. ● Easy to use by everyone. ● Zero-downtime when upgrading the infrastructure due to logical isolation of each flow stage. ● Open Source
  • 15. Streamsets Data Collector: supported origins
  • 16. Streamsets Data Collector: available destinations
  • 17. Streamsets Data Collector: available processors ● Base64 Field Decoder ● Base64 Field Encoder ● Expression Evaluator ● Field Converter ● JavaScript Evaluator ● JSON Parser ● Jython Evaluator ● Log Parser ● Stream Selector ● XML Parser ...and many others
  • 19. Streamsets DC: performance and reliability ● Two available execution modes: standalone or cluster ● Implemented in Java: so any performance best practice/recommendation for Java applications applies here ● REST services for performance monitoring available ● Rules and alerts (metric and data both)
  • 20. Streamsets Data Collector: security ● You can authenticate user accounts based on LDAP ● Authorization: the Data Collector provides several roles (admin, manager, creator, guest) ● You can use Kerberos authentication to connect to origin and destination systems ● Follow the usual security best practices in terms of iptables, networking, etc. for Java web applications running on Linux machines.
  • 21. Useful Links Streamsets Data Collector: https://streamsets.com/product/
  • 22. Thanks! My contacts: Linkedin: https://ie.linkedin.com/in/giozzia Blog: http://googlielmo.blogspot.ie/ Twitter: https://twitter.com/guglielmoiozzia