August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we'll look at how SDC's "intent-driven" approach keeps the data flowing, whether you're processing data 'off-cluster', in Spark, or in MapReduce.

StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.

Speakers: Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. Part of the developer evangelism team at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.

Open Source Big Data Ingest with
StreamSets Data Collector
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com

Company Background
Founders
Traditional and Big Data
Top tier Investors
Strategic Partners
Momentum to Date
Launched 2014; exited stealth 9/15
~30 employees
Double-digit enterprise customers
10,000 downloads

Market Trends
Data Sources, Data Stores, Data Consumers
Past: ETL, ETL
Emerging: Ingest, Analyze

Data Drift
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift

Solving Data Drift
Data Sources, Data Stores, Data Consumers
Tools, Applications
Custom code, Fixed-schema
Data Drift, Poor Data Quality, Delayed and False Insights

Solving Data Drift
Data Sources, Data Stores, Data Consumers
Tools, Applications
Intent-Driven, Drift-Handling
Data Drift, Data KPIs, Trusted Insights

Example: Data Loss and Corrosion
SQL on Hadoop (Hive) Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

StreamSets Data Collector
Open source software for the rapid development and reliable operation of complex data flows.
➢ Efficiency
➢ Control
➢ Agility

SDC Demo
StreamSets Data Collector, Apache Kafka, Apache Kudu

Upcoming Events
SF Bay Area Data Ingest Meetup - Aug 25, Palo Alto, CA
MapR Big Data Everywhere - Aug 30, San Francisco, CA
Strata + Hadoop World - Sep 27-29, New York, NY

Structure Drift
Data structures and formats evolve and change unexpectedly
Implication: Data Loss, Data Squandering
Delimited Data
Attribute Format Changes
107.3.137.195 fe80::21b:21ff:fe83:90fa
Data Structure Evolution
{
  "first": "jon",
  "last": "smith",
  "email": "jsmith@acme.com",
  "add1": "123 Washington",
  "add2": "",
  "city": "Tucson",
  "state": "AZ",
  "zip": "85756"
}
{
  "first": "jane",
  "last": "smith",
  "email": "jane@earth.net",
  "add1": "456 Fillmore",
  "add2": "Apt 120",
  "city": "Fairfield",
  "state": "VA",
  "zip": "24435-1001",
  "phone": "401-555-1212"
}
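The two records and the two addresses on this slide can be compared programmatically. The following is a minimal Python sketch of that idea, not StreamSets Data Collector code: the function names (`structure_drift`, `zip_format`, `ip_version`) are hypothetical, the records are copied from the slide, and the ZIP patterns are an assumption about US postal codes. It reports fields that appear or disappear between records and classifies the format of a value.

```python
import ipaddress
import re

# Records copied from the slide, written as Python dicts.
OLD_RECORD = {
    "first": "jon", "last": "smith", "email": "jsmith@acme.com",
    "add1": "123 Washington", "add2": "", "city": "Tucson",
    "state": "AZ", "zip": "85756",
}
NEW_RECORD = {
    "first": "jane", "last": "smith", "email": "jane@earth.net",
    "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield",
    "state": "VA", "zip": "24435-1001", "phone": "401-555-1212",
}


def structure_drift(baseline, record):
    """Return (added, missing) field names of `record` relative to `baseline`."""
    added = sorted(set(record) - set(baseline))
    missing = sorted(set(baseline) - set(record))
    return added, missing


def zip_format(value):
    """Classify a US ZIP code string (assumed patterns: 5 digits, or 5+4)."""
    if re.fullmatch(r"\d{5}", value):
        return "ZIP5"
    if re.fullmatch(r"\d{5}-\d{4}", value):
        return "ZIP+4"
    return "unknown"


def ip_version(text):
    """Return 4 or 6 for a valid IP address; raises ValueError otherwise."""
    return ipaddress.ip_address(text).version


if __name__ == "__main__":
    print(structure_drift(OLD_RECORD, NEW_RECORD))
    print(zip_format(OLD_RECORD["zip"]), zip_format(NEW_RECORD["zip"]))
    print(ip_version("107.3.137.195"), ip_version("fe80::21b:21ff:fe83:90fa"))
```

A fixed-schema consumer would silently drop the new "phone" field or reject the nine-digit ZIP; a check like this only makes the change visible so that it can be handled deliberately.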

Semantic Drift
Data semantics change with evolving applications
Implication: Data Corrosion, Data Loss
Account Number Expansion
24122-52172 00-24122-52172
Namespace Qualification
M134: user {jsmith} read access granted {ac:24122-52172}
M134: user {jsmith} read access granted {ca.ac:24122-52172}
Outlier / Anomaly Detection
……
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
……
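The three semantic-drift examples on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not StreamSets Data Collector code: all function names are hypothetical, the account pattern (an optional two-digit prefix before a 5-5 digit core) is inferred only from the two values shown, and the slide does not say which field carries the anomaly, so the sketch assumes it is the digit after `#` in the last visible column.

```python
import re
from collections import Counter

# Assumed account format: optional two-digit prefix, then 5 digits, dash, 5 digits.
ACCOUNT_RE = re.compile(r"(?:(?P<prefix>\d{2})-)?(?P<core>\d{5}-\d{5})")
# A braced key:value pair in a log line, e.g. {ac:24122-52172}.
LOG_RE = re.compile(r"\{(?P<key>[\w.]+):(?P<value>[\d-]+)\}")


def normalize_account(value):
    """Split an account number into (prefix, core); prefix is None if absent."""
    match = ACCOUNT_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized account number: {value!r}")
    return match.group("prefix"), match.group("core")


def strip_namespace(key):
    """Drop any dotted namespace qualifier: 'ca.ac' and 'ac' both give 'ac'."""
    return key.rsplit(".", 1)[-1]


def account_from_log(line):
    """Return (unqualified key, value) from the first key:value pair in a line."""
    match = LOG_RE.search(line)
    if match is None:
        raise ValueError("no key:value pair found")
    return strip_namespace(match.group("key")), match.group("value")


def rare_values(values, max_share=0.2):
    """Return values whose share of all observations is at most `max_share`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total <= max_share)


if __name__ == "__main__":
    print(normalize_account("24122-52172"), normalize_account("00-24122-52172"))
    print(account_from_log("M134: user {jsmith} read access granted {ca.ac:24122-52172}"))
    codes = ["K1088-W#9", "A6504-Y#9", "Q9936-T#9", "S4643-H#9", "X3286-P#9", "I9133-W#3"]
    print(rare_values([code.rsplit("#", 1)[-1] for code in codes]))
```

The `max_share` threshold is arbitrary; with only six visible rows it serves to show the shape of a frequency check rather than a tuned detector.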

Infrastructure Drift
Physical and Logical Infrastructure changes rapidly
Implication: Poor Agility, Operational Downtime
Data Center 1, Data Center 2, Data Center n
3rd Party Service Provider
Cloud Infrastructure
App a, App k, App q
