August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector

25,558 views

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we'll look at how SDC's "intent-driven" approach keeps the data flowing, whether you're processing data 'off-cluster', in Spark, or in MapReduce.
StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.

Speakers:
Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. Part of the developer evangelism team at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.

Published in: Technology
  1. Open Source Big Data Ingest with StreamSets Data Collector. Pat Patterson, Community Champion, @metadaddy, pat@streamsets.com
  2. Company Background: Traditional and Big Data; Founders; Top-tier Investors; Strategic Partners. Momentum to Date: launched 2014; exited stealth 9/15; ~30 employees; double-digit enterprise customers; 10,000 downloads
  3. Market Trends. Past: ETL, ETL. Emerging: Ingest, Analyze. Data Sources, Data Stores, Data Consumers
  4. Data Drift: the unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data. Structure Drift, Semantic Drift, Infrastructure Drift
  5. Solving Data Drift. Data Sources, Data Stores, Data Consumers. Tools, Applications. Custom code, Fixed-schema. Data Drift, Poor Data Quality, Delayed and False Insights
  6. Solving Data Drift. Data Sources, Data Stores, Data Consumers. Tools, Applications. Intent-Driven, Drift-Handling. Data Drift, Data KPIs, Trusted Insights
  7. Example: Data Loss and Corrosion. SQL on Hadoop (Hive): Y/Y Click Through Rate. 80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
  8. StreamSets Data Collector: open source software for the rapid development and reliable operation of complex data flows. ➢ Efficiency ➢ Control ➢ Agility
  9. SDC Demo: StreamSets Data Collector, Apache Kafka, Apache Kudu
  10. Upcoming Events: SF Bay Area Data Ingest Meetup - Aug 25, Palo Alto, CA; MapR Big Data Everywhere - Aug 30, San Francisco, CA; Strata + Hadoop World - Sep 27-29, New York, NY
  11. Thank You!
  12. Structure Drift: data structures and formats evolve and change unexpectedly. Implication: Data Loss, Data Squandering.
      Delimited Data
      Attribute Format Changes: 107.3.137.195 / fe80::21b:21ff:fe83:90fa
      Data Structure Evolution:
        { "first": "jon", "last": "smith", "email": "jsmith@acme.com", "add1": "123 Washington", "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756" }
        { "first": "jane", "last": "smith", "email": "jane@earth.net", "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield", "state": "VA", "zip": "24435-1001", "phone": "401-555-1212" }
  13. Semantic Drift: data semantics change with evolving applications. Implication: Data Corrosion, Data Loss.
      Account Number Expansion: 24122-52172 / 00-24122-52172
      Namespace Qualification:
        M134: user {jsmith} read access granted {ac:24122-52172}
        M134: user {jsmith} read access granted {ca.ac:24122-52172}
      Outlier / Anomaly Detection:
        …,3588310669797950,$91.41,jcb,K1088-W#9,…
        …,6759006011936944,$155.04,switch,A6504-Y#9,…
        …,6771111111151415,$37.78,laser,Q9936-T#9,…
        …,3585905063294299,$164.48,jcb,S4643-H#9,…
        …,5363527828638736,$117.52,mastercard,X3286-P#9,…
        …,4903080150282806,$168.03,switch,I9133-W#3,…
  14. Infrastructure Drift: physical and logical infrastructure changes rapidly. Implication: Poor Agility, Operational Downtime. Data Center 1, Data Center 2, Data Center n; 3rd Party Service Provider; Cloud Infrastructure; App a, App k, App q
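
The structure drift examples on slide 12 can be made concrete with a small sketch. This is not how StreamSets Data Collector detects drift; it is a plain Python illustration, and the helper names (`field_drift`, `zip_format`, `ip_version`) are hypothetical. It compares the two records from the slide and classifies the attribute formats that changed.

```python
import ipaddress
import re

# The two example records from the Structure Drift slide.
OLD_RECORD = {
    "first": "jon", "last": "smith", "email": "jsmith@acme.com",
    "add1": "123 Washington", "add2": "", "city": "Tucson",
    "state": "AZ", "zip": "85756",
}
NEW_RECORD = {
    "first": "jane", "last": "smith", "email": "jane@earth.net",
    "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield",
    "state": "VA", "zip": "24435-1001", "phone": "401-555-1212",
}


def field_drift(baseline: dict, record: dict) -> dict:
    """Report fields that appeared or disappeared relative to a baseline record."""
    base_keys, rec_keys = set(baseline), set(record)
    return {
        "added": sorted(rec_keys - base_keys),
        "removed": sorted(base_keys - rec_keys),
    }


def zip_format(zip_code: str) -> str:
    """Classify a US ZIP code string as 'zip5', 'zip+4', or 'unknown'."""
    if re.fullmatch(r"\d{5}", zip_code):
        return "zip5"
    if re.fullmatch(r"\d{5}-\d{4}", zip_code):
        return "zip+4"
    return "unknown"


def ip_version(address: str) -> int:
    """Return 4 or 6 for a valid IP address string (raises ValueError otherwise)."""
    return ipaddress.ip_address(address).version


print(field_drift(OLD_RECORD, NEW_RECORD))
print(zip_format(OLD_RECORD["zip"]), zip_format(NEW_RECORD["zip"]))
print(ip_version("107.3.137.195"), ip_version("fe80::21b:21ff:fe83:90fa"))
```

A fixed-schema consumer that expects exactly the eight original fields, a five-digit ZIP and an IPv4 address would either drop or reject the second record, which is the data loss the slide refers to.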

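
The semantic drift examples on slide 13 (account number expansion and namespace qualification) both change the shape of a value without changing the field it lives in. One simple way to notice this, sketched below, is to reduce each value to a pattern of digits and letters and flag patterns that have not been seen before. The function names `shape` and `unseen_shapes` are hypothetical, and the approach is an illustrative assumption, not a description of StreamSets Data Collector's internals.

```python
import re


def shape(value: str) -> str:
    """Reduce a value to its shape: every digit becomes '9', every letter 'a'."""
    value = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "a", value)


def unseen_shapes(known_shapes: set, values: list) -> dict:
    """Map each shape not in known_shapes to the first value that exhibited it."""
    unseen = {}
    for value in values:
        s = shape(value)
        if s not in known_shapes and s not in unseen:
            unseen[s] = value
    return unseen


# Account numbers from the slide: the original form, then the expanded form.
known = {shape("24122-52172")}
incoming = ["24122-52172", "00-24122-52172"]
print(unseen_shapes(known, incoming))

# The same check applied to the namespace-qualified identifier.
print(shape("ac:24122-52172"), shape("ca.ac:24122-52172"))
```

A shape check only signals that something changed. Deciding whether `00-24122-52172` refers to the same account as `24122-52172` still needs knowledge of the upstream application.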