Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Building Data Pipelines with
Spark and StreamSets
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com
Agenda
Data Drift
StreamSets Data Collector
Running Pipelines on Spark Today
Future Spark Integration
Demo
Past ETL ETL
Emerging Ingest Analyze
Data Sources Data Stores Data Consumers
The Evolution of Data-in-Motion
Data Drift - a Data Engineering Headache
The unpredictable, unannounced and unending mutation of data
characteristics caus...
SQL on Hadoop (Hive) Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data,
while the remainin...
Delayed and
False Insights
Solving Data Drift
Tools
Applications
Data Stores Data Consumers
Poor Data QualityData Drift
Cu...
Solving Data Drift
Trusted InsightsData KPIs
Tools
Applications
Data Drift
Intent-Driven
Drift-Handling
Data Stores Data C...
StreamSets Data Collector
Open source software for the
rapid development and reliably
operation of complex data
flows.
➢ I...
Handling Drift with Hive
➢ Monitor data structure
➢ Detect schema change
➢ Alter Hive Metadata
Running Pipelines on Spark Today
spark-submit
--num-executors ...
--archives ...
--files ...
--jar ...
--class ... ● Conta...
SDC on Spark - Connectivity
Sources
• Kafka
Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
...
Future Directions
Run Pipelines on Databricks
● Container on Databricks
● Leverage REST API
● Add S3 origin
{
"name":"StreamSets Data Collec...
Break Out Spark Processor
● Standalone containers,
Spark processor
● Leverage Spark code
● Custom RDD
● Start local Spark ...
Spark Processor - Connectivity
Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Local Filesystem
• Redis
• JMS
• HTT...
Deepen Spark Integration
● Container on Spark, Spark processor
● Leverage Spark code
● Custom RDD
● Start Spark job ‘on cl...
Spark Integration Architecture
SDC on Spark - Connectivity Tomorrow
Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Redis
• JMS
• HTTP
• UDP
• ......
Demo
Conclusion
StreamSets Data Collector brings a UI abstraction to Spark
Standalone container + local Spark Processor bring w...
Resources
Download StreamSets Data Collector
https://streamsets.com/opensource
Contribute Code
https://github.com/streamse...
Thank You!
Backup Slides
Structure
Drift
Data structures and
formats evolve and change
unexpectedly
Implication:
Data Loss
Data Squandering
Delimit...
Semantic
Drift
Data semantics change
with evolving applications
Implication:
Data Corrosion
Data Loss
Semantic Drift
24122...
Infrastructure
Drift
Physical and Logical
Infrastructure changes
rapidly
Implication:
Poor Agility
Operational Downtime
Da...
Upcoming SlideShare
Loading in …5
×

Building Data Pipelines with Spark and StreamSets

2,597 views

Published on

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.

Published in: Software

Building Data Pipelines with Spark and StreamSets

  1. 1. Building Data Pipelines with Spark and StreamSets Pat Patterson Community Champion @metadaddy pat@streamsets.com
  2. 2. Agenda Data Drift StreamSets Data Collector Running Pipelines on Spark Today Future Spark Integration Demo
  3. 3. Past ETL ETL Emerging Ingest Analyze Data Sources Data Stores Data Consumers The Evolution of Data-in-Motion
  4. 4. Data Drift - a Data Engineering Headache The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data Structure Drift Semantic Drift Infrastructure Drift
  5. 5. SQL on Hadoop (Hive) Y/Y Click Through Rate 80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis Example: Data Loss and Corrosion
  6. 6. Delayed and False Insights Solving Data Drift Tools Applications Data Stores Data Consumers Poor Data QualityData Drift Custom code Fixed-schema Data Sources // DIY Custom Code
  7. 7. Solving Data Drift Trusted InsightsData KPIs Tools Applications Data Drift Intent-Driven Drift-Handling Data Stores Data ConsumersData Sources // DIY Custom Code
  8. 8. StreamSets Data Collector Open source software for the rapid development and reliably operation of complex data flows. ➢ Intent-driven ➢ UI Abstraction ➢ Extensible
  9. 9. Handling Drift with Hive ➢ Monitor data structure ➢ Detect schema change ➢ Alter Hive Metadata
  10. 10. Running Pipelines on Spark Today spark-submit --num-executors ... --archives ... --files ... --jar ... --class ... ● Container on Spark ● Leverage Kafka RDD ● Scale out for performance
  11. 11. SDC on Spark - Connectivity Sources • Kafka Destinations • HDFS • HBase • S3 • Kudu • MapR DB • Cassandra • ElasticSearch • Kafka • MapR Streams • Kinesis • etc, etc, etc!
  12. 12. Future Directions
  13. 13. Run Pipelines on Databricks ● Container on Databricks ● Leverage REST API ● Add S3 origin { "name":"StreamSets Data Collector", "new_cluster":{ "num_workers":1 }, "spark_jar_task":{ "main_class_name":"com.streamsets..." } }
  14. 14. Break Out Spark Processor ● Standalone containers, Spark processor ● Leverage Spark code ● Custom RDD ● Start local Spark job for each batch ● Example use cases: running image classification, sentiment analysis Local Mode
  15. 15. Spark Processor - Connectivity Sources • Kafka • S3 • MapR Streams • JDBC • MongoDB • Local Filesystem • Redis • JMS • HTTP • UDP • etc, etc, etc! Destinations • HDFS • HBase • S3 • Kudu • MapR DB • Cassandra • ElasticSearch • Kafka • MapR Streams • JDBC • etc, etc, etc!
  16. 16. Deepen Spark Integration ● Container on Spark, Spark processor ● Leverage Spark code ● Custom RDD ● Start Spark job ‘on cluster’ for each pipeline ● Example use cases: training image classification, sentiment analysis
  17. 17. Spark Integration Architecture
  18. 18. SDC on Spark - Connectivity Tomorrow Sources • Kafka • S3 • MapR Streams • JDBC • MongoDB • Redis • JMS • HTTP • UDP • ...any partitionable data source... Destinations • HDFS • HBase • S3 • Kudu • MapR DB • Cassandra • ElasticSearch • Kafka • MapR Streams • JDBC • etc, etc, etc!
  19. 19. Demo
  20. 20. Conclusion StreamSets Data Collector brings a UI abstraction to Spark Standalone container + local Spark Processor bring wide connectivity to Spark code Spark Container + Spark Processor allow iterative Spark code in pipelines
  21. 21. Resources Download StreamSets Data Collector https://streamsets.com/opensource Contribute Code https://github.com/streamsets/datacollector Get Involved https://streamsets.com/community
  22. 22. Thank You!
  23. 23. Backup Slides
  24. 24. Structure Drift Data structures and formats evolve and change unexpectedly Implication: Data Loss Data Squandering Delimited Data 107.3.137.195 fe80::21b:21ff:fe83:90fa Attribute Format Changes { “first“: “jon” “last“: “smith” “email“: “jsmith@acme.com” “add1“: “123 Washington” “add2“: “” “city“: “Tucson” “state“: “AZ” “zip“: “85756” } { “first“: “jane” “last“: “smith” “email“: “jane@earth.net” “add1“: “456 Fillmore” “add2“: “Apt 120” “city“: “Fairfield” “state“: “VA” “zip“: “24435-1001” “phone”: “401-555-1212” } Data Structure Evolution Structure Drift
  25. 25. Semantic Drift Data semantics change with evolving applications Implication: Data Corrosion Data Loss Semantic Drift 24122-52172 00-24122-52172 Account Number Expansion M134: user {jsmith} read access granted {ac:24122-52172} M134: user {jsmith} read access granted {ca.ac:24122-52172} Namespace Qualification …… …,3588310669797950,$91.41,jcb,K1088-W#9,… …,6759006011936944,$155.04,switch,A6504-Y#9,… …,6771111111151415,$37.78,laser,Q9936-T#9,… …,3585905063294299,$164.48,jcb,S4643-H#9,… …,5363527828638736,$117.52,mastercard,X3286-P#9,… …,4903080150282806,$168.03,switch,I9133-W#3,… …… Outlier / Anomaly Detection
  26. 26. Infrastructure Drift Physical and Logical Infrastructure changes rapidly Implication: Poor Agility Operational Downtime Data Center 1 Data Center 2 Data Center n 3rd Party Service Provider App a App k App q Cloud Infrastructure Infrastructure Drift

×