Exceptions are the Norm
Dealing with Bad Actors in ETL
Sameer Agarwal
Spark Summit | Boston | Feb 9th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Overview
1. What’s an ETL Pipeline?
- How is it different from a regular query execution pipeline?
2. Using SparkSQL for ETL
- Dealing with Dirty Data (Bad Records or Files)
- Performance (Project Tungsten)
3. New Features in Spark 2.2 and 2.3
- Focus on building ETL-friendly pipelines
What is a Data Pipeline?
1. Sequence of transformations on data
2. Source data is typically semi-structured/unstructured (JSON, CSV etc.)
3. Output data is structured and ready for use by analysts and data scientists
4. Source and destination are often on different storage systems
Example of a Data Pipeline
[Figure: data pipeline diagram. Labels: Kafka, Database, Cloud, Warehouse, Logs; Aggregate Reporting, Applications, ML Model, Ad-hoc Queries]
ETL is the First Step in a Data Pipeline
1. ETL stands for EXTRACT, TRANSFORM and LOAD
2. Goal is to “clean” or “curate” the data
- Retrieve data from source (EXTRACT)
- Transform data into a consumable format (TRANSFORM)
- Transmit data to downstream consumers (LOAD)
An ETL Query in Spark
spark.read.csv("/source/path")   // EXTRACT
.filter(...)                     // TRANSFORM
.agg(...)
.write.mode("append")            // LOAD
.parquet("/output/path")
What’s so hard about ETL Queries?
Why is ETL Hard?
1. Data can be Messy
- Incomplete information
- Missing data stored as empty strings, “none”, “missing”, “xxx” etc.
2. Data can be Inconsistent
- Data conversion and type validation are error-prone in many cases
- E.g., expecting a number but found “123 000”
- Different date formats: “31/12/2017” vs. “12/31/2017”
- Incorrect information
- E.g., expecting 5 fields in CSV, but can’t find 5 fields
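The kinds of messiness listed above can be made concrete with a small sketch. It is plain Python rather than Spark, and the sentinel list, the helper names (`clean_missing`, `parse_int`, `parse_date`) and the explicit `day_first` flag are all assumptions made for illustration.

```python
from datetime import date, datetime
from typing import Optional

# Assumed set of placeholder strings that stand for "no value".
MISSING = {"", "none", "missing", "xxx"}


def clean_missing(raw: str) -> Optional[str]:
    """Map placeholder strings to None, otherwise return the trimmed value."""
    value = raw.strip()
    return None if value.lower() in MISSING else value


def parse_int(raw: str) -> Optional[int]:
    """Parse an integer, tolerating spaces used as thousands separators."""
    value = clean_missing(raw)
    if value is None:
        return None
    return int(value.replace(" ", ""))


def parse_date(raw: str, day_first: bool) -> Optional[date]:
    """Parse a slash-separated date; the caller states which field is the day."""
    value = clean_missing(raw)
    if value is None:
        return None
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(value, fmt).date()
```

The date format is passed in explicitly because a value such as “01/02/2017” cannot be disambiguated from the data alone.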
Why is ETL Hard?
3. Data can be Constantly Arriving
- At least once or exactly once semantics
- Fault tolerance
- Scalability
4. Data can be Complex
- E.g., nested JSON data to extract and flatten
- Dealing with inconsistency is even worse
This is why ETL is important
Consumers of this data don’t want to deal with this messiness and complexity
On the flip side
1. A few bad records can fail a job
• These are not the same as transient errors
• No recourse for recovery
2. Support for ETL features
• File formats and conversions have gaps
• E.g., multi-line support, date conversions
3. Performance
Spark's flexible APIs, support for a wide variety of data sources, and state-of-the-art Tungsten execution engine make it a great framework for building end-to-end ETL pipelines
Using SparkSQL for ETL
Dealing with Bad Data: Skip Corrupt Files
spark.read.csv("/source/path")
.filter(...)
.agg(...)
.write.mode("append")
.parquet("/output/path")
Dealing with Bad Data: Skip Corrupt Files
spark.sql.files.ignoreCorruptFiles = true
[SPARK-17850] If true, Spark jobs will continue to run when they encounter corrupt or non-existent files. The contents that have been read will still be returned.
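The behaviour described for this setting can be illustrated with a minimal sketch in plain Python. It is not Spark's implementation: the function name `read_lines`, the `ignore_corrupt_files` flag, and the choice of gzip-compressed text files as input are assumptions made for the example.

```python
import gzip
import zlib
from typing import Iterable, List


def read_lines(paths: Iterable[str], ignore_corrupt_files: bool) -> List[str]:
    """Read gzip-compressed text files, optionally skipping unreadable ones."""
    lines: List[str] = []
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                for line in handle:
                    # Lines appended before a failure stay in the result.
                    lines.append(line.rstrip("\n"))
        except (OSError, EOFError, zlib.error) as err:
            # A missing file raises FileNotFoundError (an OSError); invalid or
            # truncated gzip data raises OSError, EOFError or zlib.error.
            if not ignore_corrupt_files:
                raise
            print(f"skipping {path}: {err}")
    return lines
```

With the flag off, the first unreadable file fails the whole read, which mirrors the "a few bad records can fail a job" problem.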
Dealing with Bad Data: Skip Corrupt Records
[SPARK-12833][SPARK-13764] Text-based file formats (JSON and CSV) support 3 different ParseModes while reading data:
1. PERMISSIVE
2. DROPMALFORMED
3. FAILFAST
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "PERMISSIVE")
.json(corruptRecords)
.show()
The name of the column that holds corrupt records can be configured via
spark.sql.columnNameOfCorruptRecord
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "DROPMALFORMED")
.json(corruptRecords)
.show()
JSON: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.show()
org.apache.spark.sql.catalyst.json
.SparkSQLJsonProcessingException:
Malformed line in FAILFAST mode:
{"a":{, b:3}
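The three parse modes shown for JSON can be modelled with a short sketch in plain Python. This only imitates the semantics described on the slides and is not Spark's code; `read_json_lines` is a hypothetical helper, and `_corrupt_record` is used because it is the default value of spark.sql.columnNameOfCorruptRecord.

```python
import json
from typing import Dict, List

CORRUPT_COLUMN = "_corrupt_record"
MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_json_lines(lines: List[str], mode: str = "PERMISSIVE") -> List[Dict[str, object]]:
    """Parse one JSON object per line, handling bad lines according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows: List[Dict[str, object]] = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            rows.append({**record, CORRUPT_COLUMN: None})
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                raise ValueError(f"Malformed line in FAILFAST mode: {line}") from None
            if mode == "DROPMALFORMED":
                continue
            # PERMISSIVE: keep the row and store the raw text for inspection.
            rows.append({CORRUPT_COLUMN: line})
    return rows
```

PERMISSIVE keeps every input line accounted for, so bad records can be inspected after the job has finished.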
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.format("csv")
.option("mode", "PERMISSIVE")
.load(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.format("csv")
.option("mode", "DROPMALFORMED")
.load(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.format("csv")
.option("mode", "FAILFAST")
.load(corruptRecords)
.show()
java.lang.RuntimeException:
Malformed line in FAILFAST mode:
2015,Chevy,Volt
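For CSV, a record is malformed when its token count does not match the header. The sketch below imitates the three modes on the sample data in plain Python; `read_csv` is a hypothetical helper, and padding short rows with `None` in PERMISSIVE mode is an assumption about the intended behaviour rather than a statement about Spark's internals.

```python
import csv
import io
from typing import List, Optional


def read_csv(text: str, mode: str = "PERMISSIVE") -> List[List[Optional[str]]]:
    """Parse CSV text with a header row, handling short or long rows by mode."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows: List[List[Optional[str]]] = []
    for tokens in reader:
        if not tokens:
            continue  # ignore blank lines
        if len(tokens) == len(header):
            rows.append(list(tokens))
        elif mode == "FAILFAST":
            raise RuntimeError("Malformed line in FAILFAST mode: " + ",".join(tokens))
        elif mode == "PERMISSIVE":
            # Pad short rows with None and cut long rows to the header width.
            padded: List[Optional[str]] = list(tokens)
            padded += [None] * (len(header) - len(tokens))
            rows.append(padded[: len(header)])
        # DROPMALFORMED: skip the row
    return rows


sample = (
    "year,make,model,comment,blank\n"
    '"2012","Tesla","S","No comment",\n'
    '1997,Ford,E350,"Go get one now they",\n'
    "2015,Chevy,Volt\n"
)
```

On `sample`, only the last line is malformed: the first two data rows end with a trailing comma and therefore have five tokens.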
Spark Performance: Project Tungsten
Substantially improve the memory and CPU efficiency of Spark backend
execution and push performance closer to the limits of modern hardware.
[Figure: stack diagram with labels Python, SQL, R, Streaming, Advanced Analytics, DataFrame, Tungsten Execution]
Phase 1: Foundation
- Memory Management
- Code Generation
- Cache-aware Algorithms
Phase 2: Order-of-magnitude Faster
- Whole-stage Codegen
- Vectorization
Spark Performance: Project Tungsten
SparkSQL: A Compiler from Queries to RDDs (Developer Track at 5:40pm)
Operator Benchmarks: Cost/Row (ns)
[Figure: per-operator cost per row in ns. Annotations: 5-30x speedups; radix sort 10-100x speedups; shuffling still the bottleneck; 10x speedup]
TPC-DS (Scale Factor 1500, 100 cores)
[Figure: query time by query #, Spark 2.0 vs. Spark 1.6; lower is better]
Apache Spark 2.2 and 2.3
Massive focus on building ETL-friendly pipelines
New Features in Spark 2.2 and 2.3
1. Better Functionality:
- Improved JSON and CSV Support
2. Better Usability:
- Better Error Messages
3. Better Performance:
- SQL Execution
- Python UDF Processing
Functionality: Better JSON Support
1. [SPARK-18352] Multi-line JSON Support
- Spark currently reads JSON one record per line
- Multi-line JSON documents currently require custom ETL
spark.read
.option("wholeFile", true)  // released in Spark 2.2 as option("multiLine", true)
.json(path)
Availability: Spark 2.2
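Why line-at-a-time reading breaks on pretty-printed JSON can be seen in a few lines of plain Python. This is a sketch of the problem, not of Spark's reader; `parse_per_line` and `parse_whole` are hypothetical names.

```python
import json
from typing import List, Tuple

# A single JSON document spread over several lines.
pretty = '{\n  "a": 1,\n  "b": 2\n}'


def parse_per_line(text: str) -> Tuple[List[object], int]:
    """Treat each line as its own record; return parsed records and failure count."""
    good: List[object] = []
    bad = 0
    for line in text.splitlines():
        try:
            good.append(json.loads(line))
        except ValueError:
            bad += 1
    return good, bad


def parse_whole(text: str) -> object:
    """Treat the whole text as one JSON document."""
    return json.loads(text)
```

Read per line, every fragment of `pretty` is rejected as malformed, while reading the whole text yields the intended record.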
Functionality: Better JSON Support
2. [SPARK-19480] Higher order functions in SQL
- Enable users to manipulate nested data in Spark
- Operations include map, filter, reduce on arrays/maps
tbl_x
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
Functionality: Better JSON Support
2. [SPARK-19480] Higher order functions in SQL
SELECT key, TRANSFORM(values, v -> v + key)
FROM tbl_x
Availability: Spark 2.3+
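What the TRANSFORM query computes can be spelled out in plain Python: the lambda is applied to every element of the array, with the row's key in scope. The sample rows and the helper `transform_values` are assumptions for illustration; only the column names come from the schema above.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def transform_values(rows: List[Row], fn: Callable[[int, int], int]) -> List[Row]:
    """Apply fn(element, key) to every element of each row's values array."""
    return [
        {"key": row["key"], "values": [fn(v, row["key"]) for v in row["values"]]}
        for row in rows
    ]


# Hypothetical contents of tbl_x.
tbl_x = [
    {"key": 1, "values": [10, 20]},
    {"key": 2, "values": []},
]

# Equivalent of: SELECT key, TRANSFORM(values, v -> v + key) FROM tbl_x
result = transform_values(tbl_x, lambda v, key: v + key)
```

The array stays nested throughout, so there is no need to explode it into rows and re-aggregate.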
Functionality: Better CSV Support
1. [SPARK-16099] Improved/Performant CSV Datasource
- Multiline CSV Support
- Additional options for CSV Parsing
- Whole text reader for dataframes
Availability: Spark 2.2
Functionality: Better ETL Support
1. More Fine-grained (record-level) tolerance to errors
- Provide users with controls on how to handle these errors
- Ignore and report errors post-hoc
- Ignore bad rows up to a certain number or percentage
Availability: Spark 2.2
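One way to picture record-level tolerance with a cap on bad rows is the sketch below. It is plain Python under stated assumptions: `convert_with_tolerance`, `TooManyBadRecords` and the `max_bad_fraction` parameter are hypothetical names, not a Spark API.

```python
from typing import Callable, Iterable, List, Tuple


class TooManyBadRecords(Exception):
    """Raised when the share of failed records exceeds the allowed fraction."""


def convert_with_tolerance(
    records: Iterable[str],
    convert: Callable[[str], object],
    max_bad_fraction: float,
) -> Tuple[List[object], List[Tuple[int, str, str]]]:
    """Convert records, collecting failures instead of stopping at the first one."""
    good: List[object] = []
    bad: List[Tuple[int, str, str]] = []  # (position, raw record, error message)
    total = 0
    for index, record in enumerate(records):
        total += 1
        try:
            good.append(convert(record))
        except (ValueError, TypeError) as err:
            bad.append((index, record, str(err)))
    if total and len(bad) / total > max_bad_fraction:
        raise TooManyBadRecords(f"{len(bad)} of {total} records failed")
    return good, bad
```

The returned `bad` list is what makes post-hoc reporting possible: the job succeeds, and the rejected rows remain available for review.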
Usability: Better Error Messages
1. Spark must explain why data is bad
2. This is especially true for data conversion
- E.g., scala.MatchError: start (of class java.lang.String)
3. Which row in your source data could not be converted?
4. Which column could not be converted?
Availability: Spark 2.2 and 2.3
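An error that names the row and column might look like the following sketch. It is plain Python, not Spark's error handling; `ConversionError` and `convert_rows` are hypothetical, and the schema is modelled as a mapping from column name to a converter function.

```python
from typing import Callable, Dict, List


class ConversionError(ValueError):
    """Conversion failure that records where in the source data it happened."""

    def __init__(self, row_number: int, column: str, value: object, target: str):
        super().__init__(
            f"row {row_number}, column '{column}': cannot convert {value!r} to {target}"
        )
        self.row_number = row_number
        self.column = column
        self.value = value


def convert_rows(
    rows: List[Dict[str, str]], schema: Dict[str, Callable[[str], object]]
) -> List[Dict[str, object]]:
    """Convert each column with its converter, reporting row and column on failure."""
    converted_rows: List[Dict[str, object]] = []
    for row_number, row in enumerate(rows, start=1):
        converted: Dict[str, object] = {}
        for column, caster in schema.items():
            try:
                converted[column] = caster(row[column])
            except (ValueError, TypeError) as err:
                raise ConversionError(row_number, column, row[column], caster.__name__) from err
        converted_rows.append(converted)
    return converted_rows
```

Compared with a bare `MatchError`, the message answers both questions on this slide: which row and which column.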
Performance: SQL Execution
1. [SPARK-16026] Cost Based Optimizer
- Leverage table/column level statistics to optimize joins and aggregates
- Statistics Collection Framework (Spark 2.1)
- Cost Based Optimizer (Spark 2.2)
2. Boosting Spark’s Performance on Many-Core Machines
- In-memory / single-node shuffle
3. Improving the quality of generated code and integrating better with the in-memory column format in Spark
Performance: Python UDFs
1. Python is the most popular language for ETL
2. Python UDFs are often used to express elaborate data
conversions/transformations
3. Any improvement to Python UDF processing will ultimately improve ETL
4. Next talk: Improving Python and Spark Performance and
Interoperability (Wes McKinney)
Availability: Spark 2.3+
Recap
1. What’s an ETL Pipeline?
- How is it different from a regular query execution pipeline?
2. Using SparkSQL for ETL
- Dealing with Dirty Data (Bad Records or Files)
- Performance (Project Tungsten)
3. New Features in Spark 2.2 and 2.3
- Focus on building ETL-friendly pipelines
Questions?

More Related Content

What's hot

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringInfluxData
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to KibanaVineet .
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...aiuy
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and DeltaDatabricks
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 

What's hot (20)

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 

Viewers also liked

SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkDatabricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingDatabricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Spark Summit
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure ExecutionDatabricks
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Spark Summit
 
Spark Summit Europe 2016 Keynote - Databricks CEO
Spark Summit Europe 2016 Keynote  - Databricks CEO Spark Summit Europe 2016 Keynote  - Databricks CEO
Spark Summit Europe 2016 Keynote - Databricks CEO Databricks
 

Viewers also liked (20)

SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
 
Spark Summit Europe 2016 Keynote - Databricks CEO
Spark Summit Europe 2016 Keynote  - Databricks CEO Spark Summit Europe 2016 Keynote  - Databricks CEO
Spark Summit Europe 2016 Keynote - Databricks CEO
 

Similar to Exceptions are the Norm: Dealing with Bad Actors in ETL

Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksGoDataDriven
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
 
What;s Coming In SPARQL2?
What;s Coming In SPARQL2?What;s Coming In SPARQL2?
What;s Coming In SPARQL2?LeeFeigenbaum
 
Entity Framework
Entity FrameworkEntity Framework
Entity Frameworkvrluckyin
 
Entity framework
Entity frameworkEntity framework
Entity frameworkicubesystem
 
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1
OPP2010 (Brussels) - Programming with XML in PL/SQL - Part 1Marco Gralike
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
IBM Solutions '99 XML and Java: Lessons Learned
IBM Solutions '99 XML and Java: Lessons LearnedIBM Solutions '99 XML and Java: Lessons Learned
IBM Solutions '99 XML and Java: Lessons LearnedTed Leung
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Intro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationIntro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationPhilip Yurchuk
 

Exceptions are the Norm: Dealing with Bad Actors in ETL

  • 1. Exceptions are the Norm Dealing with Bad Actors in ETL Sameer Agarwal Spark Summit | Boston | Feb 9th 2017
  • 2. About Me • Software Engineer at Databricks (Spark Core/SQL) • PhD in Databases (AMPLab, UC Berkeley) • Research on BlinkDB (Approximate Queries in Spark)
  • 3. Overview 1. What’s an ETL Pipeline? - How is it different from a regular query execution pipeline? 2. Using SparkSQL for ETL - Dealing with Dirty Data (Bad Records or Files) - Performance (Project Tungsten) 3. New Features in Spark 2.2 and 2.3 - Focus on building ETL-friendly pipelines
  • 4. What is a Data Pipeline? 1. Sequence of transformations on data 2. Source data is typically semi-structured/unstructured (JSON, CSV etc.) 3. Output data is structured and ready for use by analysts and data scientists 4. Source and destination are often on different storage systems.
  • 5. Example of a Data Pipeline Aggregate Reporting Applications ML Model Ad-hoc Queries Kafka Database Cloud Warehouse Logs
  • 6. ETL is the First Step in a Data Pipeline 1. ETL stands for EXTRACT, TRANSFORM and LOAD 2. Goal is to “clean” or “curate” the data - Retrieve data from source (EXTRACT) - Transform data into a consumable format (TRANSFORM) - Transmit data to downstream consumers (LOAD)
  • 7. An ETL Query in Spark spark.read.csv("/source/path") EXTRACT
  • 8. An ETL Query in Spark spark.read.csv("/source/path") .filter(...) .agg(...) EXTRACT TRANSFORM
  • 9. An ETL Query in Spark spark.read.csv("/source/path") .filter(...) .agg(...) .write.mode("append") .parquet("/output/path") EXTRACT TRANSFORM LOAD
  • 10. What’s so hard about ETL Queries?
  • 11. Why is ETL Hard? 1. Data can be Messy - Incomplete information - Missing data stored as empty strings, “none”, “missing”, “xxx” etc. 2. Data can be Inconsistent - Data conversion and type validation are error-prone in many cases - E.g., expecting a number but finding “123 000” - Different date formats, e.g. “31/12/2017” vs. “12/31/2017” - Incorrect information - E.g., expecting 5 fields in a CSV record but not finding 5
  • 12. Why is ETL Hard? 3. Data can be Constantly Arriving - At-least-once or exactly-once semantics - Fault tolerance - Scalability 4. Data can be Complex - E.g., nested JSON data to extract and flatten - Dealing with inconsistency is even worse
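The messy and inconsistent values listed above can be illustrated with a small plain-Python sketch. It is not Spark code; the helper names (`clean_missing`, `parse_int`, `parse_date`) and the set of placeholder tokens are assumptions made for illustration, using only the examples the slides mention.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical set of placeholder tokens that stand in for missing data.
MISSING_TOKENS = {"", "none", "missing", "xxx"}


def clean_missing(raw: str) -> Optional[str]:
    """Map placeholder tokens such as "none" or "xxx" to None."""
    value = raw.strip()
    return None if value.lower() in MISSING_TOKENS else value


def parse_int(raw: str) -> Optional[int]:
    """Parse an integer, tolerating spaces used as thousands separators."""
    value = clean_missing(raw)
    if value is None:
        return None
    try:
        return int(value.replace(" ", ""))
    except ValueError:
        return None  # not a number at all


def parse_date(raw: str, formats=("%d/%m/%Y", "%m/%d/%Y")) -> Optional[date]:
    """Try each known format in order; the order decides ambiguous dates."""
    value = clean_missing(raw)
    if value is None:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None
```

Note that a date such as "01/02/2017" is valid in both formats, so the order of `formats` silently decides its meaning. That ambiguity is exactly the kind of inconsistency the slide warns about.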
  • 13. This is why ETL is important Consumers of this data don’t want to deal with this messiness and complexity
  • 14. On the flip side 1. A few bad records can fail a job • These are not the same as transient errors • No recourse for recovery 2. Support for ETL features • File formats and conversions have gaps • E.g., multi-line support, date conversions 3. Performance
  • 15. Spark's flexible APIs, support for a wide variety of data sources, and state-of-the-art Tungsten execution engine make it a great framework for building end-to-end ETL pipelines
  • 17. Dealing with Bad Data: Skip Corrupt Files spark.read.csv("/source/path") .filter(...) .agg(...) .write.mode("append") .parquet("/output/path")
  • 18. Dealing with Bad Data: Skip Corrupt Files spark.read.csv("/source/path") .filter(...) .agg(...) .write.mode("append") .parquet("/output/path") Missing or Corrupt File
  • 19. Dealing with Bad Data: Skip Corrupt Files spark.sql.files.ignoreCorruptFiles = true Missing or Corrupt File [SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt or non-existent files. The contents that have been read will still be returned.
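The idea of skipping unreadable files can be sketched in plain Python, without Spark. The function name `read_gzip_lines`, the `ignore_corrupt_files` flag, and the file names in `demo` are hypothetical. Unlike the Spark setting described above, this sketch discards a corrupt file entirely instead of keeping the part that was read before the error.

```python
import gzip
import os
import tempfile
import zlib


def read_gzip_lines(paths, ignore_corrupt_files=True):
    """Read text lines from gzip files, optionally skipping unreadable ones."""
    lines, skipped = [], []
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8") as handle:
                file_lines = handle.read().splitlines()
        except (OSError, EOFError, zlib.error):
            # Missing file, bad gzip header, or truncated/corrupt stream.
            if not ignore_corrupt_files:
                raise
            skipped.append(path)
            continue
        lines.extend(file_lines)
    return lines, skipped


def demo():
    """Build one good, one corrupt, and one missing file, then read them all."""
    with tempfile.TemporaryDirectory() as folder:
        good = os.path.join(folder, "good.csv.gz")
        bad = os.path.join(folder, "bad.csv.gz")
        missing = os.path.join(folder, "missing.csv.gz")
        with gzip.open(good, "wt", encoding="utf-8") as handle:
            handle.write("a,1\nb,2\n")
        with open(bad, "wb") as handle:
            handle.write(b"this is not gzip data")
        lines, skipped = read_gzip_lines([good, bad, missing])
        return lines, [os.path.basename(path) for path in skipped]
```

Returning the skipped paths alongside the data keeps the failures visible, so a job that tolerates bad files can still report them afterwards.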
  • 20. Dealing with Bad Data: Skip Corrupt Records Missing or Corrupt Records
  • 21. Dealing with Bad Data: Skip Corrupt Records Missing or Corrupt Records [SPARK-12833][SPARK-13764] Text-based file formats (JSON and CSV) support 3 different ParseModes while reading data: 1. PERMISSIVE 2. DROPMALFORMED 3. FAILFAST
  • 22. JSON: Dealing with Corrupt Records {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "PERMISSIVE") .json(corruptRecords) .show() The corrupt-record column name can be configured via spark.sql.columnNameOfCorruptRecord
  • 23. JSON: Dealing with Corrupt Records {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "DROPMALFORMED") .json(corruptRecords) .show()
  • 24. JSON: Dealing with Corrupt Records {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "FAILFAST") .json(corruptRecords) .show() org.apache.spark.sql.catalyst.json .SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {"a":{, b:3}
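The three parse modes can be emulated in plain Python to make their differences concrete. This is a sketch of the semantics, not Spark's implementation: `parse_json_lines` is a hypothetical helper, the input is the three records from the slides, and `_corrupt_record` is used as the column name because it is Spark's default.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"  # Spark's default corrupt-record column name

# The three records shown on the slides; the second one is malformed.
CORRUPT_RECORDS = [
    '{"a":1, "b":2, "c":3}',
    '{"a":{, b:3}',
    '{"a":5, "b":6, "c":7}',
]


def _parse_object(line):
    """Return the parsed JSON object, or None if the line is not a valid object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def parse_json_lines(lines, columns, mode="PERMISSIVE"):
    """Emulate PERMISSIVE, DROPMALFORMED and FAILFAST on line-delimited JSON."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"Unknown mode: {mode}")
    rows = []
    for line in lines:
        record = _parse_object(line)
        if record is None:
            if mode == "FAILFAST":
                raise ValueError(f"Malformed line in FAILFAST mode: {line}")
            if mode == "DROPMALFORMED":
                continue  # silently drop the bad record
            # PERMISSIVE: null out the data columns and keep the raw text.
            row = {name: None for name in columns}
            row[CORRUPT_COLUMN] = line
        else:
            row = {name: record.get(name) for name in columns}
            if mode == "PERMISSIVE":
                row[CORRUPT_COLUMN] = None
        rows.append(row)
    return rows
```

PERMISSIVE keeps every input line and preserves the raw text for later inspection, DROPMALFORMED trades completeness for a clean result, and FAILFAST stops at the first problem.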
  • 25. CSV: Dealing with Corrupt Records year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt spark.read .format("csv") .option("mode", "PERMISSIVE") .load(corruptRecords) .show()
  • 26. CSV: Dealing with Corrupt Records year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt spark.read .format("csv") .option("mode", "DROPMALFORMED") .load(corruptRecords) .show()
  • 27. CSV: Dealing with Corrupt Records year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt spark.read .format("csv") .option("mode", "FAILFAST") .load(corruptRecords) .show() java.lang.RuntimeException: Malformed line in FAILFAST mode: 2015,Chevy,Volt
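For CSV, a record counts as malformed here when its field count does not match the header. The plain-Python sketch below emulates the three modes on the slide's sample data. `parse_csv` is a hypothetical helper, and padding short rows with `None` (and truncating long ones) in PERMISSIVE mode is an assumption of this sketch rather than a statement about Spark's exact behaviour. Empty fields are left as empty strings.

```python
import csv
import io

# Sample data from the slides: the last row has 3 fields instead of 5.
CSV_TEXT = (
    "year,make,model,comment,blank\n"
    '"2012","Tesla","S","No comment",\n'
    '1997,Ford,E350,"Go get one now they",\n'
    "2015,Chevy,Volt\n"
)


def parse_csv(text, mode="PERMISSIVE"):
    """Emulate PERMISSIVE, DROPMALFORMED and FAILFAST for CSV field-count errors."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"Unknown mode: {mode}")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(header):
            if mode == "FAILFAST":
                raise RuntimeError(
                    "Malformed line in FAILFAST mode: " + ",".join(fields)
                )
            if mode == "DROPMALFORMED":
                continue
            # PERMISSIVE: pad short rows with None and truncate long ones.
            fields = (fields + [None] * len(header))[: len(header)]
        rows.append(dict(zip(header, fields)))
    return rows
```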
  • 28. Spark Performance: Project Tungsten Substantially improve the memory and CPU efficiency of Spark backend execution and push performance closer to the limits of modern hardware. Tungsten Execution Python SQL R Streaming DataFrame Advanced Analytics
  • 29. Phase 1 Foundation Memory Management Code Generation Cache-aware Algorithms Phase 2 Order-of-magnitude Faster Whole-stage Codegen Vectorization Spark Performance: Project Tungsten SparkSQL: A Compiler from Queries to RDDs (Developer Track at 5:40pm)
  • 30. Operator Benchmarks: Cost/Row (ns) 5-30x Speedups
  • 31. Operator Benchmarks: Cost/Row (ns) Radix Sort 10-100x Speedups
  • 32. Operator Benchmarks: Cost/Row (ns) Shuffling still the bottleneck
  • 33. Operator Benchmarks: Cost/Row (ns) 10x Speedup
  • 34. TPC-DS (Scale Factor 1500, 100 cores) [Chart: query time by query number, Spark 2.0 vs. Spark 1.6; lower is better]
  • 35. Apache Spark 2.2 and 2.3 Massive focus on building ETL-friendly pipelines
  • 36. New Features in Spark 2.2 and 2.3 1. Better Functionality: - Improved JSON and CSV Support 2. Better Usability: - Better Error Messages 3. Better Performance: - SQL Execution - Python UDF Processing
  • 37. Functionality: Better JSON Support 1. [SPARK-18352] Multi-line JSON Support - Spark currently reads JSON one line at a time - This currently requires custom ETL spark.read .option("wholeFile",true) .json(path) Availability: Spark 2.2
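Why line-at-a-time reading breaks on pretty-printed JSON can be shown with a short plain-Python sketch. The helper names and the sample document are hypothetical. As a side note to verify against the documentation for your Spark version: the `wholeFile` option shown on the slide appears to have shipped under the name `multiLine` in released versions.

```python
import json

# A single JSON document spread over several lines (hypothetical sample).
PRETTY_PRINTED = """{
  "a": 1,
  "b": 2
}"""


def parse_line_by_line(text):
    """Line-delimited parsing: every line must be a complete JSON document."""
    parsed, failed = [], []
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed.append(line)
    return parsed, failed


def parse_whole_file(text):
    """Whole-file parsing: the entire text is treated as one JSON document."""
    return json.loads(text)
```

With line-delimited parsing every one of the four lines fails on its own, while whole-file parsing recovers the single record.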
  • 38. Functionality: Better JSON Support 2. [SPARK-19480] Higher order functions in SQL - Enable users to manipulate nested data in Spark - Operations include map, filter, reduce on arrays/maps tbl_x |-- key: long (nullable = false) |-- values: array (nullable = false) | |-- element: long (containsNull = false)
  • 39. Functionality: Better JSON Support 2. [SPARK-19480] Higher order functions in SQL Availability: Spark 2.3+ tbl_x |-- key: long (nullable = false) |-- values: array (nullable = false) | |-- element: long (containsNull = false) SELECT key, TRANSFORM(values, v -> v + key) FROM tbl_x
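The `TRANSFORM(values, v -> v + key)` query applies a lambda to every element of an array column, with the row's `key` in scope. A plain-Python analogue, using hypothetical rows for `tbl_x`, looks roughly like this:

```python
def transform(values, fn):
    """Apply fn to every element of an array, like SQL TRANSFORM."""
    return [fn(value) for value in values]


def select_key_and_transformed(rows):
    """Analogue of: SELECT key, TRANSFORM(values, v -> v + key) FROM tbl_x."""
    return [
        # Bind key as a default argument so each lambda sees its own row's key.
        (row["key"], transform(row["values"], lambda v, key=row["key"]: v + key))
        for row in rows
    ]


# Hypothetical contents for tbl_x, matching the schema on the slide.
TBL_X = [
    {"key": 1, "values": [1, 2, 3]},
    {"key": 10, "values": [4, 5]},
]
```

The point of the higher-order function is that the nested array is manipulated in place, without first exploding it into one row per element and re-aggregating.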
  • 40. Functionality: Better CSV Support 1. [SPARK-16099] Improved/Performant CSV Datasource - Multiline CSV Support - Additional options for CSV Parsing - Whole text reader for dataframes Availability: Spark 2.2
  • 41. Functionality: Better ETL Support 1. More Fine-grained (record-level) tolerance to errors - Provide users with controls on how to handle these errors - Ignore and report errors post-hoc - Ignore bad rows up to a certain number or percentage Availability: Spark 2.2
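Record-level tolerance with a cap on bad rows can be sketched as follows. This is a plain-Python illustration of the idea, not a Spark API: `convert_with_tolerance`, `TooManyBadRecords`, and both threshold parameters are hypothetical names.

```python
class TooManyBadRecords(Exception):
    """Raised when bad records exceed the configured tolerance."""


def convert_with_tolerance(records, convert, max_bad_records=None,
                           max_bad_fraction=None):
    """Convert records, collecting failures until a threshold is exceeded.

    Returns (good, bad), where bad holds (index, record, error message)
    tuples so errors can be reported after the run.
    """
    good, bad = [], []
    for index, record in enumerate(records):
        try:
            good.append(convert(record))
        except (ValueError, TypeError) as error:
            bad.append((index, record, str(error)))
            # The absolute cap is checked eagerly so the job stops early.
            if max_bad_records is not None and len(bad) > max_bad_records:
                raise TooManyBadRecords(
                    f"{len(bad)} bad records exceed the limit of {max_bad_records}"
                )
    total = len(good) + len(bad)
    # The percentage cap needs the full count, so it is checked at the end.
    if max_bad_fraction is not None and total and len(bad) / total > max_bad_fraction:
        raise TooManyBadRecords(
            f"{len(bad)}/{total} bad records exceed the fraction {max_bad_fraction}"
        )
    return good, bad
```

For example, `convert_with_tolerance(["1", "x", "3"], int, max_bad_records=1)` keeps the two good values and reports the bad one, while `max_bad_records=0` makes the same input fail.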
  • 42. Usability: Better Error Messages 1. Spark must explain why data is bad 2. This is especially true for data conversion - scala.MatchError: start (of class java.lang.String) 3. Which row in your source data could not be converted? 4. Which column could not be converted? Availability: Spark 2.2 and 2.3
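An error message that answers "which row?" and "which column?" might look like the following plain-Python sketch. `ConversionError` and `convert_rows` are hypothetical names, and the message format is an illustration, not what Spark prints.

```python
class ConversionError(ValueError):
    """Conversion failure that records where in the source data it happened."""

    def __init__(self, row_number, column, value, target):
        super().__init__(
            f"row {row_number}, column {column!r}: "
            f"cannot convert {value!r} to {target}"
        )
        self.row_number = row_number
        self.column = column
        self.value = value


def convert_rows(rows, schema):
    """Cast each column with its schema function, reporting row and column on failure."""
    converted = []
    for row_number, row in enumerate(rows, start=1):
        out = {}
        for column, caster in schema.items():
            try:
                out[column] = caster(row[column])
            except (ValueError, TypeError, KeyError):
                raise ConversionError(
                    row_number, column, row.get(column), caster.__name__
                ) from None
        converted.append(out)
    return converted
```

Compared with a bare `scala.MatchError: start`, a message such as `row 2, column 'year': cannot convert 'start' to int` tells the user where to look in the source data.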
  • 43. Performance: SQL Execution 1. [SPARK-16026] Cost Based Optimizer - Leverage table/column level statistics to optimize joins and aggregates - Statistics Collection Framework (Spark 2.1) - Cost Based Optimizer (Spark 2.2) 2. Boosting Spark’s Performance on Many-Core Machines - In-memory/single-node shuffle 3. Improving quality of generated code and better integration with the in-memory column format in Spark
  • 44. Performance: Python UDFs 1. Python is the most popular language for ETL 2. Python UDFs are often used to express elaborate data conversions/transformations 3. Any improvements to Python UDF processing will ultimately improve ETL. 4. Next talk: Improving Python and Spark Performance and Interoperability (Wes McKinney) Availability: Spark 2.3+
  • 45. Recap 1. What’s an ETL Pipeline? - How is it different from a regular query execution pipeline? 2. Using SparkSQL for ETL - Dealing with Dirty Data (Bad Records or Files) - Performance (Project Tungsten) 3. New Features in Spark 2.2 and 2.3 - Focus on building ETL-friendly pipelines