SlideShare a Scribd company logo
Scalable and Incremental
Data Profiling with Spark
Amelia Arbisser
Adam Silberstein
Trifacta: Self-service data preparation
What data analysts hope to achieve in data projects
Analysis & Consumption
MACHINE LEARNING VISUALIZATION STATISTICS
Trifacta: Self-service data preparation
What data analysts hope to achieve in data projects
80% of time spent cleaning and preparing the data
to be analyzed
Analysis & Consumption
MACHINE LEARNING VISUALIZATION STATISTICS
Trifacta: Self-service data preparation
We speed up and take the pain out of preparation
Transforming Data in Trifacta
Transforming Data in Trifacta
Predictive Interaction and Immediate Feedback
Predictive Interaction
Immediate Feedback
Transforming Data in Trifacta
Transforming Data in Trifacta
Where Profiling Fits In
We validate preparation via profiling
Spark Profiling at Trifacta
➔ Profiling results of transformation at scale
➔ Validation through profiles
➔ Challenges
➔Scale
➔Automatic job generation
➔ Our solution
➔Spark profiling job server
➔JSON spec
➔Pay-as-you-go
The Case for Profiling
Transforming Data in Trifacta
Job Results
➔ Even clean raw data is not
informative.
➔Especially when it is too
large to inspect manually.
➔Need a summary representation.
➔Statistical and visual
➔ Generate profile
programmatically.
Visual Result Validation
➔ Similar to the profiling we saw above
the data grid.
➔ Applied at scale, to the full result.
➔ Reveals mismatched, missing
values.
Visual Result Validation
➔ Similar to the profiling we saw above
the data grid.
➔ Applied at scale, to the full result.
➔ Reveals mismatched, missing
values.
Visual Result Validation
Challenges
The Goal: Profiling for Every Job
➔ We’ve talked about the value of profiling, but...
➔Profiling is not a tablestakes feature, at least not on day 1.
➔Don’t want our users to disable it!
➔Profiling is potentially more expensive than transformation.
Profiling at Scale
➔ Large data volume
➔Long running times
➔Memory constraints
Performance vs. Accuracy
➔Approximation brings performance gains while still
meeting profiler summarization requirements.
➔Off-the-shelf libraries (count-min-sketch, T-digest) great
for approximating counts and non-distributive stats.
➔Not a silver bullet though...sometimes approximations are
confusing.
Flexible Job Generation
➔Profile for all users, all use cases: not
a one-off
➔Diverse schemas
➔Any number of columns
➔ Calculate statistics for all data types
➔Numeric vs. categorical
➔Container types (maps, arrays,...)
➔User-defined
Flexible Job Generation
➔Profile for all users, all use cases: not
a one-off
➔Diverse schemas
➔Any number of columns
➔ Calculate statistics for all data types
➔Numeric vs. categorical
➔Container types (maps, arrays,...)
➔User-defined
Transformation vs. Profiling
➔ Transformation is primarily row-based.
➔ Profiling is column-based.
➔ Different execution concerns.
Solution: Spark in the
Trifacta System
Why Spark
➔ Effective at OLAP
➔ Flexible enough for UDFs
➔ Not tied to distributions
➔ Mass adoption
➔ Low latency
Spark Profile Jobs Server
UI
Webapp
Batch
Server
Profiler
Service
Client
Server
Cluster
HDFS
➔ We built a Spark job server
➔ Automatically profiles output of
transform jobs
➔ Outputs profiling results to
HDFS
➔ Renders graphs in Trifacta
product Batch
ServerBatSBatS
Workers
Profiler Service ➔ Which columns to profile and
their types
➔ Uses Trifacta type system, eg:
➔Dates
➔Geography
➔User-defined
➔Requested profiling statistics
➔Histograms
➔Outliers
➔Empty, invalid, valid
➔ Extensible
➔Pairwise statistics
➔User-defined functions
Profiler
Service
JSON
Specification
Wrangled
Data
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
Performance Improvements
➔ A few troublesome use cases:
➔High cardinality
➔Non-distributive stats
➔ Huge gains for large numbers of
columns
Few
Columns
Many Columns
High
Cardinality
2X 4X
Low
Cardinality
10X 20X
Spark profiling speed-up vs. MapReduce
Pay-as-you-go Profiling
➔ Users do not always need
profiles for all columns
➔ More complex and
expensive statistics
➔ Drilldown
➔ Iterative exploration
Conclusion
Profiling informs data preparation.
Questions?

More Related Content

What's hot

Audio data preprocessing and data loading using torchaudio
Audio data preprocessing and data loading using torchaudioAudio data preprocessing and data loading using torchaudio
Audio data preprocessing and data loading using torchaudio
SeungHeon Doh
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Seeling Cheung
 
Elastic Stack ELK, Beats, and Cloud
Elastic Stack ELK, Beats, and CloudElastic Stack ELK, Beats, and Cloud
Elastic Stack ELK, Beats, and Cloud
Joe Ryan
 
Managing Data Science Projects
Managing Data Science ProjectsManaging Data Science Projects
Managing Data Science Projects
Danielle Dean
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK StackCentralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
Rohit Sharma
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
Sivashankar Ganapathy
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
Benjamin Bengfort
 
Active Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationActive Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with Alation
Databricks
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
Databricks
 
DataHub
DataHubDataHub
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Collaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeCollaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of Life
Chris Mungall
 
Teradata a z
Teradata a zTeradata a z
Teradata a z
Dhanasekar T
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
Lars Albertsson
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
Tyler Wishnoff
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
Amazon Web Services
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for CybersecurityEmpower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Databricks
 
CLE Webinar: eSignature, an overview of legal validity and case law
CLE Webinar: eSignature, an overview of legal validity and case lawCLE Webinar: eSignature, an overview of legal validity and case law
CLE Webinar: eSignature, an overview of legal validity and case law
DocuSign
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Dr. Arif Wider
 

What's hot (20)

Audio data preprocessing and data loading using torchaudio
Audio data preprocessing and data loading using torchaudioAudio data preprocessing and data loading using torchaudio
Audio data preprocessing and data loading using torchaudio
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
 
Elastic Stack ELK, Beats, and Cloud
Elastic Stack ELK, Beats, and CloudElastic Stack ELK, Beats, and Cloud
Elastic Stack ELK, Beats, and Cloud
 
Managing Data Science Projects
Managing Data Science ProjectsManaging Data Science Projects
Managing Data Science Projects
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK StackCentralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
Active Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationActive Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with Alation
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
DataHub
DataHubDataHub
DataHub
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Collaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeCollaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of Life
 
Teradata a z
Teradata a zTeradata a z
Teradata a z
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for CybersecurityEmpower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
 
CLE Webinar: eSignature, an overview of legal validity and case law
CLE Webinar: eSignature, an overview of legal validity and case lawCLE Webinar: eSignature, an overview of legal validity and case law
CLE Webinar: eSignature, an overview of legal validity and case law
 
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes...
 

Similar to Scalable And Incremental Data Profiling With Spark

Building Audience Analytics Platform
Building Audience Analytics PlatformBuilding Audience Analytics Platform
Building Audience Analytics Platform
InMobi Technology
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
Noriaki Tatsumi
 
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4jScalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Neo4j
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
WSO2
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
Modern Data Stack France
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
Gerald Muecke
 
Honey I Shrunk the Database
Honey I Shrunk the DatabaseHoney I Shrunk the Database
Honey I Shrunk the DatabaseVanessa Hurst
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
Anirudh
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
marcja
 
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Cathrine Wilhelmsen
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
Prateek Maheshwari
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
ScyllaDB
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
WSO2
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
Tom Kerkhove
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB
 

Similar to Scalable And Incremental Data Profiling With Spark (20)

Building Audience Analytics Platform
Building Audience Analytics PlatformBuilding Audience Analytics Platform
Building Audience Analytics Platform
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
 
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4jScalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 
Honey I Shrunk the Database
Honey I Shrunk the DatabaseHoney I Shrunk the Database
Honey I Shrunk the Database
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
 
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 

More from Jen Aman (20)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 

Recently uploaded

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 

Recently uploaded (20)

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 

Scalable And Incremental Data Profiling With Spark

  • 1. Scalable and Incremental Data Profiling with Spark Amelia Arbisser Adam Silberstein
  • 2. Trifacta: Self-service data preparation What data analysts hope to achieve in data projects Analysis & Consumption MACHINE LEARNING VISUALIZATION STATISTICS
  • 3. Trifacta: Self-service data preparation What data analysts hope to achieve in data projects 80% of time spent cleaning and preparing the data to be analyzed Analysis & Consumption MACHINE LEARNING VISUALIZATION STATISTICS
  • 4. Trifacta: Self-service data preparation We speed up and take the pain out of preparation
  • 7. Predictive Interaction and Immediate Feedback
  • 12.
  • 13.
  • 14. Where Profiling Fits In We validate preparation via profiling
  • 15. Spark Profiling at Trifacta ➔ Profiling results of transformation at scale ➔ Validation through profiles ➔ Challenges ➔Scale ➔Automatic job generation ➔ Our solution ➔Spark profiling job server ➔JSON spec ➔Pay-as-you-go
  • 16. The Case for Profiling
  • 18. Job Results ➔ Even clean raw data is not informative. ➔Especially when it is too large to inspect manually. ➔Need a summary representation. ➔Statistical and visual ➔ Generate profile programmatically.
  • 19. Visual Result Validation ➔ Similar to the profiling we saw above the data grid. ➔ Applied at scale, to the full result. ➔ Reveals mismatched, missing values.
  • 20. Visual Result Validation ➔ Similar to the profiling we saw above the data grid. ➔ Applied at scale, to the full result. ➔ Reveals mismatched, missing values.
  • 23. The Goal: Profiling for Every Job ➔ We’ve talked about the value of profiling, but... ➔Profiling is not a tablestakes feature, at least not on day 1. ➔Don’t want our users to disable it! ➔Profiling is potentially more expensive than transformation.
  • 24. Profiling at Scale ➔ Large data volume ➔Long running times ➔Memory constraints
  • 25. Performance vs. Accuracy ➔Approximation brings performance gains while still meeting profiler summarization requirements. ➔Off-the-shelf libraries (count-min-sketch, T-digest) great for approximating counts and non-distributive stats. ➔Not a silver bullet though...sometimes approximations are confusing.
  • 26. Flexible Job Generation ➔Profile for all users, all use cases: not a one-off ➔Diverse schemas ➔Any number of columns ➔ Calculate statistics for all data types ➔Numeric vs. categorical ➔Container types (maps, arrays,...) ➔User-defined
  • 27. Flexible Job Generation ➔Profile for all users, all use cases: not a one-off ➔Diverse schemas ➔Any number of columns ➔ Calculate statistics for all data types ➔Numeric vs. categorical ➔Container types (maps, arrays,...) ➔User-defined
  • 28. Transformation vs. Profiling ➔ Transformation is primarily row-based. ➔ Profiling is column-based. ➔ Different execution concerns.
  • 29. Solution: Spark in the Trifacta System
  • 30. Why Spark ➔ Effective at OLAP ➔ Flexible enough for UDFs ➔ Not tied to distributions ➔ Mass adoption ➔ Low latency
  • 31. Spark Profile Jobs Server UI Webapp Batch Server Profiler Service Client Server Cluster HDFS ➔ We built a Spark job server ➔ Automatically profiles output of transform jobs ➔ Outputs profiling results to HDFS ➔ Renders graphs in Trifacta product Batch ServerBatSBatS Workers
  • 32. Profiler Service ➔ Which columns to profile and their types ➔ Uses Trifacta type system, eg: ➔Dates ➔Geography ➔User-defined ➔Requested profiling statistics ➔Histograms ➔Outliers ➔Empty, invalid, valid ➔ Extensible ➔Pairwise statistics ➔User-defined functions Profiler Service JSON Specification Wrangled Data
  • 33. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 34. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 35. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 36. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 37. Performance Improvements ➔ A few troublesome use cases: ➔High cardinality ➔Non-distributive stats ➔ Huge gains for large numbers of columns Few Columns Many Columns High Cardinality 2X 4X Low Cardinality 10X 20X Spark profiling speed-up vs. MapReduce
  • 38. Pay-as-you-go Profiling ➔ Users do not always need profiles for all columns ➔ More complex and expensive statistics ➔ Drilldown ➔ Iterative exploration