Brokering Data: Accelerating Data Evaluation with Databricks White Label
Vinoo Ganesh, Chief Technology Officer, Veraset
Nick Chmura, Director of Data Engineering, Veraset
Agenda
§ Background
§ Session Goals
§ The Data Ecosystem
§ Data Primitives
§ Data Brokers + Challenges
§ The Broker’s Dilemma
§ Demo
Background
• About Us
• Vinoo Ganesh (CTO, Veraset)
• Nick Chmura (Director of Data Engineering, Veraset)
• Data-As-A-Service Startup
• Anonymized Geospatial Data
• Model Training at Scale
• Heavily used during COVID-19 investigations and analyses
• Process, Analyze, and Deliver >2 PB Data Yearly
• Data is Our Product
• We don’t build analytical tools / visualizations
• Optimized data storage, retrieval, and processing are our lifeblood
• “Just Data”
We’re Hiring!
vinoo@veraset.com
Session Goals
• Explain the Data Brokerage ecosystem
• Challenges of brokering data
• Concerns around sensitivity, privacy, and security
• Conceptual: “Agnostic” Brokerage
• Technology agnostic, system agnostic
• Differentiating data
• Demo: White Label
The Data Ecosystem
• Inundation of Data – clean + unclean
• “Data scientists spend about 45% of their time on data preparation tasks…” - Datanami
• “Uptime” SLAs around Data
• Data scale increasing
• Co-Dependence: analytics + brokers
• Data sensitivity and privacy concerns
The oil that powers the engine.
Data Primitives
• Data == API
▪ Schema changes == API breaks
▪ Format evolution == major version bumps
▪ Uptime SLA
• Presentation of data speaks volumes
• Inherently opinionated
• Open vs proprietary data formats
• Technology specific formats
• Formats focused on individual workflows
• Used + expensive data > unused + cheap data
• Data graveyard
• Only as useful as it is easy to use
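The "Data == API" idea can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the talk: the schema representation (a dict of column name to type name), the function names, and the "major.minor" version string are all hypothetical. It also assumes one common convention in which removing or retyping a column is a breaking change while adding a column is not; a stricter reading of the slide would treat every schema change as a break.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: column name -> type name.
Schema = Dict[str, str]


def classify_schema_change(old: Schema, new: Schema) -> Tuple[str, List[str]]:
    """Return ("major" | "minor" | "none", reasons) for a schema change."""
    breaking: List[str] = []
    additive: List[str] = []
    for column, old_type in old.items():
        if column not in new:
            breaking.append(f"removed column: {column}")
        elif new[column] != old_type:
            breaking.append(f"type changed: {column} {old_type} -> {new[column]}")
    for column in new:
        if column not in old:
            additive.append(f"added column: {column}")
    if breaking:
        # Consumers reading the old columns would fail: treat as an API break.
        return "major", breaking + additive
    if additive:
        return "minor", additive
    return "none", []


def bump_version(version: str, level: str) -> str:
    """Bump a hypothetical "major.minor" dataset version string."""
    major, minor = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0"
    if level == "minor":
        return f"{major}.{minor + 1}"
    return version


# Illustrative column names only; not the actual Veraset schema.
v1 = {"device_id": "string", "latitude": "double", "longitude": "double"}
v2 = {**v1, "utc_timestamp": "long"}
v3 = {"device_id": "string", "latitude": "float", "utc_timestamp": "long"}

level, reasons = classify_schema_change(v2, v3)
print(level, reasons)
print(bump_version("1.3", level))
```

A check like this could run before each delivery so that a breaking change is announced as a new major version instead of surprising consumers.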
Enter: The Brokers
• Source, clean, package, and distribute data
• Impose the SLAs (Data Primitives) on data
• Remove the complexities of operationalizing data
• Maintain the data
• Secure and protect the data
Goal: Make data easy to use
https://www.webfx.com/blog/wp-content/uploads/2019/09/worth.png
Challenges (1): Brokering Data
Making the data easy to use… isn’t easy.
• Tech Specifications (i.e., File Format, Partitioning)
• Variable Compute on the customer side
• Right Tools / Environment / Libraries
• TTFB (time to first byte)
• Securing IP (the data)
• Product Differentiation
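As a rough illustration of why partitioning is one of the tech specifications a broker has to get right, the sketch below lays data out in Hive-style `date=YYYY-MM-DD` directories and selects only the partitions a query needs. The bucket name, the `date` partition key, and both function names are assumptions made for this example; the talk does not describe Veraset's actual layout. Reading fewer partitions is one plausible way to reduce time to first byte for a customer with limited compute.

```python
from datetime import date
from typing import Iterable, List


def partition_path(root: str, day: date) -> str:
    """Build a Hive-style partition directory for one day of data."""
    return f"{root}/date={day.isoformat()}"


def prune_partitions(paths: Iterable[str], start: date, end: date) -> List[str]:
    """Keep only partitions whose date falls within [start, end]."""
    selected: List[str] = []
    for path in paths:
        for segment in path.split("/"):
            if segment.startswith("date="):
                day = date.fromisoformat(segment[len("date="):])
                if start <= day <= end:
                    selected.append(path)
                break  # one date segment per path in this layout
    return selected


# Hypothetical bucket and dataset names, for illustration only.
paths = [
    partition_path("s3://example-bucket/visits", date(2021, 5, d))
    for d in range(1, 8)
]
selected = prune_partitions(paths, date(2021, 5, 3), date(2021, 5, 4))
print(selected)
```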
Challenges (2): Brokering Data
Making the data easy to use… isn’t easy.
• Live Data > Static Data, Live Compute > Static Compute
• Auditing Query Access
• Defining Quality Metrics / Creating A Narrative
• Security / Privacy
• Row-level permissions
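Row-level permissions and query auditing can be sketched together in plain Python. This is a toy model under stated assumptions, not the Databricks or Privacera API: the `AuditedTable` class, the per-user predicate policies, and the audit record fields are all hypothetical. Each user is mapped to a predicate that decides which rows they may see, unknown users are denied by default, and every query appends an entry to an audit log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class AuditedTable:
    """Toy table with per-user row filters and an append-only audit log."""

    rows: List[Row]
    policies: Dict[str, Predicate]  # user -> rows that user may see
    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def query(self, user: str, where: Predicate = lambda row: True) -> List[Row]:
        # Deny by default: users without a policy see no rows.
        policy = self.policies.get(user, lambda row: False)
        result = [row for row in self.rows if policy(row) and where(row)]
        self.audit_log.append(
            {
                "user": user,
                "at": datetime.now(timezone.utc).isoformat(),
                "rows_returned": len(result),
            }
        )
        return result


# Illustrative data and users only.
table = AuditedTable(
    rows=[
        {"region": "US", "visits": 120},
        {"region": "CA", "visits": 45},
        {"region": "US", "visits": 80},
    ],
    policies={"trial_user": lambda row: row["region"] == "US"},
)

us_rows = table.query("trial_user")
busy_rows = table.query("trial_user", where=lambda row: row["visits"] > 100)
nothing = table.query("unknown_user")
print(len(us_rows), len(busy_rows), len(nothing), len(table.audit_log))
```

Applying the policy inside `query` means the caller cannot skip the filter, and the audit entry is written whether or not any rows are returned.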
The Broker’s Dilemma
• How can brokers demonstrate value, protect their IP, differentiate their product, minimize technical setup, ensure a cutting-edge privacy/security story, audit their users’ access to data, and everything else we discussed… all while making the data as easy to use as possible?
The Solution: Veraset, Databricks, Privacera
• Veraset: Data
• Databricks: Analysis
• Privacera: Security/Privacy
The Solution – Databricks White Label
• A branded “White Labeled” Databricks instance
• Preloaded notebook on each instance
• Preloaded data on each instance
• Fine-grained access control (including read/write perms)
• Audit Trail
• Row-Level Security
Demo: White Labeled Databricks
Thank You!

More Related Content

What's hot

Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark Streaming
Databricks
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
Databricks
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
Databricks
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...
Robert Sanders
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
Databricks
 
RWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use CaseRWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use Case
Databricks
 
Building A Product Assortment Recommendation Engine
Building A Product Assortment Recommendation EngineBuilding A Product Assortment Recommendation Engine
Building A Product Assortment Recommendation Engine
Databricks
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
Mihai Criveti
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Data engineering
Data engineeringData engineering
Data engineering
Parimala Killada
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
Databricks
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Databricks
 
Improving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT AnalyticsImproving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT Analytics
Databricks
 
Managing R&D Data on Parallel Compute Infrastructure
Managing R&D Data on Parallel Compute InfrastructureManaging R&D Data on Parallel Compute Infrastructure
Managing R&D Data on Parallel Compute Infrastructure
Databricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using ML
Databricks
 
Detecting Anomalous Behavior with Surveillance​ Analytics​
Detecting Anomalous Behavior with Surveillance​ Analytics​Detecting Anomalous Behavior with Surveillance​ Analytics​
Detecting Anomalous Behavior with Surveillance​ Analytics​
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 

What's hot (20)

Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark Streaming
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
RWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use CaseRWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use Case
 
Building A Product Assortment Recommendation Engine
Building A Product Assortment Recommendation EngineBuilding A Product Assortment Recommendation Engine
Building A Product Assortment Recommendation Engine
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Data engineering
Data engineeringData engineering
Data engineering
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
Improving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT AnalyticsImproving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT Analytics
 
Managing R&D Data on Parallel Compute Infrastructure
Managing R&D Data on Parallel Compute InfrastructureManaging R&D Data on Parallel Compute Infrastructure
Managing R&D Data on Parallel Compute Infrastructure
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Advanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using MLAdvanced Model Comparison and Automated Deployment Using ML
Advanced Model Comparison and Automated Deployment Using ML
 
Detecting Anomalous Behavior with Surveillance​ Analytics​
Detecting Anomalous Behavior with Surveillance​ Analytics​Detecting Anomalous Behavior with Surveillance​ Analytics​
Detecting Anomalous Behavior with Surveillance​ Analytics​
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 

Similar to Brokering Data: Accelerating Data Evaluation with Databricks White Label

21.06.2017 - KYOS Breakfast Event
21.06.2017 - KYOS Breakfast Event 21.06.2017 - KYOS Breakfast Event
21.06.2017 - KYOS Breakfast Event
Kyos
 
Company profile
Company profileCompany profile
Company profile
CDS
 
What Does a Full Featured Security Strategy Look Like?
What Does a Full Featured Security Strategy Look Like?What Does a Full Featured Security Strategy Look Like?
What Does a Full Featured Security Strategy Look Like?
Precisely
 
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
MongoDB
 
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Data Con LA
 
Security Architecture Best Practices for SaaS Applications
Security Architecture Best Practices for SaaS ApplicationsSecurity Architecture Best Practices for SaaS Applications
Security Architecture Best Practices for SaaS Applications
Techcello
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the Cloud
Inside Analysis
 
gkkCloudtechnologyassociate(cta)day 2
gkkCloudtechnologyassociate(cta)day 2gkkCloudtechnologyassociate(cta)day 2
gkkCloudtechnologyassociate(cta)day 2
Anne Starr
 
Understanding Zero Trust Security for IBM i
Understanding Zero Trust Security for IBM iUnderstanding Zero Trust Security for IBM i
Understanding Zero Trust Security for IBM i
Precisely
 
Data Services Marketplace
Data Services MarketplaceData Services Marketplace
Data Services Marketplace
Denodo
 
"Data Mesh in Kubernetes", Andrii Syniuk
"Data Mesh in Kubernetes", Andrii Syniuk"Data Mesh in Kubernetes", Andrii Syniuk
"Data Mesh in Kubernetes", Andrii Syniuk
Fwdays
 
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
Precisely
 
Data Leakage Prevention
Data Leakage PreventionData Leakage Prevention
Andy Malone - Microsoft office 365 security deep dive
Andy Malone - Microsoft office 365 security deep diveAndy Malone - Microsoft office 365 security deep dive
Andy Malone - Microsoft office 365 security deep dive
Nordic Infrastructure Conference
 
Brochure quiterian DDWeb
Brochure quiterian DDWebBrochure quiterian DDWeb
Brochure quiterian DDWeb
Josep Arroyo
 
Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control	Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control
DBmaestro - Database DevOps
 
Company Profile CDS Services
Company Profile CDS ServicesCompany Profile CDS Services
Company Profile CDS Services
Vikas Sachdeva
 
DSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
DSS.LV - Principles Of Data Protection - March2015 By Arturs FilatovsDSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
DSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
Andris Soroka
 
KASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
KASHTECH AND DENODO: ROI and Economic Value of Data VirtualizationKASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
KASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
Denodo
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
Denodo
 

Similar to Brokering Data: Accelerating Data Evaluation with Databricks White Label (20)

21.06.2017 - KYOS Breakfast Event
21.06.2017 - KYOS Breakfast Event 21.06.2017 - KYOS Breakfast Event
21.06.2017 - KYOS Breakfast Event
 
Company profile
Company profileCompany profile
Company profile
 
What Does a Full Featured Security Strategy Look Like?
What Does a Full Featured Security Strategy Look Like?What Does a Full Featured Security Strategy Look Like?
What Does a Full Featured Security Strategy Look Like?
 
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
MongoDB .local London 2019: New Encryption Capabilities in MongoDB 4.2: A Dee...
 
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
Data Con LA 2022 - Supercharge your Snowflake Data Cloud from a Snowflake Dat...
 
Security Architecture Best Practices for SaaS Applications
Security Architecture Best Practices for SaaS ApplicationsSecurity Architecture Best Practices for SaaS Applications
Security Architecture Best Practices for SaaS Applications
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the Cloud
 
gkkCloudtechnologyassociate(cta)day 2
gkkCloudtechnologyassociate(cta)day 2gkkCloudtechnologyassociate(cta)day 2
gkkCloudtechnologyassociate(cta)day 2
 
Understanding Zero Trust Security for IBM i
Understanding Zero Trust Security for IBM iUnderstanding Zero Trust Security for IBM i
Understanding Zero Trust Security for IBM i
 
Data Services Marketplace
Data Services MarketplaceData Services Marketplace
Data Services Marketplace
 
"Data Mesh in Kubernetes", Andrii Syniuk
"Data Mesh in Kubernetes", Andrii Syniuk"Data Mesh in Kubernetes", Andrii Syniuk
"Data Mesh in Kubernetes", Andrii Syniuk
 
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
Protect Sensitive Data on Your IBM i (Social Distance Your IBM i/AS400)
 
Data Leakage Prevention
Data Leakage PreventionData Leakage Prevention
Data Leakage Prevention
 
Andy Malone - Microsoft office 365 security deep dive
Andy Malone - Microsoft office 365 security deep diveAndy Malone - Microsoft office 365 security deep dive
Andy Malone - Microsoft office 365 security deep dive
 
Brochure quiterian DDWeb
Brochure quiterian DDWebBrochure quiterian DDWeb
Brochure quiterian DDWeb
 
Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control	Protect your Database with Data Masking & Enforced Version Control
Protect your Database with Data Masking & Enforced Version Control
 
Company Profile CDS Services
Company Profile CDS ServicesCompany Profile CDS Services
Company Profile CDS Services
 
DSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
DSS.LV - Principles Of Data Protection - March2015 By Arturs FilatovsDSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
DSS.LV - Principles Of Data Protection - March2015 By Arturs Filatovs
 
KASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
KASHTECH AND DENODO: ROI and Economic Value of Data VirtualizationKASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
KASHTECH AND DENODO: ROI and Economic Value of Data Virtualization
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Recently uploaded

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
Brokering Data: Accelerating Data Evaluation with Databricks White Label

  • 1. Brokering Data: Accelerating Data Evaluation with Databricks White Label Vinoo Ganesh, Chief Technology Officer, Veraset Nick Chmura, Director of Data Engineering, Veraset
  • 2. Agenda § Background § Session Goals § The Data Ecosystem § Data Primitives § Data Brokers + Challenges § The Broker’s Dilemma § Demo
  • 3. Background • About Us • Vinoo Ganesh (CTO, Veraset) • Nick Chmura (Director of Data Engineering, Veraset) • Data-As-A-Service Startup • Anonymized Geospatial Data • Model Training at Scale • Heavily used during COVID-19 Investigations / analyses • Process, Analyze, and Deliver >2 PB Data Yearly • Data is Our Product • We don’t build analytical tools / visualizations • Optimized data storage, retrieval, and processing are our lifeblood • “Just Data” We’re Hiring! vinoo@veraset.com
  • 4. Session Goals • Explain the Data Brokerage ecosystem • Challenges of brokering data • Concerns around sensitivity, privacy, and security • Conceptual: “Agnostic” Brokerage • Technology agnostic, system agnostic • Differentiating data • Demo: White Label
  • 5. The Data Ecosystem • Inundation of Data – clean + unclean • “Data scientists spend about 45% of their time on data preparation tasks…” - Datanami • “Uptime” SLAs around Data • Data scale increasing • Co-Dependence: analytics + brokers • Data sensitivity and privacy concerns The oil that powers the engine.
  • 6. Data Primitives ▪ Schema changes == API breaks ▪ Format evolution == major version bumps ▪ Uptime SLA • Open vs proprietary data formats • Technology specific formats • Formats focused on individual workflows • Presentation of data speaks volumes • Inherently opinionated • Data == API • Used + expensive data > unused + cheap data • Data graveyard • Only as useful as it is easy to use
  • 7. Enter: The Brokers • Source, clean, package, distribute data • Impose the SLAs (Data Primitives) on data • Remove the complexities of operationalizing data • Maintain the data • Secure and protect the data Goal: Make data easy to use
  • 8. Challenges (1): Brokering Data Making the data easy to use… isn’t easy. • Tech Specifications (e.g., File Format, Partitioning) • Variable Compute on the customer side • Right Tools / Environment / Libraries • TTFB (time to first byte) • Securing IP (the data) • Product Differentiation
  • 9. Challenges (2): Brokering Data Making the data easy to use… isn’t easy. • Live Data > Static Data, Live Compute > Static Compute • Auditing Query Access • Defining Quality Metrics / Creating A Narrative • Security / Privacy • Row-level permissions
  • 11. The Broker’s Dilemma • How can brokers demonstrate value, protect their IP, differentiate their product, minimize technical setup, ensure a cutting-edge privacy/security story, audit their users’ access to data, and everything else we discussed… all while making the data as easy to use as possible?
  • 12. The Solution: Veraset, Databricks, Privacera • Analysis • Data • Security/Privacy
  • 13. The Solution – Databricks White Label • A branded “White Labeled” Databricks instance • Preloaded notebook on each instance • Preloaded data on each instance • Fine-grained access control (including read/write permissions) • Audit Trail • Row-Level Security
  • 14. Demo: White Labeled Databricks
  • 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
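
Slides 9 and 13 list row-level security and an audit trail among the things a broker needs when letting a prospect evaluate data. As a rough illustration of how those two ideas fit together, here is a minimal in-memory sketch in plain Python. It is not the Databricks or Privacera mechanism shown in the demo; the `BrokeredTable` class, the user names, and the sample rows are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]


@dataclass
class BrokeredTable:
    """Hypothetical stand-in for a dataset exposed to evaluators."""

    rows: list[Row]
    # Per-user row predicates (an assumption: real systems store policies elsewhere).
    policies: dict[str, Callable[[Row], bool]]
    # Each entry records (user, query description, number of rows returned).
    audit_log: list[tuple[str, str, int]] = field(default_factory=list)

    def query(self, user: str, description: str) -> list[Row]:
        # Users without a policy see nothing (deny by default).
        allowed = self.policies.get(user, lambda row: False)
        visible = [row for row in self.rows if allowed(row)]
        # Every access is logged, including ones that return no rows.
        self.audit_log.append((user, description, len(visible)))
        return visible


table = BrokeredTable(
    rows=[
        {"region": "US", "visits": 120},
        {"region": "US", "visits": 75},
        {"region": "EU", "visits": 40},
    ],
    policies={"evaluator_a": lambda row: row["region"] == "US"},
)

us_rows = table.query("evaluator_a", "sample visits")
no_rows = table.query("unknown_user", "sample visits")
```

In this sketch the row filter and the audit record live in the same code path, so an evaluator cannot read data without leaving a log entry. That coupling is a design choice of the example, not a claim about how the white-label product enforces it.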