SlideShare a Scribd company logo
DataHub: Collaborative Data
Science and Dataset Version
Management at Scale
Aditya Parameswaran
U Illinois 1
Deep, Dark Secrets of Data Science
2
Mo#va#on'
!  The'“pain'point”'is'increasingly'managing'the'process'
!  Which'datasets'are'being'used'and'where'did'they'come'from'
!  Who'is'edi#ng'what'or'who'generated'which'results'
!  What'types'of'analyses'have'been'conducted'
!  Where'did'this'“plot.png”'file'come'from'
!  What'to'do'when'I'discover'an'error'in'a'dataset'
!  How'did'today’s'results'compare'to'yesterday’s'results'
!  Which'datasets'should'I'use'to'further'my'analysis'
!  Many'ad'hoc'data'management'systems'(e.g.,'Dropbox)'being'used'
!  Much'of'the'data'is'unstructured'so'typically'can’t'use'DBs'
!  The'process'of'data'science'itself'is'quite'ad'hoc'and'exploratory'
!  Scien#sts/researchers/analysts'are'preTy'much'on'their'own'
Courtesy: XKCD
How bad could dataset
management get?
3
4
Chicago Illinois Maryland MIT
Aaron
Elmore
Aditya
Parameswaran
Amol
Deshpande
Sam
Madden
Anant
Bhardwaj
The InvestigatorTeam
Amit
Chavan
Shouvik
Bhattacherjee
A True (Horror) Story of
Dataset Management
5
Before
What did we learn?
6
We use about 100TB of data across
20-30 researchers
We spend a LOT of money on this.
Everything is organized around shared
folders, and everyone has access.
Our dataset management scheme
is so simple, it’s great!
Research
Scientist
What did we learn?
7
They typically make a private copy.
Us
So how do users work on datasets?
But wouldn’t that mean lots of
redundant versions and duplication?
Yes. That’s why our storage is 100TB.
1: Massive redundancy
in stored datasets
What did we learn?
8
Sure, but we have no way of knowing
or resolving modifications
Us
Do you have datasets being analyzed
by multiple users simultaneously?
But wouldn’t that mean you cannot
combine work across users
True.The users will need to discuss.
II:True collaboration
is near impossible!
What did we learn?
9
All the time!
Us
Do you get rid of redundant datasets,
given that you have space issues?
What if the user had left, and if the
dataset is crucial for reproducibility?
We cross our fingers!
III: Unknown
dependencies
between datasets
What did we learn?
10
Not really.They talk to me.
Us
Is there any way users can search for
specific dataset versions of interest?
What if you leave?
Let’s pray for the group’s sake that that
doesn’t happen!
IV: No organization
or management of
dataset versions.
What did we learn?
11
1.  Massive redundancy in stored datasets
2.  Truly collaborative data science is impossible
3.  Unknown dependencies between dataset versions
4.  No efficient organization or management of datasets
The four
Happens all the time…
12
1.  Massive redundancy in stored datasets
2.  Truly collaborative data science is impossible
3.  Unknown dependencies between dataset versions
4.  No efficient organization or management of datasets
Every collaborative data science project ends up in
dataset version management hell
Surely, there must
be a better way?
Have we seen this before?
13
Analogous to management of source code
before source code version control!
How about:
DataHub: a “GitHub for data”
1.  Massive redundancy in stored datasets
2.  Truly collaborative data science is impossible
3.  Unknown dependencies between versions
4.  No efficient organization or management
Compact storage
“Branching” allowed
Explicit and implicit
Rich retrieval methods
Solving the “AYS” problems
What about alternatives?
14
Many issues with directly using GitHub or SC-VC:
•  Cannot handle large datasets or large # of versions
•  Querying and retrieval functionality is primitive
•  Datasets have regular repeating structure
Many issues with temporal databases: similar issues, plus
one major one:
•  Only supports a linear chain of versions
The Vision for DataHub
15
The
for collaborative data science and
dataset version management
satisfying all your dataset book-keeping needs.
The Vision for DataHub
16
Basics:
•  Efficient maintenance and management of
dataset versions
DataHub will also have:
•  A rich query language encompassing data and
versions
•  In-built essential data science functionality such as
ingestion, and integration, plus API hooks to
external apps (MATLAB, R, …)
17
Ingest (Import)
Version
Management
Sharing, Collaboration
Raw Files
Fork, Branch,
Merge
Database System
Query Language
Integrate / Visualize / Other Apps
DataHub Architecture
18
Data:
Versioned
Datasets
Metadata:
Version Graphs
Indexes,
Provenance
Dataset Versioning Manager
Versioning API Versioning QL
INGEST INTEGRATE OTHER Client Applications
Client Applications
DataHub: A Collaborative Dataset Management Platform
Support for Data Science
Data Model and Basic API
19
Key Value
Sam (Berkeley, 2003, Hellerstein)
Amol (Berkeley, 2004, Hellerstein)
Aaron (UCSB, 2014, El Abbadi and
Agrawal)
Key School Year Advisor
Sam Berkeley 2003 Hellerstein
Amol Berkeley 2004 Hellerstein
Aaron UCSB 2014 El Abbadi and
Agrawal
Flexible “Schema-later” Data Model
Groups of records with different schemas in same table
Standard git commands: branch, commit, fork,
merge, rollback, checkout
Versions
Metadata
Storing and Retrieving Versions
20
Version 0
Sam, $50, 1
Amol, $100, 1
Master
+ Mike, $150, 1
Version 1
+ Aditya, $80, 1
Version 1.1
+ Amol, $100, 0
T1
T2 T3
T4
visible bit
Deletes Amol
e 3: Example of relational tables created to encode 4 ver-
with deletion bits.
dition to VQL, which is a SQL-like language, DSVC will
upport a collection of flexible operators for record splitting
tring manipulation, including regex functionality, similarity
h, and other operations to support the data cleaning engine, as
Simplest Strawman Approach:
Store: For every version, store “delta” from previous DAG version
Retrieve: Start from version pointer, walk up to root
The Good:
•  Somewhat Compact
The Bad:
•  Inefficient to construct versions
Walk up entire chains
•  Inefficient to look up all versions
that contain a tuple
Q:Why store delta from the previous version?
Q:Why not materialize some versions completely?
Q:What kind of indexes should we use?
Branching and Merging
21
More questions than answers!
•  Q: How do we allow users operate on servers and/or their
local machines without missing updates?
•  Q:What if the datasets are large? Can users work on
samples?
•  Q: How do we detect conflicts and allow users to merge
conflicting branches with as little effort as possible?
Rich Query Language
22
Can combine versions and data!
SELECT * FROM R[V1], R[V4] WHERE R[V1].ID = R[V4].ID
SELECT VNUM FROM VERSIONS(R) WHERE EXISTS
(SELECT * FROM R[VNUM] WHERE NAME=‘AARON’)
Other examples: Find…
•  All versions that are vastly different in size from a given version.
•  The first version where a certain tuple was introduced
•  All tuples that were introduced in a given version and
subsequently deleted
Still a work in progress!
Screenshots
23
App: Ingest by Example
24
Example from
Data Wrangler
Paper
App: Automatic Visualization
25
Papers in the works..
• Fundamentals:
• Blobs: Exploring the trade-off between storage
and recreation/retrieval cost for blob stores
• Relational: Exploring SQL-based versioning
implementations and indexing
• Add-on functionality:
• Ingest: Ingest by example
• Viz: Automatically generating query visualizations
26
To Summarize
• Dataset management as of today is bad, bad, bad
• DataHub is “GitHub for data”; an essential prerequisite to
collaborative data science
•  Tracking, managing, reasoning about, and retrieving versions
•  Fundamental building block for study of other problems
• DataHub has in-built data science functionality, plus hooks
•  Ingestion: ingest by example
•  Integration: search, and auto-integrate
•  Provenance: explicit and implicit
•  Visualization: manual and automatic 27
Lots of related work!
Integrated with
versioned storage
To find out more and
contribute…
28
datahub.csail.mit.edu
Aditya Parameswaran
data-people.cs.illinois.edu

More Related Content

What's hot

Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
Databricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
Carole Gunst
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
GoDataDriven
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
Matillion
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
DataWorks Summit
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best Practices
CitiusTech
 
Azure+Databricks+Course+Slide+Deck+V4.pdf
Azure+Databricks+Course+Slide+Deck+V4.pdfAzure+Databricks+Course+Slide+Deck+V4.pdf
Azure+Databricks+Course+Slide+Deck+V4.pdf
Chitresh Kaushik
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Graph databases
Graph databasesGraph databases
Graph databases
Vinoth Kannan
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
DataWorks Summit/Hadoop Summit
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 

What's hot (20)

Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best Practices
 
Azure+Databricks+Course+Slide+Deck+V4.pdf
Azure+Databricks+Course+Slide+Deck+V4.pdfAzure+Databricks+Course+Slide+Deck+V4.pdf
Azure+Databricks+Course+Slide+Deck+V4.pdf
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Graph databases
Graph databasesGraph databases
Graph databases
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 

Viewers also liked

Version Control and Continuous Integration
Version Control and Continuous IntegrationVersion Control and Continuous Integration
Version Control and Continuous Integration
Geff Henderson Chang
 
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 ConferneceOpen Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
sanjay4sap
 
Smart Grid Analytics: All That Remains to be Ready is You
Smart Grid Analytics: All That Remains to be Ready is YouSmart Grid Analytics: All That Remains to be Ready is You
Smart Grid Analytics: All That Remains to be Ready is You
Lauren Watters
 
Energy utilities cis and billing
Energy utilities cis and billingEnergy utilities cis and billing
Energy utilities cis and billingBiju M R
 
A brief introduction to version control systems
A brief introduction to version control systemsA brief introduction to version control systems
A brief introduction to version control systems
Tim Staley
 
Datahub-infotilaisuus 31.5.2016
Datahub-infotilaisuus 31.5.2016Datahub-infotilaisuus 31.5.2016
Datahub-infotilaisuus 31.5.2016
Fingrid Oyj
 
What is version control software and why do you need it?
What is version control software and why do you need it?What is version control software and why do you need it?
What is version control software and why do you need it?
Leonid Mamchenkov
 
Introduction to Version Control and Configuration Management
Introduction to Version Control and Configuration ManagementIntroduction to Version Control and Configuration Management
Introduction to Version Control and Configuration ManagementPhilip Johnson
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Cloudera, Inc.
 
Introduction to Version Control
Introduction to Version ControlIntroduction to Version Control
Introduction to Version Control
Jeremy Coates
 
SAP Hybris Marketing - Cosmin Costea
SAP Hybris Marketing - Cosmin CosteaSAP Hybris Marketing - Cosmin Costea
SAP Hybris Marketing - Cosmin Costea
Institutul de Marketing
 
Engineering Energy& Utilities
Engineering  Energy& UtilitiesEngineering  Energy& Utilities
Engineering Energy& Utilitiesstmobili
 
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
pd7.group
 
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...youngculture
 

Viewers also liked (15)

Version Control and Continuous Integration
Version Control and Continuous IntegrationVersion Control and Continuous Integration
Version Control and Continuous Integration
 
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 ConferneceOpen Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
Open Source in Entperprises - A Presentation by SAP at OSCON 2014 Confernece
 
eMeter
eMetereMeter
eMeter
 
Smart Grid Analytics: All That Remains to be Ready is You
Smart Grid Analytics: All That Remains to be Ready is YouSmart Grid Analytics: All That Remains to be Ready is You
Smart Grid Analytics: All That Remains to be Ready is You
 
Energy utilities cis and billing
Energy utilities cis and billingEnergy utilities cis and billing
Energy utilities cis and billing
 
A brief introduction to version control systems
A brief introduction to version control systemsA brief introduction to version control systems
A brief introduction to version control systems
 
Datahub-infotilaisuus 31.5.2016
Datahub-infotilaisuus 31.5.2016Datahub-infotilaisuus 31.5.2016
Datahub-infotilaisuus 31.5.2016
 
What is version control software and why do you need it?
What is version control software and why do you need it?What is version control software and why do you need it?
What is version control software and why do you need it?
 
Introduction to Version Control and Configuration Management
Introduction to Version Control and Configuration ManagementIntroduction to Version Control and Configuration Management
Introduction to Version Control and Configuration Management
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
Introduction to Version Control
Introduction to Version ControlIntroduction to Version Control
Introduction to Version Control
 
SAP Hybris Marketing - Cosmin Costea
SAP Hybris Marketing - Cosmin CosteaSAP Hybris Marketing - Cosmin Costea
SAP Hybris Marketing - Cosmin Costea
 
Engineering Energy& Utilities
Engineering  Energy& UtilitiesEngineering  Energy& Utilities
Engineering Energy& Utilities
 
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
Managing Requirements in Agile Development - Best Practices for Tool-Based Re...
 
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...
Developing enterprise ecommerce solutions using hybris by Drazen Nikolic - Be...
 

Similar to DataHub

How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
Samet KILICTAS
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
Tao Xie
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
dallemang
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
ArangoDB Database
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
markgrover
 
Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015
Chip Childers
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
Eli White
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
Oscar Corcho
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
DataWorks Summit
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
Russell Jurney
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Stavros Papadopoulos
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
Chris Dwan
 
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne UlitmatumElsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Anita de Waard
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYAAditya Srinivasan
 
Ontologies & linked open data
Ontologies & linked open dataOntologies & linked open data
Ontologies & linked open data
João Rocha da Silva
 

Similar to DataHub (20)

How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne UlitmatumElsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYATCS_DATA_ANALYSIS_REPORT_ADITYA
TCS_DATA_ANALYSIS_REPORT_ADITYA
 
Ontologies & linked open data
Ontologies & linked open dataOntologies & linked open data
Ontologies & linked open data
 

Recently uploaded

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 

Recently uploaded (20)

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 

DataHub

  • 1. DataHub: Collaborative Data Science and Dataset Version Management at Scale Aditya Parameswaran U Illinois 1
  • 2. Deep, Dark Secrets of Data Science 2 Mo#va#on' !  The'“pain'point”'is'increasingly'managing'the'process' !  Which'datasets'are'being'used'and'where'did'they'come'from' !  Who'is'edi#ng'what'or'who'generated'which'results' !  What'types'of'analyses'have'been'conducted' !  Where'did'this'“plot.png”'file'come'from' !  What'to'do'when'I'discover'an'error'in'a'dataset' !  How'did'today’s'results'compare'to'yesterday’s'results' !  Which'datasets'should'I'use'to'further'my'analysis' !  Many'ad'hoc'data'management'systems'(e.g.,'Dropbox)'being'used' !  Much'of'the'data'is'unstructured'so'typically'can’t'use'DBs' !  The'process'of'data'science'itself'is'quite'ad'hoc'and'exploratory' !  Scien#sts/researchers/analysts'are'preTy'much'on'their'own' Courtesy: XKCD
  • 3. How bad could dataset management get? 3
  • 4. 4 Chicago Illinois Maryland MIT Aaron Elmore Aditya Parameswaran Amol Deshpande Sam Madden Anant Bhardwaj The InvestigatorTeam Amit Chavan Shouvik Bhattacherjee
  • 5. A True (Horror) Story of Dataset Management 5 Before
  • 6. What did we learn? 6 We use about 100TB of data across 20-30 researchers We spend a LOT of money on this. Everything is organized around shared folders, and everyone has access. Our dataset management scheme is so simple, it’s great! Research Scientist
  • 7. What did we learn? 7 They typically make a private copy. Us So how do users work on datasets? But wouldn’t that mean lots of redundant versions and duplication? Yes. That’s why our storage is 100TB. 1: Massive redundancy in stored datasets
  • 8. What did we learn? 8 Sure, but we have no way of knowing or resolving modifications Us Do you have datasets being analyzed by multiple users simultaneously? But wouldn’t that mean you cannot combine work across users True.The users will need to discuss. II:True collaboration is near impossible!
  • 9. What did we learn? 9 All the time! Us Do you get rid of redundant datasets, given that you have space issues? What if the user had left, and if the dataset is crucial for reproducibility? We cross our fingers! III: Unknown dependencies between datasets
  • 10. What did we learn? 10 Not really.They talk to me. Us Is there any way users can search for specific dataset versions of interest? What if you leave? Let’s pray for the group’s sake that that doesn’t happen! IV: No organization or management of dataset versions.
  • 11. What did we learn? 11 1.  Massive redundancy in stored datasets 2.  Truly collaborative data science is impossible 3.  Unknown dependencies between dataset versions 4.  No efficient organization or management of datasets The four
  • 12. Happens all the time… 12 1.  Massive redundancy in stored datasets 2.  Truly collaborative data science is impossible 3.  Unknown dependencies between dataset versions 4.  No efficient organization or management of datasets Every collaborative data science project ends up in dataset version management hell Surely, there must be a better way?
  • 13. Have we seen this before? 13 Analogous to management of source code before source code version control! How about: DataHub: a “GitHub for data” 1.  Massive redundancy in stored datasets 2.  Truly collaborative data science is impossible 3.  Unknown dependencies between versions 4.  No efficient organization or management Compact storage “Branching” allowed Explicit and implicit Rich retrieval methods Solving the “AYS” problems
  • 14. What about alternatives? 14 Many issues with directly using GitHub or SC-VC: •  Cannot handle large datasets or large # of versions •  Querying and retrieval functionality is primitive •  Datasets have regular repeating structure Many issues with temporal databases: similar issues, plus one major one: •  Only supports a linear chain of versions
  • 15. The Vision for DataHub 15 The for collaborative data science and dataset version management satisfying all your dataset book-keeping needs.
  • 16. The Vision for DataHub 16 Basics: •  Efficient maintenance and management of dataset versions DataHub will also have: •  A rich query language encompassing data and versions •  In-built essential data science functionality such as ingestion, and integration, plus API hooks to external apps (MATLAB, R, …)
  • 17. 17 Ingest (Import) Version Management Sharing, Collaboration Raw Files Fork, Branch, Merge Database System Query Language Integrate / Visualize / Other Apps
  • 18. DataHub Architecture 18 Data: Versioned Datasets Metadata: Version Graphs Indexes, Provenance Dataset Versioning Manager Versioning API Versioning QL INGEST INTEGRATE OTHER Client Applications Client Applications DataHub: A Collaborative Dataset Management Platform Support for Data Science
  • 19. Data Model and Basic API 19 Key Value Sam (Berkeley, 2003, Hellerstein) Amol (Berkeley, 2004, Hellerstein) Aaron (UCSB, 2014, El Abbadi and Agrawal) Key School Year Advisor Sam Berkeley 2003 Hellerstein Amol Berkeley 2004 Hellerstein Aaron UCSB 2014 El Abbadi and Agrawal Flexible “Schema-later” Data Model Groups of records with different schemas in same table Standard git commands: branch, commit, fork, merge, rollback, checkout Versions Metadata
  • 20. Storing and Retrieving Versions 20 Version 0 Sam, $50, 1 Amol, $100, 1 Master + Mike, $150, 1 Version 1 + Aditya, $80, 1 Version 1.1 + Amol, $100, 0 T1 T2 T3 T4 visible bit Deletes Amol e 3: Example of relational tables created to encode 4 ver- with deletion bits. dition to VQL, which is a SQL-like language, DSVC will upport a collection of flexible operators for record splitting tring manipulation, including regex functionality, similarity h, and other operations to support the data cleaning engine, as Simplest Strawman Approach: Store: For every version, store “delta” from previous DAG version Retrieve: Start from version pointer, walk up to root The Good: •  Somewhat Compact The Bad: •  Inefficient to construct versions Walk up entire chains •  Inefficient to look up all versions that contain a tuple Q:Why store delta from the previous version? Q:Why not materialize some versions completely? Q:What kind of indexes should we use?
  • 21. Branching and Merging 21 More questions than answers! •  Q: How do we allow users operate on servers and/or their local machines without missing updates? •  Q:What if the datasets are large? Can users work on samples? •  Q: How do we detect conflicts and allow users to merge conflicting branches with as little effort as possible?
  • 22. Rich Query Language 22 Can combine versions and data! SELECT * FROM R[V1], R[V4] WHERE R[V1].ID = R[V4].ID SELECT VNUM FROM VERSIONS(R) WHERE EXISTS (SELECT * FROM R[VNUM] WHERE NAME=‘AARON’) Other examples: Find… •  All versions that are vastly different in size from a given version. •  The first version where a certain tuple was introduced •  All tuples that were introduced in a given version and subsequently deleted Still a work in progress!
  • 24. App: Ingest by Example 24 Example from Data Wrangler Paper
  • 26. Papers in the works.. • Fundamentals: • Blobs: Exploring the trade-off between storage and recreation/retrieval cost for blob stores • Relational: Exploring SQL-based versioning implementations and indexing • Add-on functionality: • Ingest: Ingest by example • Viz: Automatically generating query visualizations 26
  • 27. To Summarize • Dataset management as of today is bad, bad, bad • DataHub is “GitHub for data”; an essential prerequisite to collaborative data science •  Tracking, managing, reasoning about, and retrieving versions •  Fundamental building block for study of other problems • DataHub has in-built data science functionality, plus hooks •  Ingestion: ingest by example •  Integration: search, and auto-integrate •  Provenance: explicit and implicit •  Visualization: manual and automatic 27 Lots of related work! Integrated with versioned storage
  • 28. To find out more and contribute… 28 datahub.csail.mit.edu Aditya Parameswaran data-people.cs.illinois.edu