Dunning - SIGMOD - Data Economy.pptx

FROM ROOTS TO FRUITS: EXPLORING LINEAGE FOR DATASET RECOMMENDATIONS
Ted Dunning, Fellow, HPE
18 June, 2023
the meaning of words lies in their use
the meaning of data lies in its use
(apologies to Dr. Wittgenstein)
A meteorologist’s data
- rainfall
- windspeed
- temperature
A business uses the data to predict umbrella sales
What does the data actually mean?
the meaning of data lies in its use
TRAINING PROCESS
(Diagram labels: README, URL, History; Datasets + Models; Metadata)
We start with explicit metadata.
Examples: column and table names, documentation, common values, and others.
This is encoded as a large artifact × characteristic incidence table.
At this point, direct metadata search is possible.
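The encoding step can be sketched as follows. This is a minimal illustration, not the talk's implementation: the artifact names, the metadata tokens, and the helpers `build_incidence` and `search` are all hypothetical. Each row is an artifact, each column a characteristic, and a direct metadata search is a lookup down one column.

```python
# Hypothetical artifacts and their explicit metadata tokens (illustrative only).
artifact_metadata = {
    "climate_monthly.csv": ["rainfall", "windspeed", "temperature"],
    "umbrella_sales.csv": ["umbrella", "sales", "store"],
    "sales_forecast_model": ["umbrella", "forecast"],
}


def build_incidence(metadata):
    """Return (artifacts, characteristics, rows).

    rows[i][j] is 1 when artifact i carries characteristic j, else 0.
    """
    artifacts = sorted(metadata)
    characteristics = sorted({c for tokens in metadata.values() for c in tokens})
    column = {c: j for j, c in enumerate(characteristics)}
    rows = [[0] * len(characteristics) for _ in artifacts]
    for i, name in enumerate(artifacts):
        for c in metadata[name]:
            rows[i][column[c]] = 1
    return artifacts, characteristics, rows


def search(term, artifacts, characteristics, rows):
    """Direct metadata search: artifacts whose row has a 1 in the term's column."""
    if term not in characteristics:
        return []
    j = characteristics.index(term)
    return [a for a, row in zip(artifacts, rows) if row[j]]


artifacts, characteristics, rows = build_incidence(artifact_metadata)
print(search("umbrella", artifacts, characteristics, rows))
```

A dense list of lists is used only for readability; a table described as "large" would more plausibly be stored sparsely.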
We augment with metadata from all ancestors and descendants in the global data lineage graph.
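One way to picture the augmentation is the sketch below. It assumes the lineage graph is available as (source, derived) edges; the edges, the metadata, and the helpers `reachable` and `augment` are hypothetical names chosen for illustration. Each artifact's metadata is unioned with that of everything reachable upstream or downstream.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source artifact, artifact derived from it).
edges = [
    ("climate_monthly.csv", "sales_forecast_model"),
    ("umbrella_sales.csv", "sales_forecast_model"),
    ("sales_forecast_model", "forecast_report.csv"),
]

# Hypothetical explicit metadata per artifact.
metadata = {
    "climate_monthly.csv": {"rainfall", "windspeed", "temperature"},
    "umbrella_sales.csv": {"umbrella", "sales"},
    "sales_forecast_model": {"forecast"},
    "forecast_report.csv": set(),
}


def reachable(start, neighbors):
    """Breadth-first search: every node reachable from start via neighbors."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def augment(metadata, edges):
    """Union each artifact's metadata with that of its ancestors and descendants."""
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    augmented = {}
    for artifact, own in metadata.items():
        related = reachable(artifact, parents) | reachable(artifact, children)
        merged = set(own)
        for other in related:
            merged |= metadata.get(other, set())
        augmented[artifact] = merged
    return augmented


augmented = augment(metadata, edges)
print(sorted(augmented["sales_forecast_model"]))
```

In this toy graph the model inherits both "rainfall" and "umbrella" from its two inputs, which is what lets those terms co-occur later, while `climate_monthly.csv` does not pick up "umbrella" because the sales file is neither its ancestor nor its descendant.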
Finally, we reduce the characteristic cooccurrences using indicator-based recommendation methods.
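The slide does not name the specific reduction. The sketch below assumes one common choice for indicator-based recommendation, a log-likelihood ratio (G²) test on each pair's 2×2 cooccurrence counts, and keeps only pairs that co-occur more often than chance and score above a threshold. The data, the threshold of 3.0, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical lineage-augmented metadata: artifact -> set of characteristics.
augmented = {
    "a1": {"rainfall", "umbrella"},
    "a2": {"rainfall", "umbrella", "sales"},
    "a3": {"rainfall", "umbrella"},
    "a4": {"temperature", "sales"},
    "a5": {"temperature", "mosquito"},
    "a6": {"temperature", "mosquito", "sales"},
}


def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table; empty cells contribute nothing."""
    n = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:
            total += k * math.log(k * n / (row_sums[i] * col_sums[j]))
    return 2.0 * total


def indicator_pairs(augmented, threshold=3.0):
    """Map each characteristic to the characteristics it anomalously co-occurs with."""
    n = len(augmented)
    holders = defaultdict(set)  # characteristic -> artifacts carrying it
    for artifact, terms in augmented.items():
        for term in terms:
            holders[term].add(artifact)
    indicators = defaultdict(set)
    for a, b in combinations(sorted(holders), 2):
        k11 = len(holders[a] & holders[b])   # both present
        k12 = len(holders[a]) - k11          # a without b
        k21 = len(holders[b]) - k11          # b without a
        k22 = n - k11 - k12 - k21            # neither
        # Keep only positive associations that clear the threshold.
        if k11 * k22 > k12 * k21 and llr(k11, k12, k21, k22) >= threshold:
            indicators[a].add(b)
            indicators[b].add(a)
    return indicators


indicators = indicator_pairs(augmented)
for term in sorted(indicators):
    print(term, "->", sorted(indicators[term]))
```

The direction check matters: in this toy data "umbrella" and "temperature" never co-occur, which also yields a large G² value, but the pair is a negative association and is dropped.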
A NOTE ON IMPLICATIONS
The characteristic indicator matrix is what connects “umbrella” with “rainfall” or “mosquito” with “temperature” + “windspeed”.
QUERY PROCESS
The original query is often textual, possibly a README, augmented by recent project behavior (queries, references).
The query is expanded based on indicators (when they say “umbrellas” they also mean “rainfall”) as well as semantic token embedding using BERT.
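The indicator half of this expansion can be sketched as below. The BERT embedding step is not shown. The indicator map, the crude plural handling, and the function names are assumptions made for illustration only. Each expanded term records where it came from, so a later step can explain it.

```python
import re

# Hypothetical indicator map, as might come out of the training process.
indicators = {
    "umbrella": {"rainfall"},
    "mosquito": {"temperature", "windspeed"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expand_query(text, indicators):
    """Return {term: origin}; origin is 'query' or the query term that implied it."""
    expanded = {}
    for token in tokenize(text):
        # Crude singularization so that "umbrellas" matches "umbrella".
        if token.endswith("s") and token[:-1] in indicators:
            token = token[:-1]
        expanded.setdefault(token, "query")
    # One hop only: implied terms are not expanded again.
    for term in list(expanded):
        for implied in sorted(indicators.get(term, ())):
            expanded.setdefault(implied, term)
    return expanded


expanded = expand_query("Forecasting umbrellas sales by store", indicators)
print(expanded)
```

Limiting the expansion to a single hop is a design choice in this sketch; it keeps a query about umbrellas from drifting through chains of loosely related terms.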
The final results include an explanation of why files or programs are included.
Recommendations        Explanation
positives.csv          “dengue” ancestor
notpositives.csv       “dengue” ancestor
SARIMA_model           “dengue” ancestor
dengue_monthly.csv     “dengue”
climate_monthly.csv    “wind speed”
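A sketch of how such explanations might be attached to results follows. It rests on two loud assumptions: that a label such as “dengue” ancestor means the term matched metadata on one of the artifact's ancestors rather than on the artifact itself, and that the lineage and metadata below, which are invented for illustration, resemble the real ones. Only the file names come from the slide.

```python
# Hypothetical own metadata for some of the artifacts named on the slide.
own_terms = {
    "dengue_monthly.csv": {"dengue"},
    "climate_monthly.csv": {"wind speed"},
    "SARIMA_model": set(),
    "positives.csv": set(),
}

# Hypothetical (transitive) ancestors of each artifact in the lineage graph.
ancestors = {
    "SARIMA_model": ["dengue_monthly.csv", "climate_monthly.csv"],
    "positives.csv": ["SARIMA_model", "dengue_monthly.csv", "climate_monthly.csv"],
}


def explain(query_terms, own_terms, ancestors):
    """Return [(artifact, reasons)] for every artifact matched by a query term.

    A reason is the quoted term for a direct match, or the quoted term followed
    by 'ancestor' when the match comes from an ancestor's metadata.
    """
    results = []
    for artifact, terms in own_terms.items():
        reasons = []
        for term in query_terms:
            if term in terms:
                reasons.append(f'"{term}"')
            elif any(term in own_terms.get(a, set())
                     for a in ancestors.get(artifact, [])):
                reasons.append(f'"{term}" ancestor')
        if reasons:
            results.append((artifact, reasons))
    return results


for artifact, reasons in explain(["dengue"], own_terms, ancestors):
    print(artifact, "|", ", ".join(reasons))
```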
EVALUATION
• Evaluation is difficult due to a lack of public datasets
• Most machine learning examples are truncated to final steps
• Very few non-machine learning pipelines exist outside of toy examples
• Private datasets generally cannot be shared
• Still important to use when possible due to scale
• Evaluation of recommendation engines is a subtle art
• Their purpose is to change behaviors
• Today’s recommendations select tomorrow’s training data
• We aren’t at this point yet; this would be a symptom of success
THANK YOU
ted.dunning@hpe.com
@ted_dunning
@ted_dunning@mastodon.social