1. Curation as Programming: AI,
Data Management, and Mediated
Knowledge Interaction
Bill Howe, Ph.D.
Associate Professor, Information School
Co-Director, Responsible AI Systems & Experiences
Adjunct Associate Professor, Allen School of Computer Science & Engineering
Adjunct Associate Professor, Electrical Engineering
University of Washington
2. A view of LLMs from 6 months ago…
• Bigger is different
• “Emergent” capabilities
• A handful of “Foundation Models” can do any task; no need for specialized models
• Re-training is too expensive anyway
• No way to compete with OpenAI/Microsoft, and to a lesser extent Google
Bill Howe, UW
4. Open-source models gaining on massive private models
Some enablers:
• Chinchilla: a 70B-parameter model trained on ~4X more data outperforms a 280B-parameter model [1]
• LoRA: freeze the pre-trained model, add trainable low-rank decomposition matrices to each layer, greatly reducing the trainable parameters needed for specialization [2]
• 4-bit quantization (replacing each parameter with 4 bits) appears competitive [3]
• Training on small, specialized datasets can offer better results (e.g., [4])
[1] Compute-optimal large language models, https://arxiv.org/abs/2203.15556
[2] LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685
[3] Quantization trades accuracy for memory size and inference latency, https://arxiv.org/abs/2212.09720
[4] Koala: ~1M dialogue examples competitive with massive scrapes, https://bair.berkeley.edu/blog/2023/04/03/koala/
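To make [2] concrete, here is a minimal NumPy sketch of the LoRA idea: freeze a pre-trained weight matrix and train only a low-rank correction B·A. The dimensions and rank here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 1024, 1024, 8
W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def forward(x: np.ndarray) -> np.ndarray:
    # Frozen path plus low-rank correction: equivalent to (W + B @ A) @ x
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted model matches the frozen model exactly.
assert np.allclose(forward(x), W @ x)

trainable = A.size + B.size   # 2 * rank * 1024 = 16,384
frozen = W.size               # 1024 * 1024 = 1,048,576 (a ~64x reduction)
```

During specialization only A and B would receive gradients; the full matrix W is never updated, which is what makes fine-tuning cheap enough to run outside the big labs.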
In the last few months… (timeline: March 3, March 13, March 19)
• Vicuna: 13B params, trained on 8 x A100 GPUs in one day
• Bard: 137B params, thousands of petaflop-days
More info: https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
5. So now: why are OpenAI and others lobbying for regulation in the US?
To create a regulatory barrier and maintain competitive advantage.
Prediction:
Open-source models will remain competitive and available.
The curation and management of specialized, minimal, modular datasets for training and evaluation will drive AI progress.
This is an opportunity for the data management community.
8. Data sources: simulations, the internet, databases, generative models, people (data flows to AI via scrapes; feedback flows back via RLHF)
Ex: most LLMs: massive, noisy data -> huge, general, yet unpredictable models
• GPT-3.5: 175B parameters (800GB)
• Bard: 137B parameters
• GPT-4: ???B parameters
Curation-on-read: scrape an uncurated convenience sample of the internet, then implement guardrails with reinforcement learning from human feedback, and/or by sanitizing inputs and outputs.
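The curation-on-read pattern can be sketched as a thin wrapper around a generator: filter the prompt on the way in and the completion on the way out. Everything here is a hypothetical stand-in; real systems use learned safety classifiers and RLHF, not keyword lists.

```python
# Hedged sketch of "curation-on-read": guardrails applied at query time
# rather than curating the training data. All names are hypothetical.
BLOCKLIST = {"ssn", "credit card"}  # toy stand-in for a learned safety filter

def sanitize_input(prompt: str) -> str:
    """Reject or rewrite prompts before they reach the model."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKLIST):
        raise ValueError("prompt blocked by input guardrail")
    return prompt

def sanitize_output(completion: str) -> str:
    """Post-hoc filter on model output; a real system would use a classifier."""
    for term in BLOCKLIST:
        if term in completion.lower():
            return "[redacted by output guardrail]"
    return completion

def guarded_generate(model, prompt: str) -> str:
    # The model itself is untouched; only its inputs and outputs are curated.
    return sanitize_output(model(sanitize_input(prompt)))

# Usage with a stub "model" that echoes its prompt:
echo = lambda p: p + " ... generated text"
print(guarded_generate(echo, "Tell me about data curation"))
```

The point of the sketch is the architecture: the uncurated training data is never touched, so the guardrail is the only curation the system ever does.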
10. Inputs → Outputs (DALL-E, OpenAI)
• “an armchair in the shape of an avocado…”
• “A teapot in the shape of a rubik’s cube…”
• “A shrimp with sunglasses riding a unicycle…”
https://openai.com/blog/dall-e/
11. “an armchair in the shape of an avocado…”
“A [X] year old girl…” ? ? ?
12. Training data polluted by objectified images of women
Robert Wolfe, Yiwei Yang, Prof. Aylin Caliskan
FAccT 23
15. AlphaFold uses 50 years of PDB
• 50+ years of effort
• 100s of millions USD
• 130k structures
https://www.nature.com/articles/s41586-021-03819-2/figures/1
16. Deep Learning for Weather Prediction
• Cubed-sphere grid
• U-Net CNN (fairly standard)
Jonathon Weyn, UW; Dale Durran, UW; Rich Caruana, Microsoft
J Adv Model Earth Syst, Volume 13, Issue 7, first published 25 June 2021, DOI: 10.1029/2021MS002502
17. Deep Learning for Weather Prediction (JAMES 2020, JAMES 2021)
RMSE at the 500 mbar geopotential vs. forecast day:
• Full-physics model (lower res): worse
• Deep Learning Weather Prediction, DLWP (U-Net, CNNs): comparable
• Full-physics model (comparable res): better
• Full-physics model (very high res): best
• Full-physics model (very high res, ensemble) vs. DLWP (high res, ensemble): better, but not by much!
Runtime: 3 min for a 1-month ensemble with DLWP (+ 2-3 days to train) vs. 16 days for a 1-month ensemble with the full-physics model!
Jonathon Weyn, UW; Dale Durran, UW; Rich Caruana, Microsoft
19. AI for curation / curation for AI
• UrbanSynth: Curating City Data
  – Spatiotemporal imputation
  – Spatiotemporal disaggregation
  – Spatiotemporal integration
  – Information Extraction from Court Records
• Curating Viz Interaction Data
• Mitigating Bias w/ Curated Concept Sets
• SynRD: Evaluating Synthetic Data
• Learning from Curated Ontologies
Venues: SIGMOD 21, AAAI 20; ICML ws 23; VLDB 24; WWW 22, NeurIPS 22; ArXiv 22; VLDB 24 subm.
20. Cities as complex systems vs. cities as assemblies of independent subsystems
• Traffic speed prediction (Liao et al., 2018)
• Demand forecasting (Uber, 2018)
• Crowd flows (Zhang et al., 2018)
• Traffic accident prediction (Yuan et al., 2018)
European Physical Journal 22
21. Curation is the bottleneck…
• Sometimes we have expert-curated supervision
  – 50+ years of expert experimental curation via PDB enables AlphaFold
  – 70+ years of physics-based fluid dynamics models enables DL for weather
• But city data is incomplete, aggregated, disintegrated
• AI to curate city data
22. AI-enabled curation of city data
• Missing data → reconstructed
• Aggregated data → high-res
• Heterogeneous data (1D, 2D, 3D) → learned weights
Bin Han, An Yan
SIGMOD 21; AAAI 20; J Transp Geography 19; VLDB 24 (prep); https://arxiv.org/abs/2301.04233
24. Urban image inpainting (https://arxiv.org/abs/2301.04233)
Handling sparsity: biased masking
Bin Han
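Biased masking can be sketched as follows: because urban rasters are mostly empty, sample training masks preferentially over pixels that actually contain data, so the inpainting model practices reconstructing signal rather than background. The window size and sampling rule here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_mask(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Return a boolean mask: one size x size square centered on a data pixel."""
    h, w = img.shape
    ys, xs = np.nonzero(img)          # candidate centers: non-empty pixels only
    i = rng.integers(len(ys))
    cy, cx = ys[i], xs[i]
    mask = np.zeros((h, w), dtype=bool)
    y0, x0 = max(0, cy - size // 2), max(0, cx - size // 2)
    mask[y0:y0 + size, x0:x0 + size] = True
    return mask

# Toy sparse raster: a small blob of traffic counts in a mostly-zero city grid.
img = np.zeros((64, 64))
img[20:30, 40:50] = 1.0
mask = biased_mask(img)
# The mask always overlaps the data region, unlike uniform random masking,
# which would land on empty background most of the time.
```

The bias is the whole trick: a uniformly placed mask on this grid would cover data only ~2% of the time, so the model would rarely be asked to reconstruct anything.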
25. Parade route anomaly – no traffic, but unrealistic
• 12 PM during parade vs. 4 PM after parade
• Erase the anomaly and synthesize new data
• Low error in the prediction compared to a non-parade day
https://arxiv.org/abs/2301.04233
Bin Han
26. New York divided into regions at different scales
Given aggregate data at a coarse level, learn to disaggregate it to finer levels (supervised by individual data or other variables).
Bin Han
VLDB 24 (prep)
27. Idea: spatially coherent architecture
• Align the model with the aggregation levels; compute loss at each level
• Use the prediction at the block level to predict at the tract level, and so on
• Force spatial coherence; learn all aggregation levels simultaneously
• Different loss strategies: full reconstruction, prior layer only, bottom-up only
Overall:
• A baseline NN outperforms all classical methods
• Coherent architectures outperform the baseline by a significant margin
Bin Han
VLDB 24 (prep)
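The per-level loss idea can be sketched as: predict at the finest level, aggregate predictions up the spatial hierarchy, and score each level against its own ground truth. The two-level grid hierarchy and plain MSE here are illustrative assumptions; the actual architecture and loss strategies are richer.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate(fine: np.ndarray, factor: int = 2) -> np.ndarray:
    """Sum factor x factor blocks of fine-level values into the coarser level."""
    h, w = fine.shape
    return fine.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def coherent_loss(pred_fine, true_fine, true_coarse):
    """MSE at the fine level plus MSE between the aggregated prediction
    and the coarse-level ground truth (the spatial-coherence term)."""
    fine_term = np.mean((pred_fine - true_fine) ** 2)
    coarse_term = np.mean((aggregate(pred_fine) - true_coarse) ** 2)
    return fine_term + coarse_term

true_fine = rng.random((8, 8))
true_coarse = aggregate(true_fine)   # coarse truth is consistent by construction
perfect = coherent_loss(true_fine, true_fine, true_coarse)      # exactly zero
noisy = coherent_loss(true_fine + 0.1, true_fine, true_coarse)  # penalized twice
```

A perfect fine-level prediction is automatically coherent with the coarse level, while a biased one is penalized at every level of the hierarchy, which is the point of learning all aggregation levels simultaneously.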
28. Encapsulating urban dynamics: learning reusable representations from multi-source data
bikeshare demand + ML Model → Prediction, via a shared representation Z
An Yan
SIGMOD 21
31. Exogenous features matter
Our “everything at once” model approximates perfect variable selection; optimizing for fairness preserves accuracy.
(Plot legend: oracle-selected variables; single variable; EquiTensors (all variables))
33. Viz Recommendation Trajectory…
VizDeck (SIGMOD 12, iConference 13): generate lots of visualizations directly from data properties and design rules; try to learn from what users select.
(no real ML, because we didn’t know how to do it + bad vis libraries)
Voyager 1 (Vis 15), 2 (Vis 17): generate alternate visualizations, in a principled way, from a user-created seed & report on how user behavior changes.
(no real ML, because I couldn’t convince Ham, Dom, and Jeff to do it :)
DRACO (Vis 18, Best Paper): generate visualizations according to design rules in answer set programming; learn weights on rules from data.
(a little ML, but we lacked good datasets for training)
Lilly (NLViz workshop 19): vision for AI-assisted non-expert analysis and storytelling.
(We didn’t build it then; now it’s much easier, but we still lack data)
35. Multi-view consistency checks
Afford direct navigation of the revision space: hover for preview, click to set the root node.
Curated interaction log
Zening Qu
36. Surj: Ontological Learning for Fast, Accurate, and Robust Hierarchical Multi-label Classification
Problem: classify into an ontology
Simple idea: embed the ontology, embed the data, learn the mapping
SoTA results, better robustness, much faster
Sean Yang
GraphLearning@WWW 22
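The "embed the ontology, embed the data, learn the mapping" recipe can be sketched as two projections into a shared space, with every ontology node scored per example. The dimensions, encoders, and scoring are illustrative assumptions, not Surj's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_labels, d_label, d_input, d_shared = 6, 16, 32, 8
label_emb = rng.standard_normal((n_labels, d_label))      # e.g., from a graph embedding of the ontology
W_label = rng.standard_normal((d_label, d_shared)) * 0.1  # learned projection for labels
W_input = rng.standard_normal((d_input, d_shared)) * 0.1  # learned projection for data

def predict(x: np.ndarray) -> np.ndarray:
    """Sigmoid score for each ontology node: multi-label by construction."""
    z_x = x @ W_input          # data embedding in the shared space
    z_l = label_emb @ W_label  # ontology embedding in the shared space
    logits = z_x @ z_l.T       # one score per (example, ontology node) pair
    return 1.0 / (1.0 + np.exp(-logits))

scores = predict(rng.standard_normal((4, d_input)))   # 4 examples x 6 nodes
# Hierarchy consistency (a child's score never exceeding its parent's) would be
# enforced by the training loss or a post-hoc projection; omitted here.
```

Because the ontology is embedded rather than hard-coded into the output layer, the same trained mapping can score nodes it rarely saw, which is one route to the robustness the slide claims.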
37. Surj wins on almost all existing benchmarks, but we show the existing benchmarks aren’t measuring anything useful.
Tool to generate better benchmarks (along a difficulty spectrum from impossible to trivial, with “good” benchmarks in between)
Sean Yang, Bernease Herman
NeurIPS 22
38. 2nd-to-last slide
As models become commoditized, AI reduces to data curation.
A hundred flowers will blossom – many specialized models, trained on many specialized datasets.
“Curation” means filtering, cleaning, and integrating multi-modal data from a variety of sources: web, structured DBs, generative models, simulations, crowdsourcing.
These are core strengths for the DB community: queries are curation.
Databases are “training set management systems.”
39. Last slide
Examples of open questions:
• Querying for data near the decision boundary of a model
• Querying generative models at scale: create a diverse set of images with specific properties
• Dataset minimization: smaller dataset, same model performance
• Data-aware model minification
• Experiment management systems
• Synthetic data management systems
• Bias mitigation in queries: return query results with specific statistical properties
40. Students: Sean Yang, Bernease Herman, Bin Han, Yiwei Yang, Maxim Grechkin, Poshen Lee, Binbing Wen, Jackson Brown, Annie Yan, Shrainik Jain (alumni/graduated and students in years 1-3)
41. Where we are…
• Sometimes we have highly curated data for supervision
  – 50+ years of expert experimental curation via PDB enables AlphaFold
  – plus lots of domain expertise in the model itself
• Other times data is incomplete or disintegrated
  – So: use AI to replace missing data, disaggregate, and integrate to complete the data record
  – Also: synthesize scenarios that don’t appear in the data record
    • Improve robustness to unseen / OOD events
    • Disasters, anomalies, hypothetical situations
• Other times the bias is more subtle
  – Next: unlearning bias in vision-language models
42. Curation as programming in vision models
Vision models trained on convenience samples learn societal biases.
Goal: penalize this bias during training, without human labeling.
Yiwei Yang
43. Women also Snowboard: Overcoming Bias in Captioning Models (Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach, CVPR 2018)
• “A man sitting at a desk with a laptop computer.” / “A woman sitting in front of a laptop computer.” – right answer, wrong reasons
• “A man holding a tennis racquet on a tennis court.” (same caption for both images) – wrong answer, wrong reasons
• The proposed solution is specific to gender. Two new loss terms: 1) reduce confidence when gender information is absent; 2) increase confidence when it is present.
44. Hypothetical “good” visual explanations: nurse, doctor, doctor
Yiwei Yang
45. Hypothetical bad visual explanations, reflecting societal biases: nurse, doctor, nurse
Yiwei Yang
46. Key idea: penalize the model for using the wrong explanations
How?
1. Curate a concept set of images representing a particular bias (e.g., gender). (OK to use generative models, web search, or anything else you have.)
2. Penalize the model’s sensitivity to this concept set during training with a new loss term.
Benefits: no need for explicit human annotations; concept sets are reusable across models; concept sets can be application-specific – anything that can be represented by images.
Yiwei Yang
47. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Kim et al., ICML 2018, https://arxiv.org/abs/1711.11279
48. Concept: stripes; example class: zebra
A model is split at layer l into an encoder f_l and a decoder h_l.
Concept Activation Vector (CAV) v_C: the direction in layer-l activation space pointing toward the concept set C (e.g., “stripes”).
Sensitivity of class k at layer l to concept set C: S_{C,k,l}(x) = ∇h_{l,k}(f_l(x)) · v_C
In words: given an image, if we perturb its representation at layer l in the direction of the concept set “stripes,” how much would the model’s prediction of zebra change?
Kim et al., ICML 2018, https://arxiv.org/abs/1711.11279
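The TCAV sensitivity above can be sketched numerically: build the CAV as the (normalized) difference between concept and random activation centroids, then take the directional derivative of a class logit along it. The toy linear decoder and stand-in activations are assumptions for illustration; real CAVs are fit on a trained network's activations.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                                  # layer-l activation dimension
W = rng.standard_normal((3, d)) * 0.1   # toy linear decoder h: logits = W @ a

def decoder_grad(a: np.ndarray, k: int) -> np.ndarray:
    """Gradient of class-k logit w.r.t. the layer-l activation (row k of W here)."""
    return W[k]

# CAV: difference of centroids between concept-image activations and
# random-image activations at layer l, normalized to unit length.
concept_acts = rng.standard_normal((50, d)) + 1.0   # stand-in "stripes" activations
random_acts = rng.standard_normal((50, d))
v_C = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
v_C /= np.linalg.norm(v_C)

def tcav_sensitivity(a: np.ndarray, k: int) -> float:
    """S_{C,k,l}(x) = grad h_k(f_l(x)) . v_C  (directional derivative along the CAV)."""
    return float(decoder_grad(a, k) @ v_C)

a = rng.standard_normal(d)          # layer-l activation of one image
s = tcav_sensitivity(a, k=0)        # change in the class-0 logit per unit step along v_C
```

With a linear decoder the sensitivity is exactly the finite-difference slope along v_C; for a real network it is the local directional derivative at that image's activation.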
49. Loss term to penalize inappropriate explanations
e.g., “hairstyle should not be used to determine profession”
For each x in the batch, and for each class logit, take the gradient of the decoder and dot it with the vector pointing toward “hairstyle” (the centroid of images representing hairstyle minus the centroid of random images); penalize that sensitivity.
Yiwei Yang
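The slide's double loop can be sketched as a penalty on gradient-CAV alignment, added to the task loss. The linear decoder and squared-alignment penalty are illustrative assumptions; note that with a linear decoder the gradient is the same for every x, whereas in a real network it depends on x through the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_classes = 16, 4
W = rng.standard_normal((n_classes, d)) * 0.1   # toy decoder: logits = acts @ W.T

# Concept direction: centroid of concept-set activations minus centroid of
# random-image activations (no per-image human labels required).
concept_acts = rng.standard_normal((50, d)) + 1.0
random_acts = rng.standard_normal((50, d))
v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
v /= np.linalg.norm(v)

def concept_penalty(acts: np.ndarray) -> float:
    """Sum over the batch and over class logits of squared sensitivity to v."""
    total = 0.0
    for _ in acts:                   # for each x in the batch...
        for k in range(n_classes):   # ...and for each class logit:
            total += float(W[k] @ v) ** 2   # penalize gradient-CAV alignment
    return total

batch = rng.standard_normal((8, d))
penalty = concept_penalty(batch)   # added to the task loss with some weight lambda
```

Driving this penalty to zero makes every class logit locally insensitive to the concept direction, i.e., "hairstyle" stops being a usable explanation for "profession."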
50. Test: spurious correlation by construction
• Mismatched foreground and background
• Matched foreground and background
The training set encourages the model to attend to the background, not the bird.
51. Dataset planted with a spurious correlation (attention maps: red = high attention, blue = low attention)
The model ignores the bird, and misclassifies it.
52. A simple experiment
Adapted from the Colored MNIST dataset; Arjovsky et al., Invariant Risk Minimization, https://arxiv.org/abs/1907.02893
• Number analogous to profession
• Color analogous to race, gender, or any other bias we want to unlearn
• No labels available
Curate concept sets representing… redness, blueness, greenness
Yiwei Yang
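The Colored-MNIST-style setup can be sketched as follows: tint each grayscale digit with a color that correlates with its label during training, so a lazy model can cheat by reading the color instead of the shape. The correlation strength and color choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

COLORS = np.array([[1.0, 0.2, 0.2],   # class 0 -> red
                   [0.2, 1.0, 0.2],   # class 1 -> green
                   [0.2, 0.2, 1.0]])  # class 2 -> blue

def colorize(digits: np.ndarray, labels: np.ndarray, p_corr: float = 0.9):
    """Tint each (H, W) digit; with probability p_corr the color matches the label."""
    n = len(digits)
    match = rng.random(n) < p_corr
    color_ids = np.where(match, labels, rng.integers(0, 3, n))
    # Broadcast grayscale (n, H, W, 1) against per-image color (n, 1, 1, 3).
    return digits[..., None] * COLORS[color_ids][:, None, None, :], color_ids

digits = rng.random((100, 28, 28))   # stand-in for MNIST images
labels = rng.integers(0, 3, 100)
colored, color_ids = colorize(digits, labels)
# In training, color predicts the label ~90% of the time: the spurious cue
# the concept-set penalty is meant to unlearn.
```

Because the correlation breaks at test time, a model that learned the color shortcut fails badly, which is exactly what the next slide's confusion matrices show.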
53. Goal: predict the correct digit based on its shape, but ignore the bias of color
Initial results: we can mitigate bias by providing lots of images representing colors and penalizing representations that use them.
• Original model is totally fooled (confusion matrix: predictions over classes 0-2 do not track ground truth)
• The loss term significantly improves (confusion matrix: predictions track ground truth)
Concept set: Google image search for “red”, “green”, “blue”
Open question: how much does the quality of the concept set matter? (e.g., “Women” vs. “Woman Doctor”)
Yiwei Yang
Editor's Notes
High-expertise means balancing multiple competing quality metrics – LLMs are making this difficult.
Improving Data-Driven Global Weather Prediction Using Deep Convolutional Neural Networks on a Cubed Sphere, Jonathan A. Weyn, Dale R. Durran, Rich Caruana, Journal of Advances in Modeling Earth Systems, volume 12, issue 9, August 2020 https://doi.org/10.1029/2020MS002109
Sub-Seasonal Forecasting With a Large Ensemble of Deep-Learning Weather Prediction Models, Jonathan A. Weyn, Dale R. Durran, Rich Caruana, Nathaniel Cresswell-Clay, Journal of Advances in Modeling Earth Systems, volume 13, issue 7, July 2021 https://doi.org/10.1029/2021MS002502
Why eviction lawsuit court records?
• Download, OCR, assess accuracy (75%)
• ML: OCR to get tokens, train a model to recognize addresses
• Predict race from location and surname
• Predict sex from first name and cross-reference with the Social Security Administration (SSA) database and census microdata (IPUMS)