Curation as Programming: AI,
Data Management, and Mediated
Knowledge Interaction
Bill Howe, Ph.D.
Associate Professor, Information School
Co-Director, Responsible AI Systems & Experiences
Adjunct Associate Professor, Allen School of Computer Science & Engineering
Adjunct Associate Professor, Electrical Engineering
University of Washington
1
A view of LLMs from 6 months ago…
• Bigger is different
• “Emergent” capabilities
• A handful of “Foundation Models” can do
any task; no need for specialized
models
• Re-training is too expensive anyway
• No way to compete with
OpenAI/Microsoft, and to a lesser extent
Google
Bill Howe, UW 2
3
https://rollcall.com/2023/05/23/lawmakers-suggest-agency-to-supervise-artificial-intelligence/
May 23 2023
https://www.nytimes.com/2023/05/16/technology/openai-altman-artificial-intelligence-regulation.html
Why are OpenAI and others
lobbying for regulation in the US?
4
Open Source models gaining on massive private models
Some enablers:
• Chinchilla: 70B params trained on 4X the data outperforms 280B params [1]
• LoRA: freeze pre-trained model, add trainable rank decomposition matrices to
each layer, greatly reducing training params for specialization [2]
• 4-bit quantization (replace every param with 4 bits) appears competitive [3]
• Training on small, specialized datasets offers better results (e.g., [4])
[1] Compute-optimal large language models, https://arxiv.org/abs/2203.15556
[2] LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685
[3] Quantization trades accuracy for memory size and inference latency, https://arxiv.org/abs/2212.09720
[4] Koala: ~1M dialogue examples competitive with massive scrapes https://bair.berkeley.edu/blog/2023/04/03/koala/
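The LoRA idea in [2] is simple enough to sketch: freeze the pre-trained weight W and learn only a low-rank update BA, so a layer trains r·(d_in + d_out) parameters instead of d_in·d_out. A minimal numpy sketch (illustrative only; the dimensions and initialization here are assumptions, not the paper's code):

```python
import numpy as np

d_out, d_in, r = 64, 64, 4          # r << d: the low-rank bottleneck

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight
B = np.zeros((d_out, r))            # trainable, zero-initialized
A = rng.normal(size=(r, d_in))      # trainable

def forward(x):
    # Frozen path plus low-rank trainable update: (W + B @ A) @ x
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# At initialization B = 0, so the adapted model matches the frozen model exactly
assert np.allclose(forward(x), W @ x)

full = d_out * d_in
lora = r * (d_out + d_in)
print(f"trainable params: {lora} vs {full}")
```

Only A and B receive gradients during specialization, which is why fine-tuning cost drops so sharply.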
March 3 March 13 March 19
More info: https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
In the last few months…
Vicuna: 13B params,
trained on 8 x A100
GPUs in one day
Bard: 137B
params, thousands
of petaflop-days
To create a regulatory barrier and
maintain competitive advantage
Prediction:
Open source models will remain competitive and available.
The curation and management of specialized, minimal, modular
datasets for training and evaluation will drive AI progress.
This is an opportunity for the data management community.
So now: Why are OpenAI and others
lobbying for regulation in the US?
data AI
simulations
internet
databases
generative models
people
Ex: most LLMs
massive, noisy data -> huge, general, yet unpredictable models
GPT 3.5: 175B parameters (800GB)
Bard: 137B parameters
GPT-4: ??? billion parameters (undisclosed)
data AI
rlhf
scrapes
Curation-on-read: scrape an uncurated convenience
sample of the internet, then implement guardrails
with reinforcement learning from human feedback,
and/or by sanitizing inputs and outputs
No more free data
Inputs Outputs
an armchair in the
shape of an
avocado…
DALL-E, OpenAI
A teapot in the
shape of a Rubik’s
cube…
A shrimp with
sunglasses riding a
unicycle…
https://openai.com/blog/dall-e/
an armchair in the
shape of an
avocado…
A [X] year old girl… ? ? ?
Robert Wolfe
Prof. Aylin Caliskan
FAccT 23
Yiwei Yang
Training data polluted by
objectified images of women
simulations
internet
databases
generative models
people
data AI
Curation-on-write: Carefully control the training and
evaluation data, and produce correct, verifiable results
curated, trusted data -> specialized, trusted models
Ex: AlphaFold: Protein folding
Ex: DLWP: Weather prediction
https://www.nature.com/articles/d41586-022-00997-5
https://www.nature.com/articles/nrm3461
AlphaFold
covered 60% of
the structure as
of October 2021,
up from 30% with
prior models
nuclear pore complex:
Largest molecular
machine in human cells
Before 2021: 100,000 structures determined experimentally
and computationally over the last 50 years
Since 2021 with AlphaFold: 992,316 structures and counting;
expecting 130M within a year or two
15
50+ years
100s of millions USD
130k structures
https://www.nature.com/articles/s41586-021-03819-2/figures/1
AlphaFold uses 50 yrs of PDB
16
Cubed sphere grid
U-Net CNN (fairly standard)
Dale Durran, UW
Jonathon Weyn, UW Rich Caruana,
Microsoft
Deep Learning for Weather Prediction
J Adv Model Earth Syst, Volume: 13, Issue: 7, First published: 25 June 2021, DOI: (10.1029/2021MS002502)
17
RMSE of 500 hPa geopotential height vs. forecast day
Full-physics model (lower res) (worse)
Deep Learning DLWP (U-Net, CNNs)
Full-physics model (comparable res) (better)
Full-physics model (very high res) (best)
Dale Durran, UW
Jonathon Weyn, UW Rich Caruana,
Microsoft
Deep Learning for Weather Prediction
3 min for 1-mo. ensemble (+ 2-3 days to train)
vs.
16 days for 1-mo. ensemble!
Deep Learning DLWP (high res, ensemble)
Full-physics model (very high res, ensemble)
(better, but not by much!)
JAMES 2020
JAMES 2021
simulations
internet
databases
generative models
people
data AI
My group’s interests:
AI for curation + curation for AI
AI for curation / curation for AI
• UrbanSynth: Curating City Data
– Spatiotemporal imputation
– Spatiotemporal disaggregation
– Spatiotemporal integration
– Information Extraction from Court Records
• Curating Viz Interaction Data
• Mitigating Bias w/ Curated Concept Sets
• SynRD: Evaluating Synthetic Data
• Learning from Curated Ontologies
19
SIGMOD 21, AAAI 20
ICML ws 23
VLDB 24
WWW22, NeurIPS 22
ArXiv 22
VLDB 24 subm.
Cities as complex systems
Cities as assemblies of
independent subsystems
Traffic speed prediction
(Liao et al., 2018)
Demand forecasting
(Uber, 2018)
Crowd flows
(Zhang et al., 2018)
Traffic accident prediction
(Yuan et al., 2018)
European Physical Journal 22
Curation is the bottleneck…
• Sometimes we have expert-curated supervision
– 50+ years of expert experimental curation via PDB enables AlphaFold
– 70+ years of physics-based fluid dynamics models enables DL for
weather
• But city data is incomplete, aggregated, disintegrated
• AI to curate city data
AI-enabled curation of city data
Reconstructed
Missing
Aggregated High-res
Heterogeneous Learned weights
3D
2D
1D
Bin Han
An Yan
SIGMOD 21
https://arxiv.org/abs/2301.04233
VLDB 24 (prep)
J Transp Geography 19
AAAI 20
Wipe out
Manhattan
Model
recovers
https://arxiv.org/abs/2301.04233
Bin Han
Key ideas:
1) Borrow ideas from image
in-painting
2) Bias the masking to follow
population distribution
3) Use both space and time
4) Data repair by masking
anomalies and
reconstructing
Reconstructing missing data in space and time
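Key idea (2), biased masking, can be sketched in a few lines: instead of masking cells uniformly at random during training, sample mask locations with probability proportional to population, so the model practices reconstructing the regions where the data actually is. (A toy numpy sketch; the grid and densities are made up, not the paper's setup.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8x8 "city" grid: population density, heavier in one quadrant
pop = np.ones((8, 8))
pop[:4, :4] = 10.0

def biased_mask(pop, n_masked, rng):
    """Sample cells to mask with probability proportional to population,
    so training masks concentrate where the signal is dense."""
    p = (pop / pop.sum()).ravel()
    idx = rng.choice(pop.size, size=n_masked, replace=False, p=p)
    mask = np.zeros(pop.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(pop.shape)

mask = biased_mask(pop, n_masked=16, rng=rng)
print(mask[:4, :4].sum(), "of", mask.sum(), "masked cells fall in the dense quadrant")
```

The masked cells then serve as the inpainting targets, exactly as in image in-painting, but weighted toward populated areas.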
Urban image inpainting https://arxiv.org/abs/2301.04233
Handle sparsity: Biased masking
Bin Han
Parade route anomaly –
no traffic, but unrealistic
12 PM during parade
4 PM after parade
Erase the anomaly and
synthesize new data
Low error in the
prediction
compared to a
non-parade day
https://arxiv.org/abs/2301.04233
Bin Han
New York divided into regions at different scales
Given aggregate data at this level
Learn to disaggregate data at these levels
(supervised by individual data or other variables)
Bin Han
VLDB 24 (prep)
31
Idea: Spatially coherent architecture:
Align the model with the aggregation levels;
compute loss at each level
Use prediction at block level to
predict at tract level, and so on.
Force spatial coherence; learn all
aggregation levels simultaneously.
Different loss strategies: full
reconstruction, prior layer only,
bottom-up only.
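The spatial-coherence idea can be sketched with a toy loss: aggregate the fine-grained prediction up one level and penalize disagreement with the observed coarse data, repeating at every aggregation level. (A numpy sketch under an assumed 2×2 block aggregation; not the actual architecture.)

```python
import numpy as np

def aggregate(fine, factor):
    """Sum fine-grained cells into coarser regions (e.g., blocks -> tracts)."""
    h, w = fine.shape
    return fine.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def coherence_loss(pred_fine, target_coarse, factor):
    """MSE between the aggregated fine prediction and the observed coarse data.
    Computing this at every level forces spatial coherence across scales."""
    return float(np.mean((aggregate(pred_fine, factor) - target_coarse) ** 2))

rng = np.random.default_rng(0)
truth_fine = rng.uniform(size=(8, 8))      # hypothetical block-level truth
target_tract = aggregate(truth_fine, 2)    # observed tract-level aggregates

# A prediction that exactly respects the aggregation constraint has zero loss
assert coherence_loss(truth_fine, target_tract, 2) == 0.0
```

Stacking one such term per level gives the "full reconstruction" strategy; dropping all but the adjacent level gives "prior layer only."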
Overall:
• Baseline NN outperforms all classical methods
• Coherent architectures outperform the baseline by a significant margin
Bin Han
VLDB 24 (prep)
Encapsulating urban dynamics:
Learning reusable representations from multi-source data
bikeshare demand
+ ML Model
Prediction
Representation Z
An Yan
SIGMOD 21
An Yan
SIGMOD 21
Incorporating Fairness
An Yan
AAAI 20
Exogenous
features matter
Our “everything
at once” model
approximates
perfect variable
selection
Optimizing
for fairness
preserves
accuracy
Oracle-selected
variables
Single variable
Equitensors
(all variables)
AI VISUALIZATION
VizDeck (SIGMOD 12, iConference 13): Generate lots of
visualizations directly from data properties and design rules; try
to learn from what users select.
(no real ML, because we didn’t know how to do it + bad vis libraries)
Viz Recommendation Trajectory…
Voyager 1 (Vis 15), 2 (Vis 17): Generate alternate visualizations, in
a principled way, from a user-created seed & report on how user
behavior changes
(no real ML, because I couldn’t convince Ham, Dom, and Jeff to do it :)
DRACO (Vis 18, Best paper): Generate visualizations according to
design rules in answer set programming, learn weights on rules
from data.
(a little ML, but we lacked good datasets for training)
Lilly (NLViz workshop 19): Vision for AI-assisted non-expert
analysis and storytelling.
(We didn’t build it then, now it’s much easier, but we still lack data)
Consistency
warnings
Zening Qu
Multi-view consistency checks
Zening Qu
Multi-view consistency checks
Afford direct navigation
of the revision space.
Hover for preview, click
to set root node.
Curated interaction log
Surj: Ontological Learning for Fast, Accurate, and
Robust Hierarchical Multi-label Classification
Sean Yang
Problem: Classify into an
ontology
Simple idea:
Embed the ontology
Embed the data
Learn the mapping
SoTA results, better
robustness, much faster
GraphLearning@WWW 22
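The "classify into an ontology" constraint can be illustrated with a toy: threshold per-node scores, then close the predicted label set under the ancestor relation so predictions respect the hierarchy. (A sketch of the problem setting only, not Surj's embedding-based method; the tiny ontology is hypothetical.)

```python
# Toy hierarchical multi-label classification: a predicted label implies
# all of its ancestors in the ontology.
parents = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "animal": None}

def ancestors(label):
    out = []
    p = parents[label]
    while p is not None:
        out.append(p)
        p = parents[p]
    return out

def predict(scores, threshold=0.5):
    """Threshold node scores, then close the label set under the
    ancestor relation so the output is hierarchy-consistent."""
    labels = {l for l, s in scores.items() if s >= threshold}
    for l in list(labels):
        labels.update(ancestors(l))
    return labels

# A confident "dog" forces "mammal" and "animal" regardless of their scores
print(predict({"dog": 0.9, "cat": 0.1, "mammal": 0.2, "animal": 0.3}))
```

Surj instead embeds the ontology and the data jointly and learns the mapping between them, but any method must produce label sets consistent in this sense.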
Sean Yang
Bernease Herman
Surj wins on almost all
existing benchmarks, but we
show the existing
benchmarks aren’t
measuring anything useful.
Tool to generate better
benchmarks
NeurIPS 22
impossible
trivial
good
57
src: wikipedia
2nd to Last slide
As models become commoditized, AI reduces to data curation
A hundred flowers will blossom – many specialized models,
trained on many specialized datasets
“Curation” means filtering, cleaning, integrating multi-modal data
from a variety of sources: web, structured DBs, generative
models, simulations, crowdsourcing.
These are core strengths for the DB community: Queries are
curation
Databases are “training set management systems”
58
src: wikipedia
Last slide
Examples of Open Questions:
• Querying for data near the decision boundary of a model
• Querying generative models at scale: Create a diverse set of
images with specific properties
• Dataset minimization: smaller dataset, same model performance
• Data-aware model minification
• Experiment management systems
• Synthetic data management systems
• Bias mitigation in queries: Return query results with specific
statistical properties
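The first open question, querying for data near a model's decision boundary, can be sketched for a linear model: score each example by its distance to the hyperplane, |w·x + b| / ‖w‖, and return the k smallest. (A toy numpy sketch; a real system would expose this as a first-class query operator over arbitrary models.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus of feature vectors and a stand-in linear model
X = rng.normal(size=(1000, 2))
w, b = np.array([1.0, -2.0]), 0.1

def near_boundary(X, w, b, k=10):
    """'Query' a dataset for the k examples nearest the model's decision
    boundary, using |w.x + b| / ||w|| as the distance score."""
    margin = np.abs(X @ w + b) / np.linalg.norm(w)
    return np.argsort(margin)[:k]

idx = near_boundary(X, w, b)
print("indices of the 10 most boundary-adjacent examples:", idx)
```

The hard part the open question points at: doing this efficiently, declaratively, and for models where the boundary has no closed form.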
59
Sean Yang
Bernease Herman
Bin Han Yiwei Yang
Maxim Grechkin Poshen Lee
Binbing Wen
graduated
Alumni
Year 1-3
Jackson Brown
Annie Yan
Shrainik Jain
Where we are…
• Sometimes we have highly curated data for supervision
– 50+ years of expert experimental curation via PDB enables
AlphaFold
– plus lots of domain expertise in the model itself
• Other times data is incomplete or disintegrated
– So: Use AI to replace missing data, disaggregate, and integrate to
complete the data record
– Also: Synthesize scenarios that don’t appear in the data record
• Improve robustness to unseen / OOD events
• Disasters, anomalies, hypothetical situations
• Other times the bias is more subtle
– Next: Unlearning bias in vision-language models
61
Curation as programming in vision models
Vision models trained on convenience samples learn societal biases
Yiwei Yang
Goal: Penalize this bias during training, without human labeling
62
A man sitting at a desk
with a laptop computer.
A woman sitting in front
of a laptop computer.
A man holding a tennis
racquet on a tennis court.
A man holding a tennis
racquet on a tennis court.
Right answer,
wrong reasons
Wrong answer,
wrong reasons
Proposed solution is
specific to gender
Two new loss terms:
1) reduce confidence
when gender
information is absent
2) increase confidence
when it is present
Women also Snowboard: Overcoming Bias in Captioning Models
Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach, CVPR 2018
nurse doctor doctor
Hypothetical “good” visual explanations
Yiwei Yang
nurse doctor nurse
Hypothetical bad visual explanations
reflecting societal biases
Yiwei Yang
Key Idea: Penalize the model for using the wrong explanations
How?
1. Curate a concept set of images representing a particular bias (e.g., gender)
(ok to use generative models, web search, or anything else you have.)
2. Penalize the model’s sensitivity to this concept set during training with a new
loss term
Benefits: No need for explicit human annotations, concept sets reusable across models,
concept sets can be application-specific – anything that can be represented by images.
Yiwei Yang
66
Interpretability Beyond Feature Attribution:
Quantitative Testing with Concept Activation Vectors
(TCAV)
Kim et al. ICML 2018 https://arxiv.org/abs/1711.11279
67
Concept: Stripes
Zebra images
A model split at layer l into an encoder and a decoder
Concept Activation Vector Sensitivity of prediction of
zebra to the concept of
stripes at layer l
(Sensitivity of class k at layer l to a concept set C )
Given an image, if we perturb its representation at layer l in the direction of the
concept set “stripes,” how much would the model’s prediction of zebra change?
Kim et al. ICML 2018 https://arxiv.org/abs/1711.11279
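That directional-derivative view can be sketched directly. Here the concept activation vector is approximated as a difference of centroids (TCAV itself trains a linear classifier to separate concept from random activations), and the decoder is a stand-in linear map; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # dimensionality of the layer-l representation

# Hypothetical layer-l activations for concept images vs. random images
concept_acts = rng.normal(loc=1.0, size=(50, d))
random_acts = rng.normal(loc=0.0, size=(50, d))

# Simplified CAV: difference of centroids, normalized to unit length
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

def decoder_logit(h):
    # Stand-in for the network's class-k logit computed from activation h
    w = np.linspace(-1, 1, d)
    return float(w @ h)

def sensitivity(h, eps=1e-4):
    """Directional derivative of the class logit along the concept direction:
    how much does the prediction change if we nudge h toward the concept?"""
    return (decoder_logit(h + eps * cav) - decoder_logit(h)) / eps

h = rng.normal(size=d)
s = sensitivity(h)
# For a linear decoder, the directional derivative equals w . cav exactly
```

Averaging the sign of this sensitivity over a class's images gives the TCAV score for that concept.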
Loss term to penalize inappropriate explanations,
e.g., “hairstyle should not be used to determine profession”:
for each x in the batch and each class logit, take the gradient of the
decoder and dot it with the vector pointing towards “hairstyle”
(centroid of images representing hairstyle minus centroid of random images)
Yiwei Yang
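The penalty can be sketched for a linear stand-in decoder, where the gradient of each logit is just its weight row: form the concept direction as a difference of centroids, then penalize the squared directional derivative of every class logit along it. (A toy numpy sketch; the dimensions and the linear decoder are assumptions, not the actual model.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 16, 3

# Hypothetical concept direction: centroid of "hairstyle" images minus
# centroid of random images, in the layer-l activation space
v = rng.normal(size=d)
v /= np.linalg.norm(v)

W_dec = rng.normal(size=(n_classes, d))   # stand-in linear decoder

def explanation_penalty(W_dec, v, lam=1.0):
    """Penalize the squared directional derivative of each class logit along
    the concept direction: profession logits should not move when the
    representation moves toward 'hairstyle'."""
    # For a linear decoder, the gradient of logit k w.r.t. h is W_dec[k]
    sens = W_dec @ v                      # one sensitivity per class logit
    return lam * float(np.sum(sens ** 2))

# A decoder orthogonal to the concept direction incurs zero penalty
W_orth = W_dec - np.outer(W_dec @ v, v)
print(explanation_penalty(W_dec, v), explanation_penalty(W_orth, v))
```

Adding this term to the task loss pushes training toward representations whose decisions are insensitive to the curated concept set, with no per-image human labels.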
69
Test: Spurious correlation by construction
Mismatched foreground and background
Matched foreground and background
Training set encourages model to attend to the background, not the bird
Dataset planted with a spurious correlation:
Mismatched foreground and background
Matched foreground and background
Red = high attention
Blue = low attention
Model ignores the bird, and misclassifies it
71
Yiwei Yang
A simple experiment
Adapted from Colored MNIST dataset, Arjovsky et al, Invariant Risk Minimization https://arxiv.org/abs/1907.02893
Number analogous to profession
Color analogous to race, gender, or any
other bias we want to unlearn
No labels available
Curate concept sets representing…
redness blueness greenness
72
Original model is totally fooled Loss term significantly improves
Yiwei Yang
[Confusion matrices for digits 0–2, prediction vs. ground truth: before vs. after adding the loss term]
Goal: Predict the correct digit based on
shape, but ignore the bias of color
Initial results: We can mitigate bias by
providing lots of images representing colors
and penalize representations that use them
Concept set: Google image search for “red”, “green”, “blue”
Open question: How much does the quality of the concept set matter?
(e.g., “Women” vs. “Woman Doctor”)
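The failure mode this experiment targets can be reproduced with a toy: make color perfectly track the label in the training split but not at test time, so a model that latches onto color is perfect in training and near chance at test. (A numpy sketch; labels and "colors" are just integers standing in for Colored MNIST.)

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n, correlated, rng):
    """Digits 0-2. In the biased (training) split, color == label;
    in the test split, color is independent of the label."""
    y = rng.integers(0, 3, size=n)
    c = y.copy() if correlated else rng.integers(0, 3, size=n)
    return c, y

c_train, y_train = make_split(300, correlated=True, rng=rng)
c_test, y_test = make_split(300, correlated=False, rng=rng)

# A "model" that only looks at color: perfect on the biased training data...
assert (c_train == y_train).mean() == 1.0
# ...but near chance (~1/3) once the spurious color cue is removed
print("test accuracy of color-only model:", (c_test == y_test).mean())
```

The concept-set penalty above is what pushes the model off the color shortcut and back onto shape.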

HILDA 2023 Keynote Bill Howe


Editor's Notes

  • #13 High-expertise means balancing multiple competing quality metrics – LLMs are making this difficult.
  • #18 Improving Data-Driven Global Weather Prediction Using Deep Convolutional Neural Networks on a Cubed Sphere, Jonathan A. Weyn, Dale R. Durran, Rich Caruana, Journal of Advances in Modeling Earth Systems, volume 12, issue 9, August 2020, https://doi.org/10.1029/2020MS002109; Sub-Seasonal Forecasting With a Large Ensemble of Deep-Learning Weather Prediction Models, Jonathan A. Weyn, Dale R. Durran, Rich Caruana, Nathaniel Cresswell-Clay, Journal of Advances in Modeling Earth Systems, volume 13, issue 7, July 2021, https://doi.org/10.1029/2021MS002502
  • #34 Why
  • #43 Eviction lawsuit court records: download, OCR, assess accuracy (75%). ML pipeline: OCR to get tokens, train a model to recognize addresses; predict race from location and surname; predict sex from first name, cross-referenced with the Social Security Administration (SSA) database and census microdata (IPUMS).