From Provenance Standards and Tools to Queries and Actionable Provenance – Bertram Ludäscher
The document discusses computational provenance and the need for tracking data lineage and workflow processes. It presents several tools and projects that aim to capture and manage provenance information, including DataONE, SKOPE, KURATOR, WHOLE-TALE, and YesWorkflow. The document argues that provenance is important for understanding what happened in computational and data-driven research in order to ensure transparency and reproducibility.
This document provides an overview of machine learning concepts including:
1. Machine learning aims to create computer programs that improve with experience by learning from data. It involves tasks like classification, regression, and clustering.
2. Data comes in different types like text, numbers, images and is generated in massive quantities daily from sources like Google, Facebook, and sensors.
3. Machine learning algorithms are either supervised, using labeled training data, or unsupervised, using unlabeled data. Common supervised techniques are decision trees, neural networks, and support vector machines, while clustering is a major unsupervised technique.
This document provides an overview of machine learning concepts. It defines machine learning as creating computer programs that improve with experience. Supervised learning uses labeled training data to build models that can classify or predict new examples, while unsupervised learning finds patterns in unlabeled data. Examples of machine learning applications include spam filtering, recommendation systems, and medical diagnosis. The document also discusses important machine learning techniques like k-nearest neighbors, decision trees, regularization, and cross-validation.
The document benchmarks 20 machine learning models on two datasets to compare their accuracy and speed. On the smaller Car Evaluation dataset, bagged decision trees, random forests and boosted decision trees achieved over 99% accuracy, while neural networks, decision stumps and support vector machines exceeded 95% accuracy. On the larger Nursery dataset, similar models exceeded 99% accuracy, while other models like decision rules and k-nearest neighbors exceeded 95% accuracy. However, models varied significantly in speed depending on the hardware, with decision trees, mixture discriminant analysis and gradient boosting as the fastest on Car Evaluation, and mixture discriminant analysis, one rule and boosted decision trees as the fastest on Nursery. The findings underscore the importance of regular benchmarking.
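The comparison described above can be reproduced in outline with a small benchmarking loop. The sketch below is illustrative only: it uses scikit-learn rather than the tooling from the original slides, assumes the Car Evaluation or Nursery data have already been loaded into a feature matrix X and label vector y (with categorical features encoded), and the four models stand in for the twenty evaluated in the deck.

```python
# Minimal accuracy-and-speed benchmark sketch (assumes X, y already loaded).
import time
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

models = {
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_estimators=200),
    "gradient boosting": GradientBoostingClassifier(),
    "SVM (RBF)": SVC(),
}

def benchmark(X, y, cv=10):
    """Return (mean accuracy, wall-clock seconds) per model over cv folds."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        elapsed = time.perf_counter() - start
        results[name] = (scores.mean(), elapsed)
    return results
```

Re-running such a loop on new hardware or library versions is exactly the kind of regular re-benchmarking the summary recommends.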
Natural Language Processing in R (rNLP) – fridolin.wild
The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials
The document discusses recent developments in the R programming environment for data analysis, including packages like magrittr, readr, tidyr, and dplyr that enable data wrangling workflows. It provides an overview of the key functions in these packages that allow users to load, reshape, manipulate, model, visualize, and report on data in a pipeline using the %>% operator.
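For readers more at home in Python, the same load, reshape, summarize pipeline style can be sketched with pandas method chaining. This is only an analogue of the magrittr/dplyr %>% idiom described above, and the file name and column names ("measurements.csv", "site", "value") are invented for illustration.

```python
# A pandas method chain standing in for a readr/tidyr/dplyr pipeline with %>%.
# File and column names are hypothetical placeholders.
import pandas as pd

summary = (
    pd.read_csv("measurements.csv")          # load (readr::read_csv)
      .dropna(subset=["value"])              # filter out missing values
      .assign(value_z=lambda d: (d["value"] - d["value"].mean()) / d["value"].std())
      .groupby("site", as_index=False)       # group_by(site)
      .agg(mean_value=("value", "mean"),     # summarize(...)
           n=("value", "size"))
      .sort_values("mean_value", ascending=False)
)
print(summary.head())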
The document discusses information visualization and data mapping. It provides examples of early information visualization works from the 1980s to 2000s. It then discusses visual perception principles like pre-attentive features and Gestalt laws that can be applied to design effective visualizations. Next, it covers different types of data like quantitative, ordinal, categorical, and network data. Finally, it discusses the differences between scientific visualization of concrete data versus information visualization of abstract data, which requires visual metaphors. The overall focus is on understanding how to map different data types to appropriate visual representations.
Facilitating Data Curation: a Solution Developed in the Toxicology Domain – Christophe Debruyne
Christophe Debruyne, Jonathan Riggio, Emma Gustafson, Declan O'Sullivan, Mathieu Vinken, Tamara Vanhaecke, Olga De Troyer.
Presented at the 2020 IEEE 14th International Conference on Semantic Computing, San Diego, California, 3-5 February 2020
Toxicology aims to understand the adverse effects of chemical compounds or physical agents on living organisms. For chemicals, much information regarding safety testing of cosmetic ingredients is now scattered in a plethora of safety evaluation reports. Toxicologists in our university intend to collect this information into a single repository. Their current approach uses spreadsheets, does not scale well, and makes data curation and querying cumbersome. Semantic technologies (e.g., RDF, OWL, and Linked Data principles) would be more appropriate for this purpose. However, this technology is not very accessible to toxicologists without extensive training. In this paper, we report on a tool that supports subject matter experts in the construction of an RDF-based knowledge base for the toxicology domain. The tool uses the jigsaw metaphor to guide the subject matter experts. We demonstrate that the jigsaw metaphor is a viable option for generating RDF. Future work includes investigating appropriate methods and tools for knowledge evolution and data analysis.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Marius Eriksen discusses Reflow, a new cloud-native workflow framework for bioinformatics. Reflow programs workflows directly using a functional programming language for simplicity and composability. It leverages lazy evaluation and caching to efficiently parallelize and distribute work across private clusters. Reflow aims to untie the hands of implementors compared to traditional workflow systems through its unified approach to programming, execution, and infrastructure.
WEKA is a collection of machine learning algorithms for data mining tasks developed in Java by the University of Waikato. It contains tools for data pre-processing, classification, regression, clustering, association rules, and feature selection. The Explorer interface in WEKA provides tools to load data, preprocess data using filters, analyze data using these machine learning algorithms, and evaluate results.
Prepares students for (and is a prerequisite for) the more advanced material they will encounter in later courses. Data structures organize data ⇒ more efficient programs.
This document discusses web-scale semantic search and knowledge graphs. It introduces the concept of semantic search, which deals with understanding the meaning of queries, terms, documents and results. This is achieved by linking text to unambiguous concepts or entities. The document then discusses knowledge graphs, which define entities, attributes, types, relations and more, and form the backbone of semantic search. It also covers tasks involved in semantic search like information extraction, entity linking, query understanding and result ranking.
Visualization of Supervised Learning with {arules} + {arulesViz} – Takashi J OZAKI
This document discusses visualizing supervised learning models using association rules and the arules and arulesViz packages in R. It shows how association rules generated from sample user activity data can be represented as graphs, allowing intuitive visualization of relationships between variables even in high-dimensional data. The visualizations are compared to results from GLMs and random forests to show how nodes are located based on their "closeness" in different supervised learning models. While less quantitative, this technique provides a more intuitive understanding of supervised learning that is useful for presentations.
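As a rough Python counterpart to the {arules}/{arulesViz} workflow described above, one can mine association rules with mlxtend and lay them out as a graph with networkx. The one-hot encoded transaction DataFrame `df` is assumed to exist, and the support/confidence thresholds are arbitrary.

```python
# Sketch of an arules/arulesViz-style analysis in Python: mine rules from a
# one-hot encoded transaction DataFrame `df` (assumed to exist), then render
# rules as a graph of antecedent -> consequent edges.
import networkx as nx
from mlxtend.frequent_patterns import apriori, association_rules

itemsets = apriori(df, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)

G = nx.DiGraph()
for _, r in rules.iterrows():
    for a in r["antecedents"]:
        for c in r["consequents"]:
            # edge weight reflects lift, mirroring arulesViz's graph method
            G.add_edge(a, c, weight=r["lift"])

nx.draw_networkx(G, with_labels=True)  # requires matplotlib to be installed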
Architectural decisions in designing data and computation intensive systems can have a major impact on the ability of these systems to perform statistical and other complex calculations efficiently. The storage, processing, tools, and associated databases, coupled with the networking and compute infrastructure, make some kinds of computations easier and others harder. This talk will provide an introduction to software and data systems components that are important for understanding how these choices impact data analysis uncertainties and costs, and thus for developing system and software designs best suited to statistical analyses.
Using PSO to optimize a logit model with TensorFlow – Yi-Fan Liou
This project aims to use particle swarm optimization (PSO), one of the evolutionary algorithms, to optimize the weights and bias of a logistic regression model using TensorFlow.
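A bare-bones version of that idea is sketched below in NumPy rather than TensorFlow (an assumption made to keep the example self-contained): each particle is a candidate (weights, bias) vector, fitness is the logistic log-loss, and the swarm follows the standard velocity/position update rule.

```python
# Minimal PSO for logistic regression (NumPy stand-in for the TensorFlow version).
# X: (n_samples, n_features) float array, y: (n_samples,) array of 0/1 labels.
import numpy as np

def log_loss(theta, X, y):
    """Logistic log-loss for parameters theta = [w_1..w_d, b]."""
    w, b = theta[:-1], theta[-1]
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def pso_fit(X, y, n_particles=30, n_iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1] + 1                      # weights + bias
    pos = rng.normal(size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([log_loss(p, X, y) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([log_loss(p, X, y) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest                               # best (weights, bias) found
```

Gradient-based training would normally be preferred for logistic regression; PSO is used here purely to illustrate the derivative-free, population-based search the project describes.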
This document discusses the evolution of systems performance analysis tools from closed source to open source environments.
In the early 2000s with Solaris 9, performance analysis was limited due to closed source tools that provided only high-level metrics. Opening the Solaris kernel code with OpenSolaris in 2005 allowed deeper insight through understanding undocumented metrics and dynamic tracing tools like DTrace. This filled observability gaps across the entire software stack.
Modern performance analysis leverages both traditional Unix tools and new dynamic tracing tools. With many high-resolution metrics available, the focus is on visualization and collecting metrics across cloud environments. Overall open source improved systems analysis by providing full source code access.
Expanded set of slides (original was 5 slides) of a short presentation on
"Advanced Tools and Techniques for Logic-Based Knowledge Representation, Process Documentation, and Data Curation".
Presented at GSLIS Research Showcase, April 3, 2015.
The document discusses data structures and their implementation in C++. It covers topics like the need for data structures to organize data efficiently, commonly used data structures like arrays, linked lists, stacks and queues, selecting appropriate data structures based on algorithm requirements, and implementing dynamic arrays in C++ using pointers and the new operator.
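The growth-by-doubling idea behind the dynamic arrays mentioned above can be sketched briefly. The slides implement this in C++ with raw pointers and the new operator; the version below conveys only the resizing logic, in Python, and is not the course's code.

```python
# Sketch of a dynamic array: the slides do this in C++ with `new T[capacity]`
# and pointer copies; here only the grow-on-overflow logic is illustrated.
class DynamicArray:
    def __init__(self):
        self._capacity = 4
        self._size = 0
        self._data = [None] * self._capacity   # stands in for `new T[capacity]`

    def append(self, item):
        if self._size == self._capacity:       # full: allocate a bigger block
            self._resize(2 * self._capacity)
        self._data[self._size] = item
        self._size += 1

    def _resize(self, new_capacity):
        new_data = [None] * new_capacity       # allocate, copy, discard old block
        for i in range(self._size):
            new_data[i] = self._data[i]
        self._data, self._capacity = new_data, new_capacity

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._data[i]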
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014) – Keiichiro Ono
This document outlines a tutorial on biological data analysis and visualization using Cytoscape. The tutorial covers basic concepts like networks and tables in Cytoscape, data import, network analysis features, and visualization techniques. It discusses loading sample network data, calculating network statistics, filtering networks, basic search functionality, and applying visual styles. The tutorial is intended to provide a practical introduction to Cytoscape's core features through examples and demos.
Bjarne Stroustrup - The Essence of C++: With Examples in C++84, C++98, C++11,... – Complement Verb
C++11 is being deployed and the shape of C++14 is becoming clear. This talk examines the foundations of C++. What is essential? What sets C++ apart from other languages? How do new and old features support (or distract from) design and programming relying on this essence?
I focus on the abstraction mechanisms (as opposed to the mapping to the machine): Classes and templates. Fundamentally, if you understand vector, you understand C++.
Type safety and resource safety are key design aims for a program. These aims must be met without limiting the range of applications and without imposing significant run-time or space overheads. I address issues of resource management (garbage collection is not an ideal answer and pointers should not be used as resource handles), generic programming (we must make it simpler and safer), compile-time computation (how and when?), and type safety (casts belong in the lowest-level hardware interface). I will touch upon move semantics, exceptions, concepts, type aliases, and more. My aim is not so much to present novel features and techniques, but to explore how C++’s feature set supports a new and more effective design and programming style.
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time... – Zhenzhong Xu
Netflix is obsessed with customer joy; we relentlessly focus on product experience and high-quality content. In recent years, we have been making heavy investments in the tech-driven studio and content production. As a result, a lot of unique challenges arise in the real-time data infrastructure space. For example, in a microservices architecture, domain entities are spread across different applications and persistence stores, which makes low-latency, consistent operational reporting and entity searching especially challenging.
In this talk, we’ll cover some interesting use cases, the various challenges rooted in the fundamentals of distributed systems, and how we solved them. We will also discuss the lessons learned, things we could’ve done differently, and the new vision towards an open, self-serving Data Mesh platform that empowers our partners and users to build flexible real-time data pipelines.
Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.
We will review some modern machine learning applications, understand a variety of machine learning problem definitions, and go through particular approaches to solving machine learning tasks.
In 2015, Amazon and Microsoft introduced services to perform machine learning tasks in the cloud. Microsoft Azure Machine Learning offers a streamlined experience for data scientists of all skill levels, from setting up with only a web browser, to using drag-and-drop gestures and simple data flow graphs to set up experiments.
We will briefly review Azure ML Studio features and run a machine learning experiment.
Reconciling Conflicting Data Curation Actions: Transparency Through Argument... – Bertram Ludäscher
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program PAF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
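To make the argumentation-framework (AF) idea concrete, here is a small, self-contained sketch that computes the grounded extension of an AF by iterating its characteristic function. It mirrors the well-founded-semantics reading mentioned in the abstract, but it is an illustration rather than the authors' actual logic-program encoding.

```python
# Grounded extension of an abstract argumentation framework (AF) via fixpoint
# iteration of the characteristic function F(S) = {a | every attacker of a is
# attacked by some argument in S}. Illustrative only; the paper encodes AFs as
# logic programs instead.
def grounded_extension(arguments, attacks):
    """arguments: iterable of argument ids; attacks: set of (attacker, target)."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}

    def defended(a, S):
        # a is acceptable w.r.t. S if each of its attackers is attacked by S
        return all(any((s, b) in attacks for s in S) for b in attackers[a])

    S = set()
    while True:
        new_S = {a for a in arguments if defended(a, S)}
        if new_S == S:
            return S                     # least fixpoint = grounded extension
        S = new_S

# Example: a attacks b, b attacks c  =>  grounded extension is {a, c}
print(grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")}))
```

In a conflicting-updates setting, arguments would correspond to proposed data cleaning actions and attacks to their conflicts; accepted arguments are the uncontroversial updates, and what remains outside every such extension is exposed to the curators.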
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion – Bertram Ludäscher
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion! – Bertram Ludäscher
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database Rules – Bertram Ludäscher
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11] Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer
[Flashback] Statelog: Integration of Active & Deductive Database Rules – Bertram Ludäscher
This document discusses Statelog, which integrates active and deductive database rules. Statelog allows both active rules, which trigger actions and modify the database, and deductive rules, which derive new facts. It defines the semantics of different types of rules and how they interact. Statelog guarantees termination of rule evaluation at both compile-time and runtime through techniques like state-stratification and delta-monotonicity. It can express complex temporal queries and supports features like nested transactions.
Answering More Questions with Provenance and Query Patterns – Bertram Ludäscher
This document discusses using provenance information to improve transparency and reproducibility in research. It begins by asking questions about the input data, methods, and parameter settings used in a study in order to assess its reliability. It then provides examples of how workflow systems can capture provenance at both the design level (prospective provenance) and runtime level (retrospective provenance). These include a Kepler workflow that simulates X-ray data collection and provenance traces captured by DataONE. The document argues that provenance is a critical link between workflow modeling and runtime traces that can increase trust in research findings.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? – Bertram Ludäscher
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Workflows to Transparent Research Objects and Reproducible Science Tales – Bertram Ludäscher
The document discusses prospective and retrospective provenance in scientific workflows. Prospective provenance involves modeling the workflow design, while retrospective provenance records the workflow execution. The YesWorkflow and noWorkflow tools demonstrate these two types of provenance. YesWorkflow annotates scripts to recreate a workflow model from the script, while noWorkflow records step-by-step runtime logs. Combining both approaches provides a more complete view of a workflow's provenance. Maintaining provenance is important for reproducibility and understanding the origins of scientific results.
From Research Objects to Reproducible Science Tales – Bertram Ludäscher
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us – Bertram Ludäscher
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
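The "capture rule firings as provenance" idea can be illustrated with a tiny Datalog-style evaluation in Python: a transitive-closure rule is applied to a fixpoint, and every rule firing is recorded so a derived fact can be traced back to its premises. This is a toy illustration, not the Possible Worlds Explorer implementation.

```python
# Toy provenance capture during Datalog-style evaluation of
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- edge(X, Y), path(Y, Z).
# Each firing is recorded so every derived fact keeps one witnessing derivation.
def transitive_closure_with_provenance(edges):
    path = {(x, y) for (x, y) in edges}
    provenance = {p: ("edge", p) for p in path}       # fact -> one derivation
    changed = True
    while changed:
        changed = False
        for (x, y) in edges:
            for (y2, z) in list(path):
                if y2 == y and (x, z) not in path:
                    path.add((x, z))
                    provenance[(x, z)] = ("rule", (x, y), (y, z))
                    changed = True
    return path, provenance

facts, prov = transitive_closure_with_provenance({("a", "b"), ("b", "c"), ("c", "d")})
print(prov[("a", "d")])   # shows which edge/path facts fired the rule
```

Collecting these firings as edges of a graph gives exactly the kind of provenance graph the talk advocates inspecting interactively, e.g. in a Jupyter notebook.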
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise – Bertram Ludäscher
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th... – Bertram Ludäscher
1) The document describes a workshop on research synthesis and reproducibility.
2) It discusses challenges with reproducibility in science and proposes provenance and conceptual tools like PRIMAD to help address these challenges.
3) The document presents a case study where an intern was able to reproduce results from a 2006 ecological niche modeling paper using the Whole Tale environment and MaxEnt software, demonstrating computational reproducibility.
Incremental Recomputation: Those who cannot remember the past are condemned ... – Bertram Ludäscher
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
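As a minimal, generic illustration of incremental recomputation (not taken from the talk), the sketch below maintains a materialized join of two relations and updates it from a delta instead of recomputing from scratch; the relation contents and attribute names are invented.

```python
# Incremental maintenance of a materialized join view V = R join S (on key).
# On an insertion into R, only deltaR join S is computed and appended to V,
# instead of re-running the whole join. Data are invented for illustration.
from collections import defaultdict

R = [("order1", "alice"), ("order2", "bob")]          # (order_id, customer)
S = [("alice", "US"), ("bob", "DE")]                   # (customer, country)

s_index = defaultdict(list)
for customer, country in S:
    s_index[customer].append(country)

# Initial (full) computation of the view.
V = [(o, c, country) for (o, c) in R for country in s_index[c]]

def insert_into_R(new_tuples):
    """Apply deltaR: join only the new R-tuples against S and extend the view."""
    delta_V = [(o, c, country) for (o, c) in new_tuples for country in s_index[c]]
    R.extend(new_tuples)
    V.extend(delta_V)
    return delta_V

print(insert_into_R([("order3", "alice")]))   # -> [('order3', 'alice', 'US')]
```

Which deltas can be exploited, and how cheaply, depends on the model of computation and on how much provenance from the previous run is remembered, which is the relation the abstract highlights.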
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations – Bertram Ludäscher
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
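A toy version of the lineage-inference problem discussed in this abstract: given a run trace of step invocations with their inputs and outputs, plus optional per-step dependency annotations (defaulting to "every output depends on every input"), compute which items a given output transitively depends on. The sketch is illustrative and is not the paper's answer-set-programming implementation; step and file names are invented.

```python
# Infer data lineage from a workflow trace. Each trace entry is
# (step, inputs, outputs); deps[step] optionally maps an output to the subset
# of inputs it really depends on (the paper's annotations). Without an
# annotation we fall back to the all-inputs assumption, which can produce
# lineage "false positives".
def lineage(trace, deps=None):
    deps = deps or {}
    parents = {}                                        # data item -> direct inputs
    for step, ins, outs in trace:
        for out in outs:
            fine = deps.get(step, {}).get(out, ins)     # annotated or all inputs
            parents.setdefault(out, set()).update(fine)

    def ancestors(item):
        seen, stack = set(), [item]
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    return ancestors

trace = [("split", ["raw.csv"], ["train.csv", "test.csv"]),
         ("fit",   ["train.csv", "params.yml"], ["model.pkl"])]
annot = {"split": {"test.csv": ["raw.csv"]}}       # test.csv depends only on raw.csv
print(lineage(trace, annot)("model.pkl"))          # {'train.csv', 'params.yml', 'raw.csv'}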
An ontology-driven framework for data transformation in scientific workflows – Bertram Ludäscher
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on “real science,” that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate semantic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, appropriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
The document describes the Whole Tale platform, which aims to facilitate reproducibility in computational research. Whole Tale allows researchers to package computational narratives, data, code, and provenance information into "tales" that can be shared and re-executed. Key features of Whole Tale include running interactive notebooks, versioning and sharing tales, and integrating provenance tracking tools to provide transparency into computational workflows. The speaker demonstrates several example tales and discusses upcoming Whole Tale features and applications in different domains like archaeology, astronomy, and materials science.
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion – Bertram Ludäscher
The document discusses two ideas: 1) Embracing multiple possible worlds by using techniques like answer set programming to represent alternative scenarios rather than a single consensus view. 2) Abandoning strict adherence to technology stacks and standards ("techno-ligion") by focusing on simple powerful solutions, using natural language when possible, and paying a fee each time a complex technical term is used. It suggests using techniques like technology golf to explore problems through minimal programs instead of lengthy debates over formal representations.
Global Situational Awareness of A.I. and where it's headed – vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to advanced queries – manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data (a short worked example follows this list).
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
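As a concrete instance of the basic-to-advanced progression outlined above, the snippet below runs a filtering/aggregation query and a window-function query against an in-memory SQLite database from Python; the table and column names are invented for illustration.

```python
# SQL practice in a self-contained script: an in-memory SQLite database with an
# invented `sales` table, one basic aggregate query and one window-function
# query (window functions need SQLite 3.25+, bundled with recent Python).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('EU', 'widget', 120), ('EU', 'gadget', 80),
        ('US', 'widget', 200), ('US', 'gadget', 150);
""")

# Basic: filtering + aggregation.
for row in conn.execute("""
        SELECT region, SUM(amount) AS total
        FROM sales
        WHERE amount > 50
        GROUP BY region
        ORDER BY total DESC"""):
    print(row)

# More advanced: rank products within each region using a window function.
for row in conn.execute("""
        SELECT region, product,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales"""):
    print(row)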
The Building Blocks of QuestDB, a Time Series Database – javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot – GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... – Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake – Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
7. Computational Provenance …
• Origin, processing history of artifacts
  – data products, figures, ...
  – also: underlying workflow
⇒ understand methods, dataflow, and dependencies
[Slide image: report cover “Climate Change Impacts in the United States”, U.S. National Climate Assessment, U.S. Global Change Research Program]
12. Related Projects: NSF DataONE (ProvONE ..) + …
• … NSF SKOPE: system and tools to discover, access, analyze, visualize paleoenvironmental data
  – unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results)
  – for researchers, tinkerers, and modelers
• … NSF Whole Tale:
  – leverage & contribute to existing CI to support the whole tale (“living paper”), from workflow run to scholarly publication
  – integrate tools & CI (DataONE, Globus, iRODS, NDS, ...) to simplify use and promote best practices
  – driven by science WGs (Archaeology/SKOPE, materials science, astro, bio ..)
14. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of the Anasazi (Four Corners; AD 600–1500). Climate change influenced Mesa Verde migrations in the late 13th century AD. The study uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates the joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
17. YW Demo Use Cases (IDCC’17)
Domain                     | Use case                                   | Programming language | Provenance methods
Climate science            | C3C4                                       | MATLAB               | YW + MATLAB RunManager
Astrophysics               | LIGO                                       | Python               | YW + NW (code-level)
Protein crystal samples    | Simulate data collection                   | Python               | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC                              | Python               | YW-recon + YW-logging
Social network analysis    | Twitter                                    | Python               | YW + NW (file-level)
Oceanography               | OHIBC Howe Sound (multi-run, multi-script) | R                    | YW + R RunManager
23. Hybrid Provenance: YW Model + Runtime Observables (file level)
• The YW model can be connected with runtime observables
• ⇒ YW recon (prov reconstruction)
• Here: what specific files were read, written, and where do they occur in the workflow?
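The script-level workflow model that YW reconstructs comes from comment annotations embedded in the script. The fragment below is a hypothetical example using YesWorkflow's documented @begin/@in/@out/@uri/@end tags; the block names, file paths, and analysis code are invented.

```python
# Hypothetical script annotated for YesWorkflow; the @begin/@in/@out/@uri/@end
# comment tags are YW's markup, while block and file names are invented.
# @begin clean_and_plot
# @in raw_data @uri file:data/raw_temps.csv
# @out figure  @uri file:results/temps_plot.png

import pandas as pd
import matplotlib.pyplot as plt

# @begin load_and_clean
# @in raw_data @uri file:data/raw_temps.csv
# @out clean_data
clean_data = (pd.read_csv("data/raw_temps.csv")
                .dropna(subset=["temperature"]))
# @end load_and_clean

# @begin plot
# @in clean_data
# @out figure @uri file:results/temps_plot.png
clean_data.plot(x="date", y="temperature")
plt.savefig("results/temps_plot.png")
# @end plot

# @end clean_and_plot
```

Matching the declared @uri templates against the files actually observed at runtime is what enables the file-level provenance reconstruction described on this slide.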
43. Adding YesWorkflow to DataONE
[Slide diagram; callouts:]
• Yaxing’s script with inputs & output products
• Christopher’s YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
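The cross-script trace on this slide can be pictured as a walk over a "derived-from" graph. The sketch below is not the DataONE/ProvONE machinery; the artifact names and edges are invented to mirror the scenario (Christopher's script consuming Yaxing's outputs), and the traversal simply collects everything upstream of a chosen result.

```python
# Minimal sketch of transitive lineage tracing over a "derived-from" graph.
# This is not the DataONE/ProvONE API; artifact names and edges are invented
# to mirror the slide's scenario (Christopher's script consumes Yaxing's outputs).

from collections import deque

# artifact -> set of artifacts it was directly derived from
derived_from = {
    "yaxing_output.csv": {"yaxing_input.csv"},         # produced by Yaxing's script
    "christopher_result.png": {"yaxing_output.csv"},   # produced by Christopher's script
}

def lineage(artifact):
    """Collect every upstream artifact reachable from `artifact`."""
    seen, todo = set(), deque([artifact])
    while todo:
        for parent in derived_from.get(todo.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen

print(lineage("christopher_result.png"))
# {'yaxing_output.csv', 'yaxing_input.csv'} -- back to Yaxing's input
```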
46. Whole Tale: What’s in a name?
(1) Whole Tale ⇔ Whole Story:
◦ Support (computational / data) scientists
◦ … along the complete research lifecycle
◦ ... from experiment to (new kind of) publication
◦ ... and back!
(2) Whole Tale ⇔ for the Long Tail of Science
– Easy sharing of your computational narratives, data, and execution environments since 2017!
– Power applications for everyone!
47. Whole Tale Vision
• Can't reproduce a result because:
• Don't know how to run the analysis
• Can't get the software running
• Can't pay for the computer or compute power the result was computed on
Source: Bryce Mecum, NCEAS (WT team)
54. Last but not least: Non-unitary syntheses of systematic knowledge
Please
@taxonbytes
Nico Franz
School of Life Sciences, Arizona State University
CIRSS Seminar – Center for Informatics Research in Science and Scholarship
February 17, 2017 – iSchool, University of Illinois Urbana-Champaign
@ http://www.slideshare.net/taxonbytes/franz-2017-uiuc-cirss-non-unitary-syntheses-of-systematic-knowledge
57. Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
"Taxonomic concept labels"
identify input concept regions
RCC–5 articulations provided
for each species-level concept
• Input visualization: MSW3 (2005) versus MSW2 (1993)
Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023
58. • Alignment visualization: "grey means taxonomically congruent"
Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
59. Legend: one name & congruent region; many names & congruent region; one name & non-congruent regions; many names & non-congruent regions; new names & exclusive regions
• Application of the coverage constraint: parent-to-parent articulations (><) are fully defined by the alignment signal propagated from their respective children.
⇒ Sensible when complete sampling of children is intended. (A set-based propagation sketch follows below.)
Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
• Alignment visualization: "grey means taxonomically congruent"
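The coverage constraint can be illustrated with a purely hypothetical example in which concepts are modeled as sets of completely sampled child labels; the parent-to-parent articulation then falls out of a set comparison over the unions of the children. The names and memberships below are invented, not taken from the MSW2/MSW3 alignment, and the helper is generic set logic rather than the Euler/X reasoner.

```python
# Toy illustration of the coverage constraint: parents are exactly the union
# of their (completely sampled) children, so the parent-to-parent articulation
# follows from a set comparison. Names and memberships are invented; this is
# generic set logic, not the Euler/X reasoner.

def rcc5(x, y):
    """RCC-5 relation between two non-empty concept extensions (as sets)."""
    if x == y:
        return "=="   # congruence
    if x > y:
        return ">"    # inclusion
    if x < y:
        return "<"    # inverse inclusion
    if x & y:
        return "><"   # overlap
    return "!"        # disjointness

# Hypothetical child-level (species-level) extensions under two classifications
genus_A_children = {"sp. one": {"x1", "x2"}, "sp. two": {"x3"}}
genus_B_children = {"sp. one+two (lumped)": {"x1", "x2", "x3"}, "sp. three": {"x4"}}

# Coverage constraint: each parent is the union of its children
parent_A = set().union(*genus_A_children.values())
parent_B = set().union(*genus_B_children.values())

print(rcc5(parent_A, parent_B))   # '<' : parent_A is properly included in parent_B
```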
60. 1 in 3 names is unreliable across MSW2/MSW3 classifications
Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023
61. The 'consensus' | the 'bible' | the (formerly) federal 'standard' | the 'best', latest regional flora
"Controlling the taxonomic variable"
Expert views are in conflict
"Just bad"
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
62. The 'consensus' | the 'bible' | the (formerly) federal 'standard' | the 'best', latest regional flora
Impact: name-based aggregation has created a novel synthesis that nobody believes in
"Controlling the taxonomic variable"
"Just bad"
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
63. The 'consensus' | the 'bible' | the (formerly) federal 'standard' | the 'best', latest regional flora
"Controlling the taxonomic variable"
"Just bad" | Expert views are reconciled
Solution: instead of aggregating an artificial 'consensus', build translation services
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
65. Agreeing to Disagree: Reconciling Conflicting Taxonomic Views using a Logic-based Approach
Yi-Yun Cheng¹, Nico Franz², Jodi Schneider¹, Shizhuo Yu³, Thomas Rodenhausen⁴, Bertram Ludäscher¹
¹School of Information Sciences, University of Illinois at Urbana-Champaign; ²School of Life Sciences, Arizona State University; ³Department of Computer Science, University of California at Davis; ⁴School of Information, University of Arizona
Acknowledgments
Support of the authors’ research through the National Science Foundation is kindly acknowledged (DEB-1155984, DBI-1342595, and DBI-1643002). The authors thank Professor Kathryn La Barre for her comments and suggestions. We would also like to thank Dr. Laetitia Navarro and Jeff Terstriep for help with creating map overlays in QGIS.
CONCLUSION
• Our logic-based taxonomy alignment approach can be used to solve crosswalking issues
– We are able to mitigate the membership-condition problems that occur in equivalent crosswalking.
• The RCC-5 approach preserves the original taxonomies while providing an alignment view
– We can solve data integration problems that occur in the more coarse-grained relative crosswalking, which is otherwise subject to information loss.
• Our study also underscores the benefits of designing different alignment workflows (bottom-up vs. top-down) to match the needs of specific taxonomy alignment problems
– Bottom-up approach: seems to work well whenever we have non-overlapping relationships at the leaf-level (lowest-level) articulations and are unsure how the higher-level concepts should be aligned.
– Top-down approach: seems favorable when there is an expectation of certain higher-level articulations, in conjunction with under-specified, complex, and often overlapping leaf-level relations.
RELATED WORK
• Taxonomy Alignment Problems (TAP)
Taxonomies T1 and T2 are inter-linked via a set of input articulations A, defined as RCC-5 relations, to yield a "merged" taxonomy T3.
• Euler/X
Articulations – constraints or rules that define a relationship (a set constraint) between two concepts from different taxonomies.
Region Connection Calculus (RCC-5)
Possible Worlds – when encoding and solving TAPs via ASP (Answer Set Programming), the different answer sets represent alternative taxonomy merge solutions, or possible worlds (PWs).
INTRODUCTION
Tina: Hey Amy, can you recommend a signature dish from where you live?
Amy: Oh, definitely the half-smokes from the Northeast! They are these tasty half-pork and half-beef sausages.
Tina: What a coincidence! We have half-smokes in the South, too! Where do you live in the Northeast? New York? Boston?
Amy: Wrong guesses! Where do you live in the South?
Tina and Amy together: Washington, D.C.
[The two of them look at each other, confused.]
“In the face of incompatible information or data structures among users or among those specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel or multiple representational forms are required…” (Bowker & Star, 2000).
CASE 1 RESULTS: CEN vs. NDC
• State-level alignments are all congruent (bottom-up)
• Inferred new articulations for regional-level alignments
CASE 2 RESULTS: CEN vs. TZ
Figure 3. (Left) CEN-NDC taxonomy alignment problem with 49 input articulations between TCEN and TNDC
Figure 4. (Right) The unique possible world (PW) T3 reconciling TCEN and TNDC via inferred relationships
Figure 1. National Diversity Council map (NDC) vs. Census Bureau map (CEN)
• GitHub link: https://github.com/EulerProject/ASIST17
• Email: yiyunyc2@illinois.edu
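As a worked check of the bottom-up result, one regional-level articulation can be recomputed directly from the state memberships given in the CEN/NDC input listing further below. The sketch below uses only those state lists; the rcc5 helper is a generic set comparison, not Euler/X itself.

```python
# Recomputing one regional-level articulation bottom-up from the state lists
# shown in the CEN/NDC input listing further below. Generic set comparison,
# not Euler/X itself; the state memberships are copied from that listing.

CEN_SOUTH = set("AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV".split())
NDC_SOUTHEAST = set("AL AR FL GA KY LA MS NC SC TN VA WV".split())

def rcc5(x, y):
    if x == y:
        return "=="   # congruence
    if x > y:
        return ">"    # inclusion
    if x < y:
        return "<"    # inverse inclusion
    if x & y:
        return "><"   # overlap
    return "!"        # disjointness

# CEN.South properly contains NDC.Southeast (DE, DC, MD, OK, TX are the difference)
print(rcc5(CEN_SOUTH, NDC_SOUTHEAST))   # '>'
```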
RESEARCH DESIGN
Step 1. Supply input taxonomies T1 and T2
Step 2. Formulate RCC-5 articulations between T1 and T2:
Congruence (X == Y), Inclusion (X > Y), Inverse Inclusion (X < Y), Overlap (X >< Y), Disjointness (X ! Y)
Step 3. Iteratively edit articulations in Euler/X:
[Diagram: T1, T2 + articulations A → Euler/X → N possible worlds; N=1 yields the merged taxonomy T3; N=0 (inconsistent) or N>1 (ambiguous) → add/edit articulations A and re-run]
Figure 2. The process of aligning taxonomies T1 and T2 with Euler/X
Figure 5. Top-down input alignments between TCEN and TTZ
Figure 6. The unique PW for the TCEN with TTZ alignment
Figure 10. Combined concepts solution for TCEN and TTZ
taxonomy CEN Census_Regions
(USA Northeast Midwest South West)
(Northeast CT MA ME NH NJ NY PA RI VT)
(Midwest IL IN IA KS MI MN MO NE ND OH SD WI)
(South AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV)
(West AZ CA CO ID MT NV NM OR UT WA WY)
taxonomy NDC National_Diversity_Council
(USA Midwest Northeast Southeast Southwest West)
(Northeast CT DC DE MD MA ME NH NJ NY PA RI VT)
(Midwest IA IL IN KS MI MN MO ND NE OH SD WI)
(Southeast AL AR FL GA KY LA MS NC SC TN VA WV)
(Southwest AZ NM OK TX)
(West CA CO ID MT NV OR WA WY UT)
articulations CEN NDC
[CEN.AL equals NDC.AL]
[CEN.AR equals NDC.AR]
[CEN.AZ equals NDC.AZ]
[CEN.CA equals NDC.CA]
[CEN.CO equals NDC.CO]
[CEN.CT equals NDC.CT]
[CEN.DC equals NDC.DC]
[CEN.DE equals NDC.DE]
[CEN.FL equals NDC.FL]
[CEN.GA equals NDC.GA]
[CEN.IA equals NDC.IA]
[CEN.ID equals NDC.ID]
[CEN.IL equals NDC.IL]
[CEN.IN equals NDC.IN]
[CEN.KS equals NDC.KS]
[CEN.KY equals NDC.KY]
[CEN.LA equals NDC.LA]
[CEN.MA equals NDC.MA]
[CEN.MD equals NDC.MD]
[CEN.ME equals NDC.ME]
[CEN.MI equals NDC.MI]
[CEN.MN equals NDC.MN]
...
taxonomy CEN Census_Regions
(USA Midwest South West Northeast)
taxonomy TZ Time_Zone
(USA Pacific Mountain Central Eastern)
articulations CEN TZ
[CEN.Midwest disjoint TZ.Pacific]
[CEN.Midwest overlaps TZ.Eastern]
[CEN.Midwest overlaps TZ.Mountain]
[CEN.Northeast is_included_in TZ.Eastern]
[CEN.South disjoint TZ.Pacific]
[CEN.South overlaps TZ.Central]
[CEN.South overlaps TZ.Eastern]
[CEN.South overlaps TZ.Mountain]
[CEN.USA equals TZ.USA]
[CEN.West disjoint TZ.Central]
[CEN.West disjoint TZ.Eastern]
[CEN.West overlaps TZ.Mountain]
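The listings above use a compact, declarative text format: a `taxonomy` header followed by parenthesized (Parent Child ...) groups, and bracketed [X relation Y] articulations. As a rough illustration of how that format maps onto data structures, here is a hedged parser sketch; it assumes each group and each articulation fits on one line, so the real Euler/X input files and parser may differ in detail.

```python
# Rough parser for the simplified taxonomy/articulation listing shown above.
# It assumes one parenthesized (Parent Child1 Child2 ...) group per line and
# one [X relation Y] articulation per bracket; real Euler/X input may differ.

SAMPLE = """
taxonomy CEN Census_Regions
(USA Midwest South West Northeast)
taxonomy TZ Time_Zone
(USA Pacific Mountain Central Eastern)
articulations CEN TZ
[CEN.Midwest disjoint TZ.Pacific]
[CEN.Northeast is_included_in TZ.Eastern]
[CEN.USA equals TZ.USA]
"""

def parse(text):
    taxonomies, articulations, current = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("taxonomy"):
            current = line.split()[1]                 # e.g. 'CEN'
            taxonomies[current] = {}
        elif line.startswith("articulations"):
            current = None
        elif line.startswith("(") and current:
            parent, *children = line.strip("()").split()
            taxonomies[current][parent] = children
        elif line.startswith("["):
            left, rel, right = line.strip("[]").split()
            articulations.append((left, rel, right))
    return taxonomies, articulations

taxa, arts = parse(SAMPLE)
print(taxa["CEN"]["USA"])   # ['Midwest', 'South', 'West', 'Northeast']
print(arts[0])              # ('CEN.Midwest', 'disjoint', 'TZ.Pacific')
```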
66. Two Taxonomies: NDC vs CEN
“…in the face of incompatible information or data structures among users or among those specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel or multiple representational forms are required” [Bowker & Star, 2000, p.159]
National Diversity Council map (NDC) vs. US Census Bureau map (CEN)
Source: Yi-Yun (Jessica) Cheng (PhD student, iSchool @ Illinois)
72. How we align two taxonomies T1 and T2
• Step 1. Supply input taxonomies T1 and T2
• Step 2. Describe the relationships between T1 and T2
• Step 3. Iteratively edit articulations in Euler/X (see the loop sketch below)
[Diagram: T1, T2 + articulations A → Euler/X → N possible worlds; N=1 yields the merged taxonomy T3; N=0 (inconsistent) or N>1 (ambiguous) → add/edit articulations A and re-run]
• … but where do the articulations come from?
– expert opinion
– automatically derived from data
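The Step 3 loop can be summarized as control flow: run the reasoner, count the possible worlds, and keep editing articulations until exactly one world remains. The sketch below captures only that control flow; `solve` and `ask_expert_to_edit` are stubs standing in for Euler/X's ASP-based reasoning and for the human (or data-driven) articulation edits, not real APIs.

```python
# Control-flow sketch of the Step 3 loop. `solve` and `ask_expert_to_edit` are
# stubs standing in for Euler/X's ASP-based reasoning and for the human (or
# data-driven) articulation edits; they are not real APIs.

def solve(t1, t2, articulations):
    """Stub reasoner: pretend a non-empty articulation set yields one merge."""
    return ["merged taxonomy T3"] if articulations else []

def ask_expert_to_edit(articulations, n_worlds):
    """Stub for adding/editing RCC-5 articulations after an N=0 or N>1 outcome."""
    return articulations + [("T1.X", "overlaps", "T2.Y")]

articulations = []                       # Step 2 output would go here
for _ in range(10):                      # bound the loop for the sketch
    worlds = solve("T1", "T2", articulations)
    if len(worlds) == 1:                 # unique possible world: accept as T3
        print("T3 =", worlds[0])
        break
    # N = 0 (inconsistent) or N > 1 (ambiguous): revise articulations and re-run
    articulations = ask_expert_to_edit(articulations, len(worlds))
```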
84. Implications
• Logic-based taxonomy alignment approach
– Disambiguates name-based taxonomy alignment over time
• 40% of the concepts in biology taxonomies undergo name changes over time (Franz et al., 2016)
– May mitigate problems in equivalent crosswalking
• The membership-condition problem that is often criticized in crosswalking
– Preserves the original taxonomies while providing an alignment view
• Solves data integration problems that occur in the more coarse-grained relative crosswalking
https://github.com/EulerProject/ASIST17
yiyunyc2@illinois.edu
85. Some History
• … Aristotle …
• … Euler …
• …
• … Greg Whitbread …
• [BPB93] J. H. Beach, S. Pramanik, and J. H. Beaman. Hierarchic taxonomic databases. In Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, 1993.
• [Ber95] Walter G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44:207–212, 1995.
• [Ber03] Walter G. Berendsohn. MoReTax – Handling Factual Information Linked to Taxonomic Concepts in Biology. No. 39 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, 2003.
• [GG03] M. Geoffroy and A. Güntsch. Assembling and navigating the potential taxon graph. In [Ber03], pages 71–82, 2003.
• [TL07] D. Thau and B. Ludäscher. Reasoning about taxonomies in first-order logic. Ecological Informatics, 2(3):195–209, 2007.
• [FP09] N. M. Franz and R. K. Peet. Perspectives: towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1):5–20, 2009.
• …