3. JAVIER OTEGUI & ARTURO H. ARIÑO: HIGHLIGHTING FITNESS-FOR-USE OF PUBLISHED BIODIVERSITY DATA. TDWG2012, BEIJING, 22-X-2012
PUBLISHING DATA
• Data published in papers
• Data papers published
• Data published
WHAT, WHERE, WHEN
OUR TARGET DATA
Primary Biodiversity Data Record
PBR
WHAT, WHERE, WHEN
PBR
Megaptera novaeangliae
Adult female, live
Off North Truro, MA, USA
42.101 N, 70.169 W
2010.09.29 21:47 GMT
Arturo H. Ariño
Aboard Dolphin VI
Canon Eos 450D, 200 mm lens
GBIF GEOREFERENCED DATA
237,348,923 animal data records by Oct. 2012 (total georeferenced records: 327,048,532)
THE CASE FOR DIRECT DATA PUBLICATION
• Access to massive data increasingly commonplace: GBIF
• Spectrum of possible uses increasing: new science, new paradigms
• Data-Intensive Science
– Reliance on good data: Opportunity for discovery
– Reliance on bad data: Risk of “undiscoveries”
WHAT, WHERE, WHEN
PBR
Nautilus pompilius
4 specimens
Off Palau Islands
1921
Legit: unknown
Det.: J.A. Salinas
Collection: JDR at MZNA
BARRIERS TO DIRECT DATA PUBLICATION
• Data availability
• Data sharing mechanisms
• Data publication incentives
• Data quality
DATA AVAILABILITY INCREASE
GBIF, October 2012
Ariño, 2010. Biodiv. Informat. 7: 15-26
ESTIMATED DATA IN MISSING COLLECTIONS
[Figure: sources labeled BCI, GBIF, GSAP-DNHC survey, unknown; est. CI. Cexp = 8.37K; Nexp = 2.01G]
WHAT, WHERE, WHEN
PBR
Saccharomyces cerevisiae
TCP1-beta
Cask in wreck of ARGO
2 km E of Akta Képhalos
Stratum 2000 BC
Legit: Homer S.
Det.: LoScanSQ-X
Collection: Museum of Beer History (MBH)
??
THE TRUST PARADOX
• Papers are generally more trusted than raw/downloadable data
– Papers have gone through peer review
• Published data have common sources:
– Experiments,
– Observations,
– Digitizations
• Raw data in published papers can go unreviewed
– Review focuses on soundness, methods, conclusions
– Data assumed to be true & correct
• Direct publication of data should, in fact, make review easier
– Enforcing rules
– Filtering
– Pattern detection
FITNESS-FOR-USE
• FFU defines whether data can be used for a specific purpose
• Useful compromise for publishing data
• FFU is not the same as data quality
Quality | Fitness-for-use
Intrinsic to data | Depends on intended use
Conceptual | Pragmatic
Good quality predicts good FFU | Good FFU does not predict good quality
FFU ASSESSMENT
• In 2006 AHA started analyzing our own DB for FFU, creating pattern-detection visualizations
– First reported at TDWG-2006 (St. Louis)
• In 2008 we started to analyze raw & processed GBIF data (2.4G records by 2012) (JOT’s thesis)
– Building on work by Chapman, Yesson, Wieczorek, etc., changing scope and perspective
• Started producing reports in 2009, 2010
• Teamed up with GBIF-Sec, 2011
• Created BIDDSAT, 2012
DATA INDEXING
[Diagram: Providers A, B, C and D; GBIF index; ?]
DATA QUERYING
[Diagram: Providers A, B, C and D; GBIF index; ?]
Detailed assessment of GBIF
[Figure legend: Bad data / Good data]
Otegui, Ariño, Gaiji & Chavan, in press
CONTROL AND FFU TOOLS AT INDEXING
Gaiji et al., 2011-2012: EMBARGOED DECEMBER 2012
CORRECTING MECHANISMS: EXAMPLES
• GBIF has implemented many georeferencing correction algorithms, such as coordinate/country match
• This removes many bogus data points, for example by redressing reversed lat/long when serving data
• Still, original data need to be corrected: GBIF cannot alter original data (only tag them)
David Remsen, TDWG-2011. In ViBRANT, http://vbrant.eu/content/gbif-integration
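A coordinate/country match with a reversed lat/long check could be sketched roughly as below. This is only an illustration, not GBIF's actual algorithm: the bounding boxes, the `check_coordinates` function and the tag strings are all assumptions made for the example, and a real check would use proper country polygons. The original values are never altered; the function only returns the served coordinates plus a tag.

```python
from typing import Dict, Tuple

# Hypothetical, very rough bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Illustrative values only; real checks would use country polygons.
BOXES: Dict[str, Tuple[float, float, float, float]] = {
    "ES": (36.0, 43.8, -9.3, 3.3),
    "US": (24.5, 49.4, -125.0, -66.9),
}


def in_box(lat: float, lon: float, box: Tuple[float, float, float, float]) -> bool:
    """True if the point falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinates(lat: float, lon: float, country: str) -> Tuple[float, float, str]:
    """Return (served_lat, served_lon, tag) without touching the original record."""
    box = BOXES[country]
    if in_box(lat, lon, box):
        return lat, lon, "ok"
    # If swapping the two values puts the point in the stated country,
    # serve the swapped pair and tag the record.
    if in_box(lon, lat, box):
        return lon, lat, "lat/long probably reversed"
    return lat, lon, "coordinate/country mismatch"


print(check_coordinates(-70.169, 42.101, "US"))
```

The tag travels with the served record, so the publisher can still be told that the source data need fixing.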
FILTER MECHANISMS: EXAMPLE
2010.04.28
2011.14.09 ErrCode: 10
10: Fields “month” and “day” probably swapped
• Original data unchanged
• Index entry corrected
• Error entry generated in issue log
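The three steps in this example (leave the original untouched, correct the index entry, log an issue) might look like the following minimal sketch. The error code 10 and its message come from the slide; the `index_date` function, the `YYYY.MM.DD` parsing and the shape of the issue log are assumptions for illustration, not the actual GBIF indexing code.

```python
from datetime import date
from typing import List, Optional, Tuple

# Error code and message as shown in the slide's example.
ERR_MONTH_DAY_SWAPPED = 10
MSG_MONTH_DAY_SWAPPED = "Fields “month” and “day” probably swapped"


def index_date(raw: str) -> Tuple[Optional[date], List[Tuple[int, str]]]:
    """Parse a 'YYYY.MM.DD' string into an index entry plus an issue log.

    The raw string is only read, never modified.
    """
    year, month, day = (int(part) for part in raw.split("."))
    issues: List[Tuple[int, str]] = []
    try:
        return date(year, month, day), issues
    except ValueError:
        pass
    try:
        # Retry with month and day exchanged.
        fixed = date(year, day, month)
    except ValueError:
        return None, issues  # not fixable by this filter
    issues.append((ERR_MONTH_DAY_SWAPPED, MSG_MONTH_DAY_SWAPPED))
    return fixed, issues


for raw in ("2010.04.28", "2011.14.09"):
    print(raw, index_date(raw))
```

Note that such a filter only catches swaps that produce an impossible date; a record such as 2011.09.04 that was really meant as 4 September or 9 April passes silently.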
FILTERS CANNOT GET ALL
[Figure, scale from - to +. Otegui et al., 2012]
FILTERS CANNOT SOLVE ALL
Otegui, Ariño, Gaiji & Chavan, in press
[Diagram, modified from Otegui et al., in press. Categories: all GBIF data; some date element wrong; some date element missing; all date elements missing]
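As a rough illustration of how records could be sorted into the date categories named in this diagram, consider the sketch below. The `classify_date` function, the use of `None` for a missing element, and the rule that "missing" takes precedence over "wrong" are assumptions for the example; the source does not describe how the categories were computed.

```python
from collections import Counter
from datetime import date
from typing import Optional


def classify_date(year: Optional[int], month: Optional[int], day: Optional[int]) -> str:
    """Assign a record's date to one of the categories in the diagram."""
    parts = (year, month, day)
    if all(p is None for p in parts):
        return "all date elements missing"
    if any(p is None for p in parts):
        # Assumption: a partly missing date is reported as missing,
        # even if one of the remaining elements is also out of range.
        return "some date element missing"
    try:
        date(year, month, day)
    except ValueError:
        return "some date element wrong"
    return "valid"


records = [
    (2010, 9, 29),       # full, valid date
    (2011, 14, 9),       # month out of range
    (1921, None, None),  # year only
    (None, None, None),  # nothing
]
print(Counter(classify_date(*r) for r in records))
```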
BIDDSAT
• Tool to detect space-time and other patterns
• Applicable to data publishers sharing data through GBIF
• Uses tailored visualizations
• http://www.unav.es/unzyec/mzna/biddsat/
• Open source: https://github.com/jotegui/BIDDSAT
• Bioinformatics, DOI: 10.1093/bioinformatics/BTS359
DATA COMPLETENESS
[Histogram: number of collections (0 to 60) by percentage of completeness (0 to 100). Source: BIDDSAT]
DATA COMPLETENESS
[Histogram: number of collections (0 to 60) by percentage of completeness (0 to 100). Source: BIDDSAT]
• Wrong implementation of exchange standards (DwC): solvable
• Data loss: not solvable
• Limited room for improvement
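One simple way to obtain a percentage of completeness per collection is to count how many of a chosen set of fields are actually filled in. The sketch below assumes that definition, which may differ from the one BIDDSAT uses; the `completeness` function is hypothetical, and the field names are Darwin Core terms picked for illustration.

```python
from typing import Dict, List, Optional

# Darwin Core terms chosen for illustration (what, where, when).
FIELDS = ["scientificName", "decimalLatitude", "decimalLongitude", "eventDate"]


def completeness(records: List[Dict[str, Optional[object]]], fields: List[str]) -> float:
    """Percentage of (record, field) cells that are neither None nor empty."""
    if not records or not fields:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in fields
        if record.get(field) not in (None, "")
    )
    return 100.0 * filled / (len(records) * len(fields))


collection = [
    {"scientificName": "Nautilus pompilius", "eventDate": "1921",
     "decimalLatitude": None, "decimalLongitude": ""},
    {"scientificName": "Megaptera novaeangliae", "eventDate": "2010-09-29",
     "decimalLatitude": 42.101, "decimalLongitude": -70.169},
]
print(completeness(collection, FIELDS))
```

A field that is empty because the exchange standard was mapped wrongly and one that is empty because the information was never recorded score the same here, which is why the two causes listed above have to be told apart by other means.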
Data Provider LEONIDAS, Resource SHIELD
GBIF 2008/05 Version
Data Provider LEONIDAS, Resource SHIELD
GBIF 2009/09 Version
[Figure: Chronhorogram. Circular axis: day of year, from 1/Jan to 31/Dec, with the first day of each month and the seasons (Winter, Spring, Summer, Fall) marked. Radial axis: year, 1750 to 2012. Scale from - to +. Introduced by Ariño & Otegui, 2008, TDWG]
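Reading the axis labels of this figure, a record's position in such a plot can be computed from its date alone. The sketch below assumes that the angle encodes the day of the year and the radius encodes the year between 1750 and 2012; the `polar_position` function and the normalization to a 0 to 1 radius are assumptions for illustration, not BIDDSAT's implementation.

```python
import calendar
import math
from datetime import date
from typing import Tuple

FIRST_YEAR = 1750  # inner edge of the radial axis, from the figure labels
LAST_YEAR = 2012   # outer edge of the radial axis, from the figure labels


def polar_position(d: date) -> Tuple[float, float]:
    """Map a date to (angle in radians, radius in 0..1)."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday  # 1 for 1 January
    theta = 2 * math.pi * (day_of_year - 1) / days_in_year
    radius = (d.year - FIRST_YEAR) / (LAST_YEAR - FIRST_YEAR)
    return theta, radius


print(polar_position(date(2010, 9, 29)))
```

Counting records per (angle, radius) cell would then give the density that the - to + scale appears to represent.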
Source: BIDDSAT
[Figure: Hebdogram, scale from - to +. Introduced by Ariño & Otegui, 2008. Proceedings of TDWG]
Ariño, Otegui & Robles, 2009
Provider 180
All datasets
Data Provider Codename: BORODIN
2008/05
Data Provider Codename: BORODIN
2009/09 and 2008/05
Treemap by Google Charts API on authors’ data
Fungi
INDEX TAXONOMY
Gaiji et al. in press
PATTERNS OF PATTERNS
• Patchy data publishing… also in papers
• Opportunistic behavior: “Low-hanging fruit”
• Data can (and will) evolve
• The human factor still counts
PAPER WOES: PBR FROM LITERATURE
[Chart: imprecision class (0 to 25) against distance class (0 to 15)]
[Bar chart: number of records (0 to 40000) by taxon: Chordata, Orthoptera, Lepidoptera, Hymenoptera, Diptera, Coleoptera, Thysanoptera, Collembola, Acari, Polychaeta, Oligochaeta, Nematoda. Legend: georeferenced; locality without coordinates; no locality]
A MATTER OF CONVENIENCE
Otegui, Robles & Ariño, 2009. eBiosphere, London, UK.
Publisher: Swedish
Publisher: German
Publisher: French
Publisher: British
Publisher: Norwegian
Publisher: Parisian
Publisher: Spanish
CLASSIFICATION ACCORDING TO:
Ariño, Otegui & Robles, 2009
PROVIDERSP2K
GBIF RECORDS
SAMPLE
EVOLUTIONARY DATA