Community web sites: small pieces loosely joined


A presentation given by Dave Roberts and coauthored by David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith. It was given at the Fourth Metadata and Semantics Research Conference (MTSR 2010) at Alcalá de Henares, Madrid, in the premises of the Faculty of Law.

Published in: Technology

  1. ViBRANT Virtual Biodiversity. Community web sites: small pieces loosely joined. Dave Roberts, David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith. SEVENTH FRAMEWORK PROGRAMME e-infrastructure
  2. ViBRANT Virtual Biodiversity
  3. Small pieces loosely joined. Has many potential meanings: joining contributors together to form communities; joining together the data that go towards forming a Scratchpad; joining Scratchpad content with the landscape of biodiversity informatics data on the web.
  4. Addressing the challenges of taxonomy. Goal: inventory the Earth’s species; document their relationships; “publish” & apply these data. Data set: 1.8M described spp. (10M names); 300M pages (over the last 250 years); 1.5-3B specimens. People: 4-6,000 taxonomists; 30-40,000 “pro-amateurs”; many more citizen scientists?
  5. I. The technology must largely embody the cause–effect relationship connecting problem to solution. II. The effects of the technological fix must be assessable using relatively unambiguous or uncontroversial criteria. III. Research and development is most likely to contribute decisively to solving a social problem when it focuses on improving a standardized technical core that already exists. Sarewitz and Nelson (2008). Three rules for technological fixes. Nature, 456: 871-872.
  6. Biodiversity - a kind of washing powder? When 2010 was named as the "year of biodiversity" by the UN, it began with a plea to save the world's ecosystems. UN Secretary-General Ban Ki-moon said: "Biological diversity underpins ecosystem functioning... its continued loss, therefore, has major implications for current and future human well-being." Recently, members of the public were asked what biodiversity is. The most common answer was "some kind of washing powder". 15 October 2010.
  7. Addressing the challenges of biodiversity informatics. “…the field [of biodiversity informatics] appears to be growing in a void of overarching, motivating questions, effectively making it a set of technologies in search of questions to address.” Peterson et al., Syst. & Biodiv., 2010.
  8. Scratchpads. Hosted websites for taxonomists. Research & publication platform. Modular (Drupal) & flexible. Supports the taxonomic workflow. Bottom-up design, agile dev. Ecosystem of communities (185). 2,350+ users (unpaid) from 2007. ViBRANT follow-on, €4.75M.
  9. Taxonomy & Literature. DNA, Phylogeny & Specimens. 2.3k users, 58 countries, 268k pages. 185 "Virtual Research Communities". EDIT, GBIF, NHM & EOL. Platform for biodiversity research & data publication. eBooks. eJournals. Changing the nature of collaboration. Expanding opportunities to participate in science. Image Galleries. Societies & Organizations.
  10. A website for you & your community. Magic. Your data. Your web site.
  11. Taxonomy import, management and navigation
  12. Reference manager / Endnote support for bibliographies
  13. Image galleries, image upload & annotation
  14. Nexus / Newick import for visualizing phylogenies
  15. Molecular & morphological character matrices (discrete, morphometric and text characters)
  16. Presence / absence country maps
  17. Specimen & location records (DwC)
  18. Static web pages. Web fora with e-mail integration. Newsletters with e-mail integration. User blogs.
  19. Import from CSV text file to any content type
  20. ViBRANT Products. A Virtual Research Environment (Scratchpads) where users can safely store, share and manage their research information. Analytical services for users to build identification keys and phylogenetic trees. A publication platform for users to automatically compile taxonomic manuscripts from their research database. A portal for users to centrally access publicly accessible biodiversity research information and literature. Training, support & sociological study, helping research communities to use these tools and services. A standards-compliant technical architecture that can be sustained by the biodiversity research community.
  21. The “chromosome” [diagram of the ViBRANT work packages arranged around the Scratchpads Virtual Research Environment. Work-package labels recoverable from the figure: WP2. Architecture; WP3. Training; WP4. Standards; WP5. Data; WP6. Publishing; WP7. Literature; WP8. Mobilisation. The individual component labels are interleaved in the extracted text and cannot be reliably reassembled.]
  22. Biodiversity literature looks like this. Cues: indented text; UPPER CASE TEXT; bold text; italic text; Latin; keywords; symbols.
  23. Adobe Reader has this: M BRITISH MUSEUM (NATURAL HiSi 26JU PRESENTED GENERAL UC.-lARY Bulletin ofthe BritishMuseum (Natural History) The ichneumon-fly genus Banchus in the OldWorld (Hymenoptera) M. G. Fitton series Entomology Vol51 Nol 25 July 1985
  24. Lura (BHL) has this: M BRITISH MUSEUM (NATURAL HiSi 26 JU PRESENTED GENERAL UC.-lARY Bulletin of the British Museum (Natural History) The ichneumon-fly genus Banchus (Hymenoptera) in the Old World M. G. Fitton Entomology series Vol51 Nol 25 July 1985
  25. But choice of XML schema is important. ABBYY XML is very detailed. This line of text has 202 bytes: “The Bulletin of the British Museum (Natural History), instituted in 1949, is issued in four scientific series, Botany, Entomology, Geology (incorporating Mineralogy) and Zoology, and an Historical series.” To encode this line in ABBYY XML format requires 45,533 bytes. There are 84,263 lines in the document from which this example was taken.
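The three figures quoted on slide 25 imply a large markup overhead. The following back-of-the-envelope sketch in Python uses only those numbers. The whole-document estimate rests on an assumption the slide does not make, namely that every line costs roughly as much as the example line, so it should be read as an order-of-magnitude guess only.

```python
# Figures quoted on the slide.
PLAIN_BYTES = 202        # the example line as plain text
ABBYY_BYTES = 45_533     # the same line encoded as ABBYY XML
LINES_IN_DOC = 84_263    # lines in the source document


def overhead_ratio(plain: int, encoded: int) -> float:
    """Return how many encoded bytes are spent per byte of plain text."""
    return encoded / plain


ratio = overhead_ratio(PLAIN_BYTES, ABBYY_BYTES)

# ASSUMPTION (not from the slide): every line carries about the same
# overhead as the example line.
estimated_total_bytes = ABBYY_BYTES * LINES_IN_DOC

print(f"ABBYY XML is about {ratio:.0f} times larger than the plain text")
print(f"Rough whole-document estimate: {estimated_total_bytes / 1e9:.1f} GB")
```

The ratio of roughly 225 to 1 is the point of the slide: a schema that records every character position costs far more than one that records only text and basic layout.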
  26. Look for taxon names. Used the uBio FindIT web service. Overall excellent, especially as it adds the NameBank ID. But still some oddities. Genus = ‘The’: “The scutellum”, “The primitive”. Species or Author = ‘and’: “Exetastes and”, “B[anchus] falcatorius and”.
  27. Look for paragraph types. Simple keyword matching: surprisingly effective! Issue: can identify the start of a paragraph type, but not the end… Follow-up work: punctuation; concepts.
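The keyword matching described on slide 27 might look like the minimal Python sketch below. The cue words and type labels are hypothetical (the slide does not list the keywords actually used), and the function names are illustrative only. The second function makes the stated issue concrete: a cue marks where a paragraph type starts, but nothing marks where it ends, so the sketch simply carries the last label forward until another cue appears.

```python
import re
from typing import List, Optional, Tuple

# HYPOTHETICAL cue words; the slide does not say which keywords were used.
PARAGRAPH_CUES = {
    "description": ("description", "female", "male"),
    "distribution": ("distribution", "hab"),
    "material": ("material examined", "specimens examined"),
}


def paragraph_type(paragraph: str) -> Optional[str]:
    """Return a label if the paragraph opens with a known cue, else None."""
    head = paragraph.strip().lower()
    for label, cues in PARAGRAPH_CUES.items():
        for cue in cues:
            if re.match(rf"{re.escape(cue)}\b", head):
                return label
    return None


def label_paragraphs(paragraphs: List[str]) -> List[Tuple[Optional[str], str]]:
    """Label each paragraph, carrying the previous label forward.

    Carrying forward is an assumption: with start cues only, there is no
    signal that a section has ended before the next cue is seen.
    """
    labelled = []
    current = None
    for paragraph in paragraphs:
        found = paragraph_type(paragraph)
        if found is not None:
            current = found
        labelled.append((current, paragraph))
    return labelled
```

For example, `label_paragraphs(["Description. Head black.", "Antennae long.", "Hab. Mexico."])` labels the second paragraph as part of the description, which may or may not be right.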
  28. Look for other proper names. Biologia Centrali-Americana has a gazetteer; most journals do not. Generic solution = OpenCalais. Good accuracy. Old countries: D.D.R., West Germany. Continents: America.
  29. Ambiguities and Mis-identifications. New York: City, State. Washington: City, State. Lake George: City. Lake Victoria: City. Other Oddities. Persons: surname only. Two part names: Van Veen, van Veen. Regions and Continents: East Africa, Africa.
  30. Negative spell checking. Go beyond stop words. Remove everything not in a spell dictionary. Check: Minor, Vulgar. Bulletin 27 from the Zoology Series reduced from 139,034 to 5,219 words.
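A minimal Python sketch of negative spell checking follows. The wording on slide 30 can be read either way, so the sketch states its assumption: words the dictionary recognises are discarded, and the unrecognised remainder (where taxon names and Latin terms would be expected) is kept. The tiny dictionary is a stand-in for a real spell-checker word list, and the function name is illustrative.

```python
import re
from typing import List, Set


def negative_spell_check(text: str, dictionary: Set[str]) -> List[str]:
    """Keep only the words that the dictionary does NOT recognise.

    ASSUMPTION: dictionary words are noise and unknown words are the
    candidates of interest. `dictionary` must hold lower-case words.
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    return [word for word in words if word.lower() not in dictionary]


# Stand-in word list; a real run would load a full spell dictionary.
TOY_DICTIONARY = {"the", "genus", "is", "a", "minor", "and", "vulgar", "group"}

remaining = negative_spell_check(
    "The genus Banchus is a minor and vulgar group", TOY_DICTIONARY
)
print(remaining)
```

Words such as "minor" and "vulgar" are removed here because they are ordinary English words. That may be why the slide flags them under "Check": a word that is both in the dictionary and meaningful in a taxonomic text would be lost by this filter.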
  31. Ligatures INTRODUCTION. Volume, one of five required for the enumeration of the Rhynchophora, was THIS commenced by Dr. Sharp in 1889 and is now concluded by myself. The study of the " Otiorhynchinœ Alatse " has unfortunately been delayed for many years, during the publication of Vol. IV. parts 4, 5, and 7, all of which are devoted to the Family Curculionidœ. The present Volume, IV. part 3, includes the Subfamilies Attelabinae, Pterocolinœ, Allocoryninee, Apioninœ, Thecesterninae, and Otiorhynchinre. The Attelabinae are represented by 104 (88 new), the Pterocolinse by three (all new), the Allocoryninse (a new subfamily) and Thecesterninse each by one, the Apioninae by 88 (84 new), and the Otiorhynchinae by 419 (340 new) species respectively; the total number for the six subfamilies being 616 species, with 516 new, and forty new genera. Amongst the 419 Otiorhynchinae, the apterous and winged forms are almost equal in number, there being a preponderance of apterous terrestrial species (Eupagoderes, Epicœrus, Epayriopsis, &c.) in the arid portions of Mexico and the winged forms ÇExophthalmuS) &c.) becoming relatively more numerous in the forest regions southward. Taking the Curculionidœ as a whole—the subfamilies Curculioninae and Calandrinse, in addition to those worked out in the present Volume,—the number of species enumerated altogether from Central America is as follows :— Vol. IV. part 3, 616; IV. part 4, 1365; IV. part 5, 908; IV. part 7, 344 : total 3233. The three other families of Rhynchophora—the Brenthidae, Scolytidae, and
  32. Ligatures (printed => OCR): Otiorhynchinæ => Otiorhynchinœ; Thecesterninæ => Thecesterninse; Alatæ => Alatse; Apioninæ => Apioninae; Curculionidæ => Curculionidœ; Otiorhynchinæ => Otiorhynchinae; Attelabinæ => Attelabinae; Otiorhynchinæ => Otiorhynchinae; Pterocolinæ => Pterocolinœ; Curculionidæ => Curculionidœ; Allocoryninæ => Allocoryninee; Curculioninæ => Curculioninae; Apioninæ => Apioninœ; Calandrinæ => Calandrinse; Thecesterninæ => Thecesterninae; Brenthidæ => Brenthidae; Otiorhynchinæ => Otiorhynchinre; Scolytidæ => Scolytidae; Attelabinæ => Attelabinae; Anthribidæ => Anthribidae; Pterocolinæ => Pterocolinse; Hispidæ => Hispida; Allocoryninæ => Allocoryninse; Cassididæ => Cassididae. For the 24 æ there are: 11 ae; 5 œ; 5 se; 1 ee; 1 re; 1 a?. So not a single correct rendering of the ligature, æ. By contrast, the only example of œ in the page, Epicœrus, was correctly rendered.
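One way to act on the tally on slide 32 is to map the observed OCR renderings back to the ligature. The Python sketch below is an illustration, not the authors' method. It assumes the repair should apply only to capitalised words whose stem ends in "in" or "id" (the -inæ and -idæ endings that dominate the list), which keeps it away from ordinary English words. It therefore leaves "Alatse" and the truncated "Hispida" untouched.

```python
import re

# The OCR renderings of "æ" listed on the slide: ae, œ, se, ee, re.
# ASSUMPTION: only repair capitalised words whose stem ends in "in" or "id".
OCR_AE_ENDING = re.compile(r"\b([A-Z][a-z]+(?:in|id))(?:ae|œ|se|ee|re)\b")


def restore_ae_ligature(text: str) -> str:
    """Rewrite the OCR variants of a final 'æ' in -inæ / -idæ names."""
    return OCR_AE_ENDING.sub(lambda match: match.group(1) + "æ", text)


sample = "Otiorhynchinœ, Thecesterninse, Allocoryninee, Otiorhynchinre, Brenthidae"
print(restore_ae_ligature(sample))
```

A rule this narrow trades recall for safety. Catching cases such as "Alatse" would need either a name list to check against or the similarity measures discussed on the later slides.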
  33. Soundex. Word counts with Soundex codes: 831 elytra E436; 639 prothorax P636; 637 Hab H100; 616 punctate P523; 578 millim M450. Words with code E436: 831 elytra; 509 Elytra; 294 elytris; 125 elytral; 36 elytron; 12 elytrisque; 9 elytrorumque; 8 Elytral; 7 elytrorum; 2 elytro; 1 Elytrorum; 1 Elytris.
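The codes on slide 33 match the standard American Soundex scheme: keep the first letter, encode the following consonants as digits, collapse adjacent repeats, and pad to four characters. The Python sketch below implements that scheme (it is not necessarily the implementation the authors used), and `group_by_soundex` is a hypothetical helper showing how variants such as "elytra" and "elytrorumque" fall into one bucket.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Standard Soundex digit groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the four-character American Soundex code for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and unknown letters map to ""
        if code and code != previous:
            digits.append(code)
        previous = code
    return (first.upper() + "".join(digits) + "000")[:4]


def group_by_soundex(words: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket words by their Soundex code."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        buckets[soundex(word)].append(word)
    return dict(buckets)
```

Because only the first three consonant codes survive, all the inflected forms of "elytra" listed on the slide share E436, which is what makes Soundex useful for clustering Latin word forms and OCR variants.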
  34. Similar words? denticulate => denticulata, Levenshtein distance of 1: 0,0,1. denticulate => reticulate, Levenshtein distance of 2: 3,2,0. denticulate => geniculate, Levenshtein distance of 2: 2,2,0.
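The distances on slide 34 can be reproduced with the textbook dynamic-programming form of Levenshtein distance, sketched below in Python. The three-number groups shown on the slide are not explained there, so the sketch makes no attempt to reproduce them.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


for other in ("denticulata", "reticulate", "geniculate"):
    print("denticulate =>", other, levenshtein("denticulate", other))
```

The example also shows the limitation hinted at by the slide's question mark: "reticulate" and "geniculate" are as close to "denticulate" as many genuine misspellings would be, yet they are different words, so edit distance alone cannot decide whether two strings are variants of the same term.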
  35. What did we achieve? Marked up 11 volumes, i.e. 4,504 pages. Have a robust workflow: can mark up a Bulletin in about 10-15 minutes. The choke point is the call to the OpenCalais web service. No manual intervention or review required: the workflow is scalable. Recognising taxon names: well, uBio gives us a good start, and we have techniques to cluster ALL misspellings and variants with a valid taxon; but it is not perfect, e.g. BanchusFabricius ends up in more than one cluster.
  36. “Making the Scratchpads better”: more reliable (e.g., distribute the servers); more functional (e.g., phylogenetic & publication services); easier to use (better workflows); prettier (better graphical design, more intuitive); more integrated (for data stored inside & outside the Scratchpad framework); more sustainable (simple administration, distributed developers, development sandbox). “Making natural history better”: easier to compile, manage and reuse your data; easier to find and reuse other people's data; promoting your data inside & outside the taxonomic community; getting people to work for you (crowdsourcing).
  37. [Publication workflow diagram. Labels: Author; Manuscript preparation on a Scratchpad; Submit as XML; Publisher; Send to reviewers; Produce PDF; Register with ZooBank, GBIF, EoL etc.; Public; Enhanced HTML; Enhanced PDF; Enhanced XML; Printed paper.]
  38. Thank you for your attention. Any questions?