These are Cyndy Parr's presentations at the EOL Global Partner Summit, starting with an overview of the meeting, and including an overview of how we set up content partnerships, and how we calculate and use page richness scores.
Global content summit: Overview, content partnering, richness
1. Cynthia Parr Global Content Summit
Species Pages Group 17-19 Jan 2011
2. http://www.eol.org
• All species known to science
• Freely accessible: open
access, open source
• Available from a single portal
in a common format
• Quality
• Constantly growing
• Aimed at multiple audiences
3. EOL Global Partners
[Map of global partners: GBIF, ViBRANT, Dutch, Pan-Arab, China, Mexico, India, Costa Rica, Colombia, Peru, Australia, South Africa, BHL, BHL-Global]
4. Aims of global partners
Global access to knowledge about life on Earth
To increase awareness and understanding of living
nature through an Encyclopedia of Life that
gathers, generates and shares knowledge in an
open, freely accessible and trusted digital resource
Work together towards this vision and mission, sharing
expertise and knowledge as appropriate
Expand the global pool of knowledge about biodiversity and
improve access to it
5. Aims of this workshop
• Gather content experts from Global Partners
• Become familiar with each other’s work
• Learn how core EOL works and provide
feedback on it
• Form the Species Pages Working Group
Team at Smithsonian (SPG)
Representatives from global partners
• Draft individual plans that complement each
other towards a common goal
• Remind ourselves WHY we want to do this
6. What is content?
Biological information
Names and hierarchies
Descriptive text
Literature
Multimedia
Maps
Links to more information
... what about comments, collection annotations?
8. Acknowledgements
• Funding from:
David M. Rubenstein gift
John D. and Catherine T. MacArthur Foundation
Alfred P. Sloan Foundation
Smithsonian Institution
Marine Biological Laboratory
Harvard University
and other funders and donors
• All our content partners and global partners
• Volunteer curators and individual contributors via Flickr, Wikimedia,
and members of EOL
• All of you for coming
• Claire Badgley
9. Overview of Content Partnering
10. EOL is a content curation community
[Diagram. Sources: Databases, Journals, LifeDesks & Scratchpads, Public contributions. Actions on eol.org: Aggregate, Curate, Comment, Rate, Collect. Also shown: Quality control, prioritization; API; Third party apps.]
18. Content Partner process overview
Partner creates an EOL member account
Adds a content partner
We communicate with them
They (or we) upload a resource file or set a
URL where one can be found
They set a harvest frequency
EOL harvests at that frequency
19. Current methods of data transfer
EOL resource document (XML) (usually they do
the work)
Spreadsheet upload (either can do the work)
Connector (we do the work)
Scrape web site or PDF
Use web services
Work from a copy of DB
Darwin Core Archive (classifications, soon)
See http://eol.org/info/cp_resource_checklist
20. How EOL gets content (n=141 partners)
[Bar chart: number of partners (axis 0 to 70) by transfer method. Categories: XML resource doc, Connector, LD/eLD/Scratchpad, Spreadsheet. Legend: CSV, web service, PDF, HTML, DB.]
21. Example partner
• Pensoft has a
process to generate
EOL-compliant XML
for new species
• Also sends images to
Morphbank,
specimens to GBIF
• They registered the
URL at EOL
• Our script checks for
changes once a day
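The daily check described on this slide could be sketched roughly as follows. This is only an illustration: the slides do not say how the EOL script detects changes, so comparing a SHA-256 digest of the resource file is an assumption, and `resource_digest` and `needs_harvest` are hypothetical names.

```python
import hashlib
from typing import Optional


def resource_digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of a resource file's bytes."""
    return hashlib.sha256(content).hexdigest()


def needs_harvest(content: bytes, last_digest: Optional[str]) -> bool:
    """True if the resource was never harvested or its bytes have changed."""
    return last_digest is None or resource_digest(content) != last_digest


# Hypothetical usage: a scheduler would fetch the registered URL once a day
# and compare against the digest stored after the previous harvest.
previous = b"resource file, version 1"
stored = resource_digest(previous)
print(needs_harvest(previous, None))                       # never harvested
print(needs_harvest(previous, stored))                     # unchanged
print(needs_harvest(b"resource file, version 2", stored))  # changed
```

A digest comparison avoids re-harvesting an unchanged file; a real harvester might instead rely on HTTP headers or timestamps.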
22. EOL Schema Sources
Content type: Standards used
Taxa: Darwin Core Archive
Attribution & licensing: Dublin & Darwin Core
Text objects & links: Species Profile Model (and now +)
Multimedia: Dublin (+ Audubon Core)
23. Example biological content
EOL Table of Contents → TDWG Species Profile Model
Physical Description › Morphology → #Morphology
Physical Description › Size → #Size
Ecology › Habitat → #Habitat
Ecology › Associations → #Associations
Life History & Behavior › Life Expectancy → #LifeExpectancy
Evolution and Systematics › Functional Adaptations → #Evolution
Conservation › Conservation Status → #ConservationStatus
Molecular Biology and Genetics › Genetics → #Genetics
Molecular Biology and Genetics › Genome → #MolecularBiology
Molecular Biology and Genetics › Molecular Biology → #MolecularBiology
Nucleotide Sequences → #MolecularBiology
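The mapping on this slide is many-to-one (several table-of-contents entries share one SPM term), which a small lookup table makes explicit. This is a sketch only: the pairs are copied from the slide, while the names `TOC_TO_SPM`, `spm_term`, and `toc_entries_for` are hypothetical and not part of any EOL code.

```python
from typing import List, Optional

# EOL table-of-contents entries and their SPM terms, as listed on the slide.
TOC_TO_SPM = {
    "Physical Description › Morphology": "#Morphology",
    "Physical Description › Size": "#Size",
    "Ecology › Habitat": "#Habitat",
    "Ecology › Associations": "#Associations",
    "Life History & Behavior › Life Expectancy": "#LifeExpectancy",
    "Evolution and Systematics › Functional Adaptations": "#Evolution",
    "Conservation › Conservation Status": "#ConservationStatus",
    "Molecular Biology and Genetics › Genetics": "#Genetics",
    "Molecular Biology and Genetics › Genome": "#MolecularBiology",
    "Molecular Biology and Genetics › Molecular Biology": "#MolecularBiology",
    "Nucleotide Sequences": "#MolecularBiology",
}


def spm_term(toc_entry: str) -> Optional[str]:
    """Look up the SPM term for a table-of-contents entry (None if unmapped)."""
    return TOC_TO_SPM.get(toc_entry)


def toc_entries_for(term: str) -> List[str]:
    """List every table-of-contents entry that maps to the given SPM term."""
    return [toc for toc, spm in TOC_TO_SPM.items() if spm == term]


print(spm_term("Ecology › Habitat"))
print(toc_entries_for("#MolecularBiology"))
```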
24. EOL v2
[Diagram labels: SPM, DwC, infoitem, description, Plinian Core; using Darwin Core Archive flat files as transport mechanism]
26. Partners
Can delete or replace any of their objects
Control how often we harvest, and can force a harvest
Get an automatically updating collection
Can request that we use their classification for browsing
Can change the logo and description of their project
Receive comments and curator actions immediately
Receive monthly reminders that they can get traffic statistics
Get many links back to their original web resources
28. Partners cannot
Publish the very first time
Decide if they are pre-vetted
Roll back a harvest
Change the objects of any other partner
Change classifications from any other partner
29. http://eol.org/pages/704102
Richness scores
30. Taxon page richness algorithm
Richness = a (Breadth) + b (Depth) + c (Diversity), with weights a = 60%, b = 30%, c = 10%
Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status
Depth: # words per text object, # words total
Diversity: sources (partners)
Score range 0 to 100; threshold 40
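A minimal sketch of the weighted sum on this slide, for illustration only. The 60/30/10 weights, the 0 to 100 range, and the threshold of 40 come from the slide; everything else is an assumption: that each component is first normalized to the range 0 to 1, that the threshold marks a "rich" page, and that the comparison is inclusive. `richness` and `is_rich` are hypothetical names, not EOL's implementation.

```python
WEIGHTS = {"breadth": 0.60, "depth": 0.30, "diversity": 0.10}  # from the slide
THRESHOLD = 40  # from the slide; assumed here to mean "rich at or above 40"


def richness(breadth: float, depth: float, diversity: float) -> float:
    """Combine three component scores (each assumed to be in 0..1) into 0..100."""
    components = {"breadth": breadth, "depth": depth, "diversity": diversity}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return 100.0 * sum(WEIGHTS[name] * value for name, value in components.items())


def is_rich(score: float) -> bool:
    """Assumed reading of the threshold: a page is rich at or above 40."""
    return score >= THRESHOLD


score = richness(breadth=0.5, depth=0.4, diversity=0.2)
print(round(score, 1), is_rich(score))
```

Because breadth carries 60% of the weight, a page with every breadth item present but no text depth or source diversity would still clear the threshold under these assumptions.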
31. Summary of EOL page richness
Overall
950,000 have content
2 % are rich
~22 % have only links to literature
Hot List
30 % of 75K are rich
Average richness = ~30
Red Hot List
56 % of 3K are rich
Average richness = 43
32. How richness is used
Choose images for home page “March of Life”
Allows sorting in collections (example: Weird life)
Helps provide best search and API results
Any other ideas? Could we be matchmakers for pages needing enrichment and users?
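Two of the uses listed here, sorting a collection and finding pages that need enrichment, can be sketched in a few lines. This is illustrative only: the `Page` record, the function names, and the sample pages and scores are all hypothetical, and treating scores below 40 as "needing enrichment" is an assumption based on the threshold given for the richness algorithm.

```python
from typing import List, NamedTuple


class Page(NamedTuple):
    page_id: int      # hypothetical identifier
    name: str
    richness: float   # 0..100 score


def sort_by_richness(pages: List[Page]) -> List[Page]:
    """Order a collection with the richest pages first."""
    return sorted(pages, key=lambda p: p.richness, reverse=True)


def needs_enrichment(pages: List[Page], threshold: float = 40.0) -> List[Page]:
    """Pages scoring below the threshold: candidates to match with contributors."""
    return [p for p in pages if p.richness < threshold]


# Made-up sample data, for illustration only.
collection = [Page(1, "page A", 12.0), Page(2, "page B", 71.5), Page(3, "page C", 39.9)]
print([p.name for p in sort_by_richness(collection)])
print([p.name for p in needs_enrichment(collection)])
```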
34. Strategies for improving richness
Crowd-sourcing: Collections, Communities, Mobile apps
Leveraging: Enabling platforms, Enabling journals, Data mining BHL etc.
35. The page richness index
Helps fill gaps with existing knowledge
Helps prioritize funding and training so that it
has maximum impact on closing true gaps
Will be available via API
Computing and storing richness index on
EOL is a step towards storing and serving
computable data
Editor's Notes
EOL is a giant mashup that merges information created elsewhere onto its pages, which are then available for curators (mostly credentialed scientists) to trust or untrust and rate, or for anybody to provide comments or tags. We’re partnering with over a hundred scientific databases as well as public contribution sites like Flickr and Wikipedia.
100+ partner databases
700 curators / 1000s contributors / 46,000 members
2.8 million pages
500 thousand pages with Creative Commons content
Over 2 million data objects and >1 million pages with links to research literature
Traffic in past year: 1.7 million unique users, 6.2 million page views
Low hanging fruit is mostly gone
Fellows
Smaller partners
Partners are projects and databases that we are sharing data with
Why it is important to streamline
About 32 partners have managed to make their own XML resource docs, but that probably has the lowest cost per return
But connectors may be even more important: web services and DB connectors are putting content on at least ½ million pages
LDs/Scratchpads are important for small partners
Spreadsheets are popular; with the new transfer schema and flat-file archive format, the XML bar may go down and the spreadsheet bar might go up
Overview › Brief Summary
Overview › Comprehensive Description
Overview › Distribution
Physical Description › Morphology
Physical Description › Size
Physical Description › Diagnostic Description
Physical Description › Type Information
Physical Description › Look Alikes
Physical Description › Development
Ecology › Habitat
Ecology › Migration
Ecology › Dispersal
Ecology › Diseases and Parasites
Ecology › Population Biology
Ecology › General Ecology
Life History and Behavior › Behavior
Life History and Behavior › Cyclicity
Life History and Behavior › Life Cycle
Life History and Behavior › Reproduction
Life History and Behavior › Growth
Evolution and Systematics › Evolution
Evolution and Systematics › Fossil History
Evolution and Systematics › Systematics or Phylogenetics
Evolution and Systematics › Functional Adaptations
Physiology and Cell Biology › Physiology
Physiology and Cell Biology › Cell Biology
Molecular Biology and Genetics › Genetics
Conservation › Conservation Status
Conservation › Trends
Conservation › Threats
Conservation › Legislation
Conservation › Management
Relevance to Humans and Ecosystems › Benefits
Relevance to Humans and Ecosystems › Risks
Notes
Taxonomy
Education Resources
Citizen Science
Identification Resources
Extension
Leveraging strengths
Inspired by community ecology and measures of species diversity, which of course were originally inspired by information theory, but we haven’t used those measures. Instead we combined these factors so that we could assign weights to them based on how well they capture “a rich page.”
We sampled dozens of pages and had team members assess them for their gestalt “richness” based on their own criteria. Then we compared those scores to those generated by the algorithm, and iteratively changed weights until we achieved a set of weights that appeared to reflect human perception of “richness.”
Note that there’s a penalty: unvetted material is only worth about 75% of vetted material.
Also there are maximums for many of these input values: having 200 images may not make a page much richer than having 25 images.
We reserve the right to change this to ensure that the index is as useful as possible. Like Google PageRank, we want to ensure that nobody can game the system.
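The unvetted penalty and the input maximums described in these notes could be combined along the following lines. This is a sketch under stated assumptions: the 0.75 factor comes from the notes ("about 75%"), but the order in which the penalty and the cap are applied is a guess, the cap of 25 images is only the notes' illustrative figure rather than a documented setting, and `capped_fraction` is a hypothetical name.

```python
UNVETTED_WEIGHT = 0.75  # notes: unvetted material is worth about 75% of vetted


def capped_fraction(vetted: int, unvetted: int, maximum: int) -> float:
    """Turn raw item counts into a 0..1 input value.

    Unvetted items are discounted, then the total is capped at `maximum`
    so that very large counts add nothing beyond the cap.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    effective = vetted + UNVETTED_WEIGHT * unvetted
    return min(effective, maximum) / maximum


# With a hypothetical cap of 25 images, 200 images score no higher than 25.
print(capped_fraction(vetted=25, unvetted=0, maximum=25))
print(capped_fraction(vetted=200, unvetted=0, maximum=25))
print(capped_fraction(vetted=0, unvetted=20, maximum=25))
```

Capping each input is one simple way to limit gaming of the index: adding hundreds of near-duplicate images does not keep raising the score.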
Also note that there is an implication that a “rich page” is a “high quality page”; that is not necessarily true, but often it is.
As EOL goes forward with our version 2 we’ll be gathering other inputs that can tell us if a page is successful: ratings of its objects, for example.
This treemap summarizes the 1.9 million described species that each have a page on the Encyclopedia of Life. Some of these pages have only a name so far, but about a million of them have more than that: maps, multimedia, text, or at least literature references.
Each of these species potentially represents a volume in a “living library,” as each has evolved solutions to nature’s challenges, solutions that can benefit human society. For example, the genomics revolution and half of our synthetic drugs were made possible by understanding the characteristics of particular species.