These are Cyndy Parr's presentations at the EOL Global Partner Summit, starting with an overview of the meeting, and including an overview of how we set up content partnerships, and how we calculate and use page richness scores.
EOL is a giant mashup: it merges information that was created elsewhere onto its pages, which are then available for curators (mostly credentialed scientists) to trust or untrust and rate, or for anybody to comment on or tag. We’re partnering with over a hundred scientific databases as well as public contribution sites like Flickr and Wikipedia.
• 100+ partner databases
• 700 curators / 1000s of contributors / 46,000 members
• 2.8 million pages
• 500 thousand pages with Creative Commons content
• Over 2 million data objects and >1 million pages with links to research literature
• Traffic in past year: 1.7 million unique users, 6.2 million page views
Low-hanging fruit is mostly gone
Fellows
Smaller partners
Partners are projects and databases that we are sharing data with
Why it is important to streamline
• About 32 partners have managed to make their own XML resource docs, but that probably has the lowest cost per return
• But connectors may be even more important: web services and DB connectors are putting content on at least half a million pages
• LDs/Scratchpads are important for small partners
• Spreadsheets are popular; with the new transfer schema and flat-file archive format, the XML bar may go down and the spreadsheet bar might go up
Overview › Brief Summary
Overview › Comprehensive Description
Overview › Distribution
Physical Description › Morphology
Physical Description › Size
Physical Description › Diagnostic Description
Physical Description › Type Information
Physical Description › Look Alikes
Physical Description › Development
Ecology › Habitat
Ecology › Migration
Ecology › Dispersal
Ecology › Diseases and Parasites
Ecology › Population Biology
Ecology › General Ecology
Life History and Behavior › Behavior
Life History and Behavior › Cyclicity
Life History and Behavior › Life Cycle
Life History and Behavior › Reproduction
Life History and Behavior › Growth
Evolution and Systematics › Evolution
Evolution and Systematics › Fossil History
Evolution and Systematics › Systematics or Phylogenetics
Evolution and Systematics › Functional Adaptations
Physiology and Cell Biology › Physiology
Physiology and Cell Biology › Cell Biology
Molecular Biology and Genetics › Genetics
Conservation › Conservation Status
Conservation › Trends
Conservation › Threats
Conservation › Legislation
Conservation › Management
Relevance to Humans and Ecosystems › Benefits
Relevance to Humans and Ecosystems › Risks
Notes
Taxonomy
Education Resources
Citizen Science
Identification Resources
Inspired by community ecology and measures of species diversity, which of course were originally inspired by information theory, but we haven’t used those measures. Instead we combined these factors so that we could assign each one a weight based on how well it captures “a rich page.”
We sampled dozens of pages and had team members assess them for their gestalt “richness” based on their own criteria. Then we compared those scores to those generated by the algorithm, and iteratively changed weights until we achieved a set of weights that appeared to reflect human perception of “richness.”
Note that there’s a penalty: unvetted material is worth only about 75% of vetted material.
Also there are maximums for many of these input values: having 200 images may not make a page much richer than having 25 images.
We reserve the right to change this to ensure that the index is as useful as possible. Like Google PageRank, we want to ensure that nobody can game the system.
Also note that there is an implication that a “rich page” is a “high quality page”; this is not necessarily true, but often it is.
As EOL goes forward with our version 2 we’ll be gathering other inputs that can tell us if a page is successful: ratings of its objects, for example.
This treemap summarizes the 1.9 million described species that each have a page on the Encyclopedia of Life. Some of these pages have only a name so far, but about a million of them have more than that: maps, multimedia, text, or at least literature references.
Each of these species potentially represents a volume in a “living library,” as each has evolved solutions to nature’s challenges, solutions that can benefit human society. For example, the genomics revolution and half of our synthetic drugs were made possible by understanding the characteristics of particular species.
Global content summit: Overview, content partnering, richness
http://www.eol.org
• All species known to science
• Freely accessible: open access, open source
• Available from a single portal in a common format
• Quality
• Constantly growing
• Aimed at multiple audiences
Aims of global partners
Global access to knowledge about life on Earth
To increase awareness and understanding of living nature through an Encyclopedia of Life that gathers, generates and shares knowledge in an open, freely accessible and trusted digital resource
Work together towards this vision and mission, sharing expertise and knowledge as appropriate
Expand the global pool of knowledge about biodiversity and improve access to it
Aims of this workshop
• Gather content experts from Global Partners
• Become familiar with each other’s work
• Learn how core EOL works and provide feedback on it
• Form the Species Pages Working Group
    Team at Smithsonian (SPG)
    Representatives from global partners
• Draft individual plans that complement each other towards a common goal
• Remind ourselves WHY we want to do this
Acknowledgements
• Funding from:
    David M. Rubenstein gift
    John D. and Catherine T. MacArthur Foundation
    Alfred P. Sloan Foundation
    Smithsonian Institution
    Marine Biological Laboratory
    Harvard University
    and other funders and donors
• All our content partners and global partners
• Volunteer curators and individual contributors via Flickr, Wikimedia, and members of EOL
• All of you for coming
• Claire Badgley
Content Partner process overview
• Partner creates an EOL member account
• Adds a content partner
• We communicate with them
• They (or we) upload a resource file or set a URL where one can be found
• They set a harvest frequency
• EOL harvests at that frequency
Current methods of data transfer
• EOL resource document (XML) (usually they do the work)
• Spreadsheet upload (either can do the work)
• Connector (we do the work)
    Scrape web site or PDF
    Use web services
    Work from a copy of DB
• Darwin Core Archive (classifications, soon)
See http://eol.org/info/cp_resource_checklist
How EOL gets content (n = 141 partners)
[Bar chart: number of partners (axis 0 to 70) by transfer method: XML resource doc, Connector, LD/eLD/Scratchpad, Spreadsheet. Legend: CSV, web service, PDF, HTML, DB.]
Example partner
• Pensoft has a process to generate EOL-compliant XML for new species
• Also sends images to Morphbank, specimens to GBIF
• They registered the URL at EOL
• Our script checks for changes once a day
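The daily check on a registered resource URL could be sketched roughly as follows. This is only an illustration of one common way to detect changes (comparing a content digest against the one stored at the last harvest); the slides do not say how EOL's script actually decides that a resource has changed, and the function names here are hypothetical.

```python
import hashlib
from typing import Optional


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of the resource file's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_harvest(current: bytes, last_digest: Optional[str]) -> bool:
    """Decide whether a resource should be re-harvested.

    Assumption: a resource is harvested when it has never been seen
    before, or when its bytes differ from the previous harvest.
    """
    return last_digest is None or content_digest(current) != last_digest
```

A scheduler would call `needs_harvest` once per harvest interval with the freshly downloaded file and the digest saved from the previous run.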
EOL Schema Sources
Content type               Standards used
Taxa                       Darwin Core Archive
Attribution & licensing    Dublin & Darwin Core
Text objects & links       Species Profile Model (and now +)
Multimedia                 Dublin (+ Audubon Core)
Example biological content
EOL Table of Contents                                      TDWG Species Profile Model
Physical Description › Morphology                          #Morphology
Physical Description › Size                                #Size
Ecology › Habitat                                          #Habitat
Ecology › Associations                                     #Associations
Life History & Behavior › Life Expectancy                  #LifeExpectancy
Evolution and Systematics › Functional Adaptations         #Evolution
Conservation › Conservation Status                         #ConservationStatus
Molecular Biology and Genetics › Genetics                  #Genetics
Molecular Biology and Genetics › Genome                    #MolecularBiology
Molecular Biology and Genetics › Molecular Biology         #MolecularBiology
Nucleotide Sequences                                       #MolecularBiology
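The mapping on this slide can be held as a simple lookup table. The sketch below copies the pairs as best they can be read from the slide; the dictionary and function names are hypothetical, and only the subject fragments shown on the slide are used (no full URIs are assumed).

```python
from typing import List, Optional

# EOL Table of Contents entry -> TDWG Species Profile Model subject fragment,
# transcribed from the slide. Several EOL chapters share one SPM subject.
TOC_TO_SPM = {
    "Physical Description › Morphology": "#Morphology",
    "Physical Description › Size": "#Size",
    "Ecology › Habitat": "#Habitat",
    "Ecology › Associations": "#Associations",
    "Life History & Behavior › Life Expectancy": "#LifeExpectancy",
    "Evolution and Systematics › Functional Adaptations": "#Evolution",
    "Conservation › Conservation Status": "#ConservationStatus",
    "Molecular Biology and Genetics › Genetics": "#Genetics",
    "Molecular Biology and Genetics › Genome": "#MolecularBiology",
    "Molecular Biology and Genetics › Molecular Biology": "#MolecularBiology",
    "Nucleotide Sequences": "#MolecularBiology",
}


def spm_subject(toc_entry: str) -> Optional[str]:
    """Look up the SPM subject for an EOL chapter; None if not on the slide."""
    return TOC_TO_SPM.get(toc_entry)


def toc_entries_for(subject: str) -> List[str]:
    """Return all EOL chapters that map to the given SPM subject, sorted."""
    return sorted(toc for toc, spm in TOC_TO_SPM.items() if spm == subject)
```

The reverse lookup makes the many-to-one cases visible: three EOL chapters on the slide share `#MolecularBiology`.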
Partners
• Can delete or replace any of their objects
• Control how often we harvest, and can force a harvest
• Get an automatically updating collection
• Can request that we use their classification for browsing
• Can change the logo and description of their project
• Receive comments and curator actions immediately
• Receive monthly reminders that they can get traffic statistics
• Get many links back to their original web resources
Taxon page richness algorithm
a (Breadth) + b (Depth) + c (Diversity), with a = 60%, b = 30%, c = 10%
Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status
Depth: # words per text object, # words total
Diversity: sources (partners)
0 – 100, Threshold 40
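A minimal sketch of how such a weighted score might be computed is given below. Only the 60/30/10 weights, the 0 to 100 range, the threshold of 40, and the roughly 75% value of unvetted material come from the slides and speaker notes. Everything else is an assumption for illustration: the function names, the way each component is normalized to 0..1, and reading the threshold as an inclusive cutoff for calling a page "rich". The actual per-input maximums are not given, so callers must supply their own.

```python
# Weights from the slide, expressed as points out of 100.
BREADTH_POINTS = 60
DEPTH_POINTS = 30
DIVERSITY_POINTS = 10

UNVETTED_FACTOR = 0.75  # speaker notes: unvetted is worth about 75% of vetted
RICH_THRESHOLD = 40     # slide: "Threshold 40" (assumed to mean "rich")


def effective_count(vetted: int, unvetted: int) -> float:
    """Count items, discounting unvetted ones by the unvetted factor."""
    return vetted + UNVETTED_FACTOR * unvetted


def capped(value: float, maximum: float) -> float:
    """Scale an input to 0..1, saturating at a per-input maximum.

    The maximum is chosen by the caller (hypothetical values); the notes
    only say that such maximums exist.
    """
    return min(value, maximum) / maximum


def richness(breadth: float, depth: float, diversity: float) -> float:
    """Combine three components, each already scaled to 0..1, into 0..100."""
    return (BREADTH_POINTS * breadth
            + DEPTH_POINTS * depth
            + DIVERSITY_POINTS * diversity)


def is_rich(score: float) -> bool:
    """Assumption: a page counts as rich at or above the threshold."""
    return score >= RICH_THRESHOLD
```

For example, with a hypothetical image maximum of 25, `capped(200, 25)` and `capped(25, 25)` give the same value, which mirrors the note that 200 images add little over 25.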
Summary of EOL page richness
Overall
• 950,000 have content
• 2 % are rich
• ~22 % have only links to literature
Hot List
• 30 % of 75K are rich
• Average richness = ~30
Red Hot List
• 56 % of 3K are rich
• Average richness = 43
How richness is used
• Choose images for home page “March of Life”
• Allows sorting in collections
    Weird life example
• Helps provide best search and API results
• Any other ideas? Could we be matchmakers for pages needing enrichment and users?
The page richness index
• Helps fill gaps with existing knowledge
• Helps prioritize funding and training so that it has maximum impact on closing true gaps
• Will be available via API
• Computing and storing richness index on EOL is a step towards storing and serving computable data