Crowd-sourcing the creation of “articles” within the Biodiversity Heritage Library
  • Bianca Crowley, crowleyb@si.edu
  • Trish Rose-Sandler, trish.rose-sandler@mobot.org
The BHL is…
  • A consortium of 13 natural history, botanical libraries and research institutions
  • An open access digital library for legacy biodiversity literature
  • An open data repository of taxonomic names and bibliographic information
  • An increasingly global effort
BHL, LITA 2011
Problem: Books vs. Articles
  • Librarians manage books
  • Users need articles
Solution: “Article-ization”
  • Creating articles manually, with the help of our users: BHL PDF Generator
  • Creating articles through automated means: BioStor (http://biostor.org/issn/0006-324X)
  • Page, R. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(187). Retrieved from http://www.biomedcentral.com/1471-2105/12/187
Create-your-own PDF
CiteBank today: http://citebank.org
What is an “article” anyway?
The Good, the Bad, the Ugly (three slides)
Questions for Data Analysis
  • What is the quality, or accuracy, of user-provided metadata?
  • What kinds of content are users creating?
  • How can we improve the PDF Generator interface?
Stats (Jan 2010-Apr 2011)
  • Approx 60,000 PDFs created from the PDF Generator
  • 40% of those (approx 24,000) were ingested into CiteBank (PDFs without user-contributed metadata excluded)
  • 5 reviewers analyzed 945 PDFs (approx 3.9% of the 24,000+ articles going into CiteBank)*
  * Thanks to reviewers Gilbert Borrego, Grace Costantino, and Sue Graves from the Smithsonian Institution
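
The figures on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the approximate counts quoted above; the variable names are illustrative and are not part of any BHL or CiteBank tooling.

```python
# Approximate counts quoted on the Stats slide (Jan 2010-Apr 2011).
pdfs_created = 60_000    # PDFs created with the PDF Generator
ingested_share = 0.40    # share of those ingested into CiteBank
pdfs_reviewed = 945      # PDFs rated by the reviewers

# Number of PDFs ingested into CiteBank (rounded to a whole count).
pdfs_ingested = round(pdfs_created * ingested_share)

# Size of the reviewed sample as a percentage of the ingested PDFs.
sample_pct = 100 * pdfs_reviewed / pdfs_ingested

print(f"Ingested into CiteBank: about {pdfs_ingested:,}")
print(f"Reviewed sample: about {sample_pct:.1f}% of ingested PDFs")
```

Both results agree with the slide: roughly 24,000 ingested PDFs and a reviewed sample of about 3.9%.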
Methodological approach
  • Quantitative: numerical rating system
  • Rated titles, authors, beginning/ending pages
  • An article's “findability” within CiteBank search often determined how it was rated
Ratings System: Title
  • 1 = has all characters in title letter for letter
  • 2 = does not have all characters in title letter for letter but is still findable in CiteBank search
  • 3 = does not have all characters in title letter for letter and is NOT findable via the CiteBank search
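
The title rubric can be written down as a small function. This is a sketch only: in the study the ratings were assigned by human reviewers, and `rate_title` and its parameters are hypothetical names. Findability is passed in as a flag because it was checked by hand in CiteBank search; no CiteBank API is assumed.

```python
def rate_title(user_title: str, actual_title: str, findable: bool) -> int:
    """Return 1, 2, or 3 following the title rubric on the slide.

    findable: the reviewer's judgment of whether the article could be
    found via CiteBank search (an input here, not computed).
    """
    if user_title == actual_title:
        return 1  # letter-for-letter match
    # Not an exact match: the rating depends on findability alone.
    return 2 if findable else 3


print(rate_title("Notes on Carex", "Notes on Carex", True))   # exact match
print(rate_title("Notes on Carx", "Notes on Carex", True))    # typo, findable
print(rate_title("untitled", "Notes on Carex", False))        # not findable
```

The example titles are made up for illustration.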
Ratings System: Author
  • 1 = has all characters in the author(s)' last name(s) letter for letter
  • 2 = has at least one author's last name spelled correctly
  • 3 = has no authors, or none of the authors' last names are spelled correctly
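
The author rubric can be sketched the same way. The function name `rate_authors` and the list-of-last-names inputs are assumptions made for illustration, as is the choice to ignore author order; the reviewers applied the rubric by hand.

```python
def rate_authors(user_last_names: list[str], actual_last_names: list[str]) -> int:
    """Return 1, 2, or 3 following the author rubric on the slide."""
    if not user_last_names:
        return 3  # no authors supplied
    actual = set(actual_last_names)
    if set(user_last_names) == actual:
        return 1  # every last name matches letter for letter
    if any(name in actual for name in user_last_names):
        return 2  # at least one last name is spelled correctly
    return 3      # none of the last names are spelled correctly


print(rate_authors(["Page"], ["Page"]))
print(rate_authors(["Page", "Smyth"], ["Page", "Smith"]))
print(rate_authors([], ["Page"]))
```

The sample names are placeholders, not data from the study.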
Ratings System: Article beginning & ending pages
  • 1 = has all text pages for an article, from start to end
  • 2 = subset of pages from a larger article
  • 3 = a set of pages where the intellectual content has been compromised
Analysis steps
Results
What did we learn?
  • Ratings were better than we expected
  • Many users took the time to create decent metadata
  • “Good enough” is not great but is still “findable”
But of course… there's always room for improvement
  • Other factors
  • BHL-Australia's new portal: http://bhl.ala.org.au/
Changes we made to the UI so far
  • Asking users if they want to contribute their article to CiteBank
  • Making article title a required field and validating that it is at least 2 characters long
  • Review button for users to review page selections and metadata (inspired by BHL-AUS)
  • Reduced text and added more intuitive graphics (inspired by BHL-AUS)
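
The title check described in the list above (required field, at least 2 characters) could look roughly like this. It is a sketch under stated assumptions, not the PDF Generator's actual code: the function name `is_valid_title` is hypothetical, and trimming surrounding whitespace before counting is an added assumption.

```python
from typing import Optional

MIN_TITLE_LENGTH = 2  # minimum length stated on the slide


def is_valid_title(title: Optional[str]) -> bool:
    """Return True if a title is present and long enough."""
    if title is None:
        return False  # required field is missing
    # Assumption: whitespace-only padding does not count toward the length.
    return len(title.strip()) >= MIN_TITLE_LENGTH


print(is_valid_title("Notes on Carex"))
print(is_valid_title("A"))
print(is_valid_title(None))
```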


Editor's Notes

  • #7 (CiteBank today): Add link: http://biodiversitylibrary.org/item/54249
  • #19 (Results): Highlight row? Show article in CB