
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource by Gaurav Vaidya, based on a paper by Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom and Laura Russell. Presented November 8, 2012 (http://www.mcn.edu/2012/extracting-data-historical-documents-crowdsourcing-annotations-wikisource)

Find out more at http://bit.ly/jhfnblog

  1. Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell
  2. http://data.gbif.org/ GBIF (389 million records!)
  3. http://www.mappinglife.org/Sayornis_saya Where species are, where species aren’t
  4. Chronhorogram (Ariño & Otegui, 2010), extracted using BIDDSAT (Otegui & Ariño, 2012) http://www.unav.es/unzyec/mzna/biddsat/recsperyear.php?prov=10&dataset=all&db=GBIF_201202 The big picture (AKN)
  5. http://commons.wikimedia.org/wiki/File:Tent_in_montane_field_site.tif An expedition into the Rockies, 1904
  6. http://commons.wikimedia.org/wiki/File:Step_Valley_Lake_near_Arapahoe_Glacier.tif The Great Outdoors
  7. http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif Exploration time
  8. http://pinterest.com/cumnh/ http://media-cache-ec3.pinterest.com/avatars/ucmnh-1346976471_600.jpg University of Colorado Museum of Natural History (CUMNH) -- founded 1909
  9. http://commons.wikimedia.org/wiki/File:Junius_Henderson.jpg Junius Henderson, CUMNH Curator, 1902-1933
  10. http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif Exploration time
  11. Henderson’s notebooks
  12. “This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
  13. Wikisource: a transcription platform
  14. Step 1: Scanning (1996)
  15. The Process
      1. Images on the Wikimedia Commons.
      2. Images + text on Wikisource.
      3. Images + text + annotations on Wikisource.
      4. Data using the MediaWiki APIs.
      • Full details: http://dx.doi.org/10.3897/zookeys.209.3247
      • Short URL: http://bit.ly/henderson-paper
  16. #1. The Wikimedia Commons
  17. http://commons.wikimedia.org/wiki/File:Licensing_tutorial_en.svg Copyright?
  18. http://commons.wikimedia.org/wiki/Template:PD-scan http://commons.wikimedia.org/wiki/Template:PD-US-unpublished Copyright!
  19. http://commons.wikimedia.org/wiki/File:Field_Notes_of_Junius_Henderson,_Notebook_1.pdf Result #1: Images
  20. http://en.wikisource.org/wiki/Index:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu #2. Images + text
  21. http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371 Just like Wikipedia
  22. http://cumuseum.colorado.edu/about/newsdetail.php?newsID=3 Dr. Peter Robinson, CUMNH Director, 1971-1982. Transcribed Henderson’s notebooks, 2000-02
  23. Step 2: Transcription
  24. http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371 Result #2: Images + text
  25. http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371 Result #2: Images + text
  26. Combining multiple pages
  27. http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371 #3. Images + text + annotations
  28. Wikipedia templates
  29. Wikipedia templates are everywhere
  30. The “Neutrality” template
  31. The “Neutrality” template
  32. Examples of templates
  33. Examples of templates
  34. Examples of templates
  35. A template of our own: {{element|formal name of this element|element as written by Henderson}} Examples: {{taxon|Sayornis saya|Say Phoebe}}, {{taxon|Carduelis pinus|siskins}}, {{taxon|Siskin|siskins}}
  36. A template of our own: {{element|formal name of this element|element as written by Henderson}} Examples: {{dated|1905-07-28|July 28, 1905}}, {{place|Boulder, Colorado|Boulder, Colo}}
  37. #3. Annotations
  38. #3. Annotations
  39. #3. Annotations
  40. Calling all volunteers!
  41. Calling all volunteers!
  42. Result #3. Image + text + annotations!
  43. Volunteers arrive
  44. Volunteers arrive
  45. http://www.mappinglife.org/Sayornis_saya #4. Data
  46. Simple algorithm
  47. Simple algorithm
  48. Simple algorithm
  49. Simple algorithm
  50. Simple algorithm
  51. Complicated script
  52. Complicated, open source script
  53. Result #4. (Text + Images + Annotation) = Data!
  54. http://commons.wikimedia.org/wiki/File:Bighorn_sheep_skull_at_Arapaho_glacier,_1904.tif Where do we go from here?
  55. More books to upload
  56. More books to transcribe
  57. http://www.biodiversitylibrary.org/ More books to transcribe
  58. https://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf?page=19 A better Wikisource
  59. “This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
  60. Thanks! Find out more at http://bit.ly/jhfnblog
  61. The following slides were not used in my presentation
  62. Museum collections
  63. 240. Physa anatina Lea .......................................... Identified by Bastsch. Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson. Museum records
  64. 240. Physa anatina Lea .......................................... Identified by Bastsch. Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson. Museum records
  65. 240. Physa anatina Lea .......................................... Identified by Bastsch. Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson. Problem: context
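
The slides name the annotation templates ({{taxon|...}}, {{dated|...}}, {{place|...}}) and say the data is pulled out "using the MediaWiki APIs", but the transcript does not spell out the "simple algorithm" itself. The following is a minimal Python sketch of one plausible reading, not the project's actual script: build an API URL for a page's wikitext, find the two-argument annotations, and attach the most recent date and place to each taxon. The function names, the pairing rule, and the sample sentence are all assumptions made for illustration; only the template examples come from the slides.

```python
import re
from urllib.parse import urlencode

# Matches two-argument annotations such as {{taxon|Sayornis saya|Say Phoebe}}.
ANNOTATION = re.compile(r"\{\{\s*(taxon|dated|place)\s*\|([^|{}]*)\|([^|{}]*)\}\}")


def wikitext_url(page_title, api="https://en.wikisource.org/w/api.php"):
    """Build a MediaWiki API URL that requests the wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": page_title,
        "format": "json",
    }
    return api + "?" + urlencode(params)


def extract_annotations(wikitext):
    """Return (kind, formal_name, as_written) tuples in document order."""
    return [
        (m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in ANNOTATION.finditer(wikitext)
    ]


def to_records(annotations):
    """Assumed pairing rule: each taxon takes the latest date and place seen so far."""
    date = place = None
    records = []
    for kind, formal, written in annotations:
        if kind == "dated":
            date = formal
        elif kind == "place":
            place = formal
        else:  # kind == "taxon"
            records.append(
                {"taxon": formal, "verbatim": written, "date": date, "place": place}
            )
    return records


# Hypothetical sentence assembled from the template examples on the slides.
sample = (
    "{{dated|1905-07-28|July 28, 1905}} Left "
    "{{place|Boulder, Colorado|Boulder, Colo}} and saw a "
    "{{taxon|Sayornis saya|Say Phoebe}} and several "
    "{{taxon|Carduelis pinus|siskins}}."
)
records = to_records(extract_annotations(sample))
for record in records:
    print(record)
```

Keeping both the formal name and the text as Henderson wrote it mirrors the two template arguments, so a record can always be traced back to the transcription. A regular expression is enough here only because these annotations are flat; nested templates would need a real wikitext parser.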
