Taking TL-2 Online: A Linked Data Resource
Martin R. Kalfatovic, Joel Richard
Smithsonian Libraries
TDWG 2011 Annual Conference, New Orleans, LA
18 October 2011
TL-2 print volumes … by the numbers

Volume    Year  Pages           Coverage  Numbers
TL        1967  xx, 556 pp.     A-Z       [not numbered]
TL-2/1    1976  xl, 1136 pp.    A-G       1-2223
TL-2/2    1979  xviii, 991 pp.  H-Le      2224-4483
TL-2/3    1981  xii, 980 pp.    Lh-O      4484-7174
TL-2/4    1983  ix, 1214 pp.    P-Sak     7175-10,104
TL-2/5    1985  [v], 1066 pp.   Sal-Ste   10,105-13,105
TL-2/6    1986  [v], 926 pp.    Sti-Vuy   13,106-16,459
TL-2/7    1988  lvi, 653 pp.    W-Z       16,460-18,785
Suppl. 1  1992  viii, 453 pp.   A-Ba      18,786-20,458
Suppl. 2  1993  vi, 464 pp.     Be-Bo     20,459-22,485
Suppl. 3  1995  vi, 550 pp.     Br-Ca     22,486-25,190
Suppl. 4  1997  vi, 614 pp.     Ce-Cz     25,191-28,566
Suppl. 5  1998  viii, 431 pp.   Da-Di     28,567-30,948
Suppl. 6  2000  vi, 518 pp.     Do-E      30,949-33,658
Suppl. 7  2009  xviii, 469 pp.  F-Frer    33,659-35,497
Suppl. 8  2009  viii, 560 pp.   Fress-G   35,498-37,609
While TL-2 is an essential reference tool to plant scientists, reference librarians, and catalogers, it is less well known to the broader natural sciences community. The community will immediately benefit from open online access to the detailed treatments of 9,072 authors and 37,609 numbered citations.... The content provides precise dates of publication, details about specific titles, editions, and related publications, associated authors, biographical details including dates, education and career highlights, and disposition of herbaria.
- Judy Warnement, Harvard Univ. Botany Libraries
Scope of the Project
> Print version: 15 volumes
> Pages: ~11,000
> Characters per page: avg. of 3,400
> Author entries: ~44,000
> Image files: ~9 GB in size
Project Starts: 2010
> IAPT and Smithsonian Libraries sign agreement to create a new online version of TL-2
> Funded by the Smithsonian's Atherton Seidell Endowment
Image and File Creation
> Scanning of print volumes done through Internet Archive
> 100% quality control review of scanned pages by SIL staff
> Text re-keyed to 99.97% accuracy
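As a rough, back-of-the-envelope check on what 99.97% accuracy means at this scale, the sketch below combines the approximate page and character counts quoted for the project (about 11,000 pages averaging 3,400 characters each) with the re-keying accuracy. The inputs are rounded slide figures, and reading the accuracy as a per-character rate is an assumption, so the result is only an order-of-magnitude estimate.

```python
# Rounded figures quoted on the slides.
PAGES = 11_000
CHARS_PER_PAGE = 3_400
ERRORS_PER_10K_CHARS = 3  # 99.97% accuracy, assumed to be measured per character


def rekeying_estimate(pages, chars_per_page, errors_per_10k):
    """Return (total characters, expected wrong characters, wrong characters per page)."""
    total_chars = pages * chars_per_page
    expected_errors = total_chars * errors_per_10k / 10_000
    return total_chars, expected_errors, expected_errors / pages


total, errors, per_page = rekeying_estimate(PAGES, CHARS_PER_PAGE, ERRORS_PER_10K_CHARS)
print(f"{total:,} characters, about {errors:,.0f} residual errors, about {per_page:.2f} per page")
```

Under these assumptions the text runs to roughly 37.4 million characters, with on the order of 11,000 residual errors, or about one per page.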
Project Milestones I
> January 2011: Scanning of print volumes complete; image files on BHL
> December 2010: IAPT staff provide machine-readable versions of more recent volumes
Project Milestones II
> November 2011: Completion of text conversion of TL-2
> September 2011: Test conversion done and conversion methodologies approved
Project Milestones III
> January 2012: TL-2 Online, version 1.0, publicly available via a Smithsonian Institution Libraries website. This version will provide, at a minimum, all the functionality currently provided by the limited-access version.
Example of Conversion Specs
> Introduction sections can be omitted. Introductory text to the indexes can be omitted.
> Accented letters and diacritics must be preserved.
> The beginning of all non-indented lines should be indicated with a <br/> tag.
> Indented lines of text are not indicated.
> Bold and italicized text should not be indicated.
> Hyphenated words will be maintained throughout the text.
> The presence of a table should be indicated by a <Table/> tag. No other parsing should be done.
> Each line in the indexes will be converted to simple XML, but not parsed into fields.
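To make the line-marking and index rules concrete, here is a minimal sketch of how they might be applied to already-keyed text. It is not the conversion vendor's actual process: the function names, the `<entry>` element name, the handling of blank lines, and the sample lines are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def mark_lines(lines):
    """Prefix each non-indented line with <br/>; leave indented lines unmarked."""
    marked = []
    for line in lines:
        if not line.strip():
            continue  # blank lines are skipped (an assumption; the specs do not mention them)
        if line[0] in " \t":
            marked.append(line.strip())  # indented line: no tag added
        else:
            marked.append("<br/>" + line.strip())
    return marked


def index_line_to_xml(line):
    """Wrap one index line in a single element without parsing it into fields."""
    # "entry" is a hypothetical element name, not one given in the specs.
    return "<entry>" + escape(line.strip()) + "</entry>"


# Made-up sample lines; accented letters and hyphens pass through unchanged.
sample = ["Müller, Example", "    continuation of the same entry", "Well-known, Example"]
for marked_line in mark_lines(sample):
    print(marked_line)
print(index_line_to_xml("Müller, Example & Co."))
```

Because Python strings are Unicode, accented letters and diacritics survive without special handling; only XML-reserved characters such as `&` are escaped in the index output.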
What do you Want!!!
> February 2012 forward: TL-2 Online, version X.0. Additional functionality will be added to TL-2 Online; the details of this functionality will be developed with input from the botanical and taxonomic community.
Planned Future Developments I
Initially, TL-2 will be presented in a basic website that is searchable by keyword, botanist name, TL-2 title number, or TL-2 botanist or title abbreviation. The website will display the search results with the scanned pages (as zoomable JPEGs) and the parallel OCRed and corrected text. The full OCRed text may be made available for download, and the scanned pages can also be browsed in a "page turning" application. This will be the form that the TL-2 site will take before migration to the Libraries' Digital Library website next spring.
Planned Future Developments II
The second round of planned improvements to the TL-2 site includes implementing Linked Open Data for the entire TL-2 dataset. This computer-friendly format will enhance the reusability of the TL-2 data for projects now and in the future. Each botanist and title will have a unique URI on the Libraries' website. This URI will be a permanent, authoritative location on the web for the botanists and titles and for information about each, in both human-readable form (via HTML) and computer-readable form (via RDF/XML). The implementation of Linked Open Data also facilitates the creation of a SPARQL endpoint, which allows the data contained in our website to be queried like a database.
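As an illustration of what a computer-readable record and a SPARQL request could look like, the sketch below builds a tiny RDF/XML description of one botanist and a query URL for an endpoint. It is only a sketch under stated assumptions: the base URI, the identifier `linnaeus`, the endpoint address, and the choice of the FOAF vocabulary are all hypothetical and are not taken from the project.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
BASE_URI = "http://example.org/tl2/botanist/"  # hypothetical, not the real site


def botanist_rdf_xml(botanist_id, name):
    """Serialize one botanist as an RDF/XML document with a foaf:name."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    person = ET.SubElement(
        root,
        f"{{{FOAF_NS}}}Person",
        {f"{{{RDF_NS}}}about": BASE_URI + botanist_id},
    )
    ET.SubElement(person, f"{{{FOAF_NS}}}name").text = name
    return ET.tostring(root, encoding="unicode")


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL; the query text goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE { ?person foaf:name "Carl Linnaeus" }"""

rdf_xml = botanist_rdf_xml("linnaeus", "Carl Linnaeus")
query_url = sparql_get_url("http://example.org/tl2/sparql", QUERY)
print(rdf_xml)
print(query_url)
```

The same URI would serve the HTML page to people and the RDF/XML document to software; how that choice is made (for example, by HTTP content negotiation) is left out of this sketch.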
Planned Future Developments III
We plan to add to our linked data by parsing the herbaria names, with the goal of linking them to their real names in the TL-2 index and to an external location on the Web. Once the herbaria are identified and linked, they can be searched forwards, by listing the herbaria containing a botanist's plant specimens, and backwards, by indicating which TL-2 botanists contributed to a given herbarium.
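The forwards and backwards searches amount to keeping two indexes over the same links. A minimal sketch, assuming the herbaria have already been parsed out of each botanist's entry; the botanist names and herbarium codes below are placeholders, not TL-2 data.

```python
from collections import defaultdict


def build_indexes(records):
    """Build botanist -> herbaria and herbarium -> botanists lookups from parsed records."""
    forward = defaultdict(set)   # botanist -> herbaria holding their specimens
    backward = defaultdict(set)  # herbarium -> botanists who contributed to it
    for botanist, herbaria in records.items():
        for herbarium in herbaria:
            forward[botanist].add(herbarium)
            backward[herbarium].add(botanist)
    return forward, backward


# Placeholder records for illustration only.
records = {
    "Botanist A": ["H1", "H2"],
    "Botanist B": ["H2"],
}
forward, backward = build_indexes(records)
print(sorted(forward["Botanist A"]))  # herbaria for one botanist
print(sorted(backward["H2"]))         # botanists for one herbarium
```

In a linked-data setting the same two directions come for free from a single set of triples; the explicit dictionaries here just make the idea visible.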
Planned Future Developments IV
Additionally, we plan to match each botanist in TL-2 to their record in the Virtual International Authority File (VIAF) to improve identification of the botanist and the ability to link to other sources on the internet. Similarly, we hope to decode and resolve the bibliographic entries for each botanist and link them to the Biodiversity Heritage Library or other appropriate online databases.
Planned Future Developments V
Finally, each botanist may have one or more species that are named after them. This information includes a genus name, the person who named the genus (the author), and the year that the name was created. We aim to identify the species names and link them to the Encyclopedia of Life, the Biodiversity Heritage Library, or other more appropriate online databases of species names. Additionally, we would like to connect the author to his or her record in TL-2, if it exists, thereby creating additional internal cross-links in TL-2.
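Pulling the name, author, and year out of such an entry could start with a simple pattern match. The sketch below assumes a hypothetical entry layout of `Name Author (year)`; the real layout in TL-2 may differ, and the sample string is invented.

```python
import re
from typing import NamedTuple, Optional


class Eponym(NamedTuple):
    name: str
    author: str
    year: int


# Hypothetical layout: capitalized name, author string, four-digit year in parentheses.
EPONYM_PATTERN = re.compile(
    r"^\s*(?P<name>[A-Z][a-z]+)\s+(?P<author>.+?)\s+\((?P<year>\d{4})\)\s*$"
)


def parse_eponym(entry: str) -> Optional[Eponym]:
    """Return the parsed parts, or None when the entry does not fit the assumed layout."""
    match = EPONYM_PATTERN.match(entry)
    if match is None:
        return None
    return Eponym(match["name"], match["author"], int(match["year"]))


parsed_entry = parse_eponym("Examplea J. Smith (1850)")  # invented sample entry
print(parsed_entry)
```

Entries that do not fit the pattern return `None` rather than a guess, so they can be set aside for manual review before any links are created.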
Can you name the botanists you saw?
Herman Boerhaave, Nathaniel Lord Britton, A.P. de Candolle, Carolus Clusius, Cadwallader Colden, Erasmus Darwin, R.L. Desfontaines, Larry Dorr, Henry Englefield, Joseph Dalton Hooker, C.M. Hovey, N.J. von Jacquin, Bernard de Jussieu, Carl Linnaeus, C.F.B. Mirbel, Dan Nicolson, J.W. Palmstruch, Richard Pulteney, Henry Shaw, Martin Vahl, Judy Warnement
Thanks to the following collaborators on the project
> Susan Fraser, Don Wheeler, Judy Warnement, Doug Holland, Chris Freeland
> IAPT, Unlimited Priorities, Internet Archive, Data Conversion Lab
Smithsonian Team
> Gilbert Borrego, Grace Costantino, Larry Dorr, Robin Everly, Sue Graves, Suzanne Pilsk, Joel Richard, Keri Thompson