LITA National Forum 2012


A follow-up to our 2011 presentation on the new Linked Open Digital Library, discussing how we are creating a digital library centered on Linked Open Data. Includes details on how we are creating a dataset of botanists and their publications to be shared as linked open data.

  • (2-3 min) Open with an introduction of who SIL is and what we do. (Old Slides 1 and 2) Questions: How many know SI has libraries? How many have visited the libraries? How many want to visit?
  • To recap from last year: we gave a solid introduction to linked data and showed how Drupal 7 supports it out of the box via its built-in RDF module and the RDFx (RDF Extensions) module.
  • We talked about what RDFa might look like in a webpage or RDF/XML stream that we are creating.
  • We discussed TL-2 (Taxonomic Literature II), a reference tool for botanists, and how we are converting it to Linked Open Data.
  • And finally, for TL-2 we offered some idea of the kind of data that we might be producing in RDF. This is yet another representation of the linked data, this time in N-Triples format.
  • Finally, we talked about how open data (not linked open data) is benefiting the Biodiversity Heritage Library. If you spend any amount of time around me, you’ll find that I will eventually come around to talking about this.
  • And some examples of how people have used open data. This person mapped the usage of certain animal names over time and how they fall in or out of favor as time progresses. Those bars are time periods of 200 years of natural history literature.
  • SLIDE: Overview of Linked Data (concept, statistics)
  • This is linked data in action: the Google Knowledge Graph. Google acquired Metaweb in 2010 and, in the process, got Freebase, which was eventually used to create this new pane of information on Google.
  • SLIDE: Details of Linked Data (diagram of triple)
  • (5 min) Review our discussion from last year. Sharing knowledge is our prime directive. Linked Data is a no-brainer. Not going to review what linked data is (unless we need to?) SLIDE: Overview of our website (statistics, content, etc.) (LITA 2011 page 6) Questions: How many know what linked data is? Do we need to review?
  • SLIDE: Content that could be linked data (LITA 2011 page 9) Quick review of what things have good metadata for linking. We said we would have something up in about one year. (Ha!) Last year I reviewed some of the details of how we are converting to linked data.
  • (Show this again, but only briefly) (Old Slides 20, 21, 22) SLIDE: Details of Darwin's linked data fields (LITA 2011 page 22) TAKEAWAY: Know your data (or whatever it is you’re sharing). Become intimately familiar with it. Take it on a date.
  • List some of the modules we are using (Old Slide 15) SLIDE: List of Drupal Modules (LITA 2011 page 15) Questions: How many of you are using linked data? What data do you have that could be useful if linked? Know that if you raise your hand, I'm going to pick on you throughout the rest of the talk. :) Disclaimer: We are still learning as we go! Even we, the Smithsonian, are figuring things out. We are also constrained by budgets, personnel and other requirements, possibly more so as a government entity.
  • SLIDE: We are still learning. First we had to decide what a Digital Library was. Our instinct is to go online and see what other people are doing. This is fine and all, but I think it's safe to say that we know what data we have, and we know what we are doing as we move from an old website to a new one. What of it belongs in the digital library? Well... here's what we have. It’s safe to say that we know our data, though we may go to others to see how to present that data. We’ll also use focus groups and usability studies to analyze the site once we have a beta. TAKEAWAY: You’ll always be learning. :) If you stop, you become irrelevant.
  • SLIDE: What is a digital library? Books? Images? Exhibitions? Databases? Research papers? All of these things? Question: How many of you have a “digital library”? Want one? Question: Is anyone out there working with data that doesn't fall into these categories? I'm curious as to what else might be out there.
  • As far as large bodies of linked data go, two currently stand out as really useful datasets. SLIDE: Two data sets: TL2 and Index Animalium, numbers of records, types of data. For us, the things that make sense to publish as linked data are TL2 (47k records) and Index Animalium (500k records). TL2 is almost there. IA has a long, long way to go. We'll come back to that. The first phase of our process was to get us on Drupal. This actually took longer than we'd hoped due to the planning required by our nature as a government institution. We have a certain level of planning and security analysis that must be done. That said, we have a simple brochure-ware website that is online at licensed their base metadata for their collections as CC0. Talk to the lawyers first. TAKEAWAY: Creative Commons (or CC0) licensing of metadata is becoming popular. We have a CC-BY license for TL2. Index Animalium is public domain. I think. We are libraries and we have a lot to share with the internet. Let’s make it happen so that others don’t.
  • SLIDE: TL2 website as it is today. How do we get it into Drupal? We use a module! Drupal is capable of handling millions of records, but getting those records into Drupal is not the easiest thing in the world. How do we import 430,000 species names for Index Animalium? Question: How many others are using a CMS? Drupal? (what is the name of that MS Technology to compete with Drupal?) PHP? ASP? Java? Others? Question: Is anyone developing in Drupal? Modules? Themes?
  • Now that we are on Drupal, we can move forward with some data! Yeah! Bring on the import! Disclaimer: The actual steps are specific to Drupal, but you may find yourself in a similar situation of trial and error. Last time I reviewed how we were going to take this taxonomic literature thing to linked data. We have something almost online, but let's review where we are... We first imported via Feeds Importer (Question: anyone familiar?). Then we had to import again. Oops, the data was wrong again, so we had to import AGAIN. Three weeks later, I gave up. It was too slow and too painful. SLIDE: Feeds importer: 7 hours. 47,000 records in 7 hours? 1.8 rec/sec. Dismal!
  • So I wrote a module! Yay! Module development! This makes sense. But there was one major challenge: I didn't know how to build modules in Drupal. So I learned. And then I realized that I could import the data as part of the installation of the module. Import times dropped to 81 minutes (an improvement, as I could control what the database was doing and minimize database traffic). SLIDE: Drupal Module development is hard! Steep learning curve. List APIs that I had to become familiar with: Field. Node. Theme. Styling. Preprocess Functions. Render Elements. And THEN I learned that we could use the versioning of modules to update the data down the road, either to create new database fields, munge the data, etc. This is a nice feature since we couldn't do that before. (A 12-15 hour downtime for our TL-2 site would have been a bad thing indeed.) TAKEAWAY: Consider your options; the easy way is not always faster/better.
  • So, we decided to use another module! Home grown! Versioned Data! We needed something to manage the delivery of the books using the IFRAME version of the Internet Archive book-reader. But uploading the data is even better. This time we were able to import in about 5 minutes. This handles the books, authors, vocabularies, subjects (FAST?), places as subject, timeframe as subject. It also handles the links between them. Much of this data came out of the MARCXML record, but sometimes we used MODS (where it was easier). Synchronization issues regarding the book metadata between IA, SIRIS, Picklist and Drupal. FUN! What do you do when your data lives in multiple places? One master, many slaves? Multi-master? Mixed bag of drunken cats? SLIDE: And books have linked data, too! We're not sure how we are going to link it, but at least we'll have OCLC number, author name to VIAF, etc.
  • Before we began really building our site, we needed to firm up our data model and make sure we had a good idea of how everything is going to relate to each other. This is an example of what the British Library created. I think they were very thorough and included a lot of detail. It is probably overkill for what we want to do, but who knows, we may end up in the same place, though maybe not in such an explicit manner.
  • How do we structure our data? How do we organize it? What vocabularies will we be using? QUESTION: For those who are familiar with LOD, are you using any vocabularies other than these? Anyone making their own?
  • Talk about Galaxy of Images and other elements in the digital library. Plates and other pretty pictures. Show the website for GOI. Search page, etc. Highlight the balloon that was StumbleUpon-ed and boosted our traffic 100-fold. Show a picture of the GA chart of the traffic. The data needs some cleanup. Standardization of the subjects metadata. Images need to be moved into DAMS (Artesia digital asset management system). This is being done in coordination with the manager of the GOI and our metadata team. As an aside, one of the things we do to get new pretty images is to capture the plates from our metadata collection thingy for the BHL. We divert a stream of data of the "pretty pictures" from there into the Galaxy of Images through a mostly automated process. This will automatically upload. (For ongoing projects, stress automation where possible. Take humans out of the equation. As smart as we are, we make mistakes. Code doesn't, unless we make mistakes in our code, and it frees us to do other things.)
  • Talk about Videos. Over 8 or 10 years of them; we needed to round them all up and get them organized. Lectures, animations, videos, interviews, demos, informational things. 30-40 of them? All are (or should be) on YouTube at this point in time. Ultimately we will serve them from our DAMS. Centered around our content, exhibitions, etc.
  • Collections / Exhibitions: Arbitrary collections of things. Exhibitions, too. Collections: arbitrary grouping of things under a heading (category) with maybe some introductory text. Exhibition: Same thing, but more sequential, telling a story or narrative. Order becomes more important than in collections. Possibly more words. Bibliographies, lists of things, subject guides: Legacy content. Not sure if we need to keep it alive. Is it something that people continue to use? We’ll check out our analytics. HOWEVER, as they are tied to the library itself, we’ve already had to migrate them to the new site. Perhaps a bit of wasted effort, but at least it’s easier to manage now. Trade Literature: Describe them: scientific instruments, sewing machines! How are they catalogued? (They are not.) Catalogued by manufacturer, well, inventoried. Nothing is scanned. We would like to scan them, but it poses some of its own challenges in how we organize the content. Each catalog can’t be a record in our, um, catalog, can it? SI Publications: Collecting the output of the researchers at the Smithsonian to gauge their … effectiveness, reach, influence, (Klout?) Currently in DSpace, will likely stay there, but we want to index and search it via the website, see: Summon Discovery Layer. Blog: The blog is part of the website, too, but as it lives off in its own world, we don’t really need to concern ourselves with it because it’s not really part of the digital library per se. TAKEAWAY: Each set of content that you have may be different from the others. Creating a digital library is not going to be an easy task.
  • Todo in the future: Made our own vocabulary for TL2: turns out we only needed two or three terms. The Biography vocabulary had much of what we needed already. Plan the migration of our exhibitions, which will lay the foundation for other online collections. Migrate our images into our DAMS, refining the metadata in the process, which will spare us from having to store all these images on our web server. Figure out a method of handling collections and the arbitrary ordering of things. Is there a module? Should we make one? Should we reuse things that already exist? (Yes!) List some of the other tools that people might use for LOD. Take from my talk at SLA. Discuss Summon and the giant black box that it is. It’s on the way; it will be the discovery layer for our entire site. All our data needs to get into it, including our catalog, our licensed content, all website content, blog content. API development and integration with Drupal are a big mystery. Do I see another module in my future? :) If so, it will be similar to that of the Google Search Appliance module. How to leverage LOD for more stuff: Artists Files, Trade Lit, etc., linking to our catalog, history books, etc. TAKEAWAY: A website is a living, breathing, growing beast. It needs care and feeding and love and attention to keep it going.
  • Open the floor for questions
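The import figures quoted in the notes above (47,000 records in 7 hours with Feeds Importer, then 81 minutes with a custom module that minimized database traffic) can be put side by side with a small sketch. This is not the Drupal module described in the talk: it is a Python illustration that uses the standard-library `sqlite3` module as a stand-in database, and the table name `tl2_record`, the column names, and the batch size are all hypothetical. It only shows the general idea of cutting database round trips by committing once per batch instead of once per record.

```python
import sqlite3


def records_per_second(records, seconds):
    """Throughput of an import run, in records per second."""
    return records / seconds


def batched(rows, size):
    """Yield consecutive slices of `rows`, each at most `size` long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def import_records(conn, rows, batch_size=1000):
    """Insert (id, title) rows using one transaction per batch.

    The table name and columns are hypothetical; the real TL-2 schema
    is not given in the talk.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tl2_record "
        "(id INTEGER PRIMARY KEY, title TEXT)"
    )
    for batch in batched(rows, batch_size):
        # `with conn` commits once when the block succeeds,
        # so each batch costs a single commit.
        with conn:
            conn.executemany(
                "INSERT INTO tl2_record (id, title) VALUES (?, ?)", batch
            )
    return conn.execute("SELECT COUNT(*) FROM tl2_record").fetchone()[0]


# Figures quoted in the speaker notes.
feeds_rate = records_per_second(47_000, 7 * 3600)   # Feeds Importer run
module_rate = records_per_second(47_000, 81 * 60)   # custom module run
```

With the numbers from the notes, the two rates differ by roughly a factor of five. How much of that gain came from batching versus other changes in the custom module is not stated in the talk.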

    1. Building the New Open Linked Library (Revisited)
       Joel Richard, LITA National Forum 2012, October 5, 2012
    2. Smithsonian Libraries
       • Founded in 1846
       • 1.5 million volumes in collection, plus assorted archival collections
       • 15,000 volumes scanned and online
       • 20 libraries serving ~500 researchers/curators + hundreds of fellows and interns
       • 105 library staff
       • 1.5 web staff
       • Founding member of the Biodiversity Heritage Library
       Image: Le Garde-meuble, ancien et moderne [Furniture repository, ancient and modern], 1839-1935
    3. (From 2011) Drupal and Linked Data
       • Native support for RDFa in Drupal 7.
       • RDF Extensions (rdfx): even more features.
       • Vocabularies can be imported and cached for reuse.
       • Few or no modifications to HTML to support RDFa.
       What’s the difference between RDF, RDF/XML and RDFa?
       LITA National Forum, September 30, 2011
    4. (From 2011) RDF/XML Sample
       URI:
       <?xml version="1.0" encoding="UTF-8"?>
       <rdf:RDF xmlns:rdf="" xmlns:dc="" xmlns:bibo="">
         <rdf:Description rdf:about="http://localhost:8087/content/origin-species">
           <rdf:type rdf:resource=""/>
           <dc:title>The Origin of Species</dc:title>
           <dc:created>November 24, 1859</dc:created>
           <bibo:numPages>1000</bibo:numPages>
           <dc:language>english</dc:language>
           <bibo:authorList rdf:resource="http://localhost:8087/content/darwin-charles"/>
           <owl:sameAs rdf:resource=""/>
         </rdf:Description>
       </rdf:RDF>
    5. TL-2 Page Sample (From 2011)
       Diagram labels: tl2:creatorOf, owl:sameAs, dc:creator, owl:sameAs, originofspecies00darwuoft
    6. TL-2 Page Sample Results (From 2011)
       Person:
       • owl:sameAs “”
       • foaf:lastName “Darwin”
       • foaf:familyName “Darwin”
       • foaf:firstName “Charles”
       • foaf:givenName “Charles”
       • foaf:name “Darwin, Charles Robert”
       • skos:prefLabel “Darwin, Charles Robert”
       • tl2:birthYear “1809”
       • tl2:deathYear “1882”
       • tl2:description “British evolutionary biologist”
       • tl2:personAbbrev “Darwin”
       Book:
       • dc:creator “”
       • owl:sameAs “originofspecies00darwuoft”
       • tl2:bookNumber “1313”
       • bibo:shortTitle “On the origin of species”
       • dc:title “On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.”
       • event:place “London”
       • dc:publisher “John Murray”
       • dc:created “1859”
       • tl2:bookAbbreviation “Origin sp.”
    7. (From 2011)
    8. (From 2011) Who is reusing our data? Ryan Schenk
    9. (From 2011) Who is reusing our data? Encyclopedia of Life
    10. Linked Data Review
       • Publishing structured data on the web
       • RDF (Resource Description Framework)
       • Enables computer-to-computer queries
       • Uses standard ontologies (vocabularies)
       • Data is presented as “triples”
       Diagram: URI, owl:sameAs, Object
    11. Linked Data In Action: Google Knowledge Graph
    12. Linked Data Review (diagram of triples)
       • Charles Darwin, Born On, “Feb 12 1809”
       • Charles Darwin, Born In, Shrewsbury
       • Charles Darwin, Type, Person
       • Shrewsbury, Type, City
       • Shrewsbury, Is In, England
       • England, Type, Country
    13. Our Website
       Organically grown since 1995
       • 83,000 HTML pages
       • 3,700 ColdFusion pages
       • 253,000 JPEG files
       • 27,000 PNG files
       • 46,000 PDFs
       No CMS for legacy information
       Now using Drupal for “Brochure-ware”
    14. Content Analysis
       • 400+ Online “books”
       • Exhibitions
       • Research Tools
       • Image Collections (16,000+ images)
       • “Brochure” content (About us, Locations, Hours)
       • Bibliographies, Fact Sheets, Subject Guides
       • Databases, inventories, and database-like books
       Collections not on our website:
       • ~15,000 digitized volumes, with many more planned
       • Other analog collections that will be digitized
       Images: Bureau of American Ethnology Bulletin 164; Sewing Machine Trade Literature; Underwater Web Exhibition, Smithsonian Libraries
    15. Linked Data in our Library
       Books (and book-like objects)
       • Expose bibliographic data for reuse
       • Consume links to other internal content and external authoritative data
       Databases
       • Expose data previously unavailable
       • Provide authoritative data
       • Consume our data and others’ to create new aggregate websites
    16. Linked Data in our Books
       RDF Type = foaf:Person
       • foaf:lastName, foaf:familyName
       • foaf:firstName, foaf:givenName
       • foaf:name, skos:prefLabel
       • tl2:birthYear
       • tl2:deathYear
       • tl2:description
       • tl2:personAbbrev
       RDF Type = bibo:Book
       • tl2:bookNumber
       • dc:title
       • event:place
       • dc:publisher
       • tl2:bookAbbreviation
       • dc:created
    17. Linked Data Tools (Drupal)
       • Fields, Views, Views UI
       • Node Reference
       • SPARQL Endpoint, SPARQL API
       • RESTful Web Services
       • SPARQL Views
       • RDF External Vocabulary Importer
       Caveat: Some modules not ready for Drupal 7
       • e.g., Biblio module (no CCK, RDF capabilities)
    18. Disclaimer
       We are still learning!
       • How to effectively use Drupal
       • What goes into a Digital Library
       • How to best leverage Linked Open Data
       (Also: We will always be learning.)
       Image: J. L. Hammett, Illustrated Catalogue of School Merchandise 1872-1873…, 1872-1874
    19. What is a Digital Library?
       • More than a virtual stack of books
       • Digital allows more capabilities, access
       • Interlinked Content (See more from this item)
       What content will be in our digital library?
       • Digitized Books
       • Image Library
       • Collections (of things)
       • Exhibitions
       • Databases
       • Lists / Bibliographies
       • Smithsonian Publications
       • Videos
       • “Trade Literature” and other non-cataloged items
    20. Knowledge/Data Sharing
       Taxonomic Literature II
       • Essential botanical reference
       • 15 volumes
       • 9,000 Botanists
       • 37,000 Titles authored by these botanists
       • More modern, simpler to handle
       Index Animalium
       • 35 Volumes
       • 430,000 Scientific Names
       • Each with a citation to first description
       • 7000+ items in the bibliography, many linked to WorldCat
       • Older, challenging in nature
    21. Our Process for TL-2
       • Scanned the pages
       • Hired contractor for OCR and correction (99.97% accuracy)
       • Received XML dataset from Contractor
       • Verified and Imported to SQL Server
       • Built a website to search the data
    22. TL-2 Today
    23. Before we import…
       What exactly does 99.97% accuracy mean? ~12,000 Errors
    24. Importing
       Millions of records are no problem for modern databases. But, how to get data into Drupal?
       • Use existing tools?
       • Create my own import?
       Image: The Muralo Company, Muralo: Sanitary Wall Coatings in the Home, 1912
    25. Importing
       Import via existing tools
       • Used Drupal’s Feeds Importer
       • Typically used for importing RSS or similar
       • Fast to set up (< 5 minutes)
       • Slow to import (47,000 records = 8+ hours)
       • Poor error recovery (imported 5 times)
       • What if the data changes in the future?
       Faster ≠ Better
    26. Importing
       Write my own import. But how?
       • Make a Drupal Module!
       • Steep Learning Curve (many APIs)
       • Faster to set up (48,000 records = 85 minutes)
       Added bonus: Modules can be versioned
       • Can use the “version update” code to update our data
       • Versioned modules good for Dev / Prod servers
    27. Importing
       Digitized Books Online
       • Similar module for importing
       • Module also handles a page for reading books online
       • Uses Internet Archive book reader in an <IFRAME>
       • Links to WorldCat / VIAF
       • FAST Subjects
       • Table of Contents Navigation
       • Eligible for Linked Open Data
    28. Data Schema: British Library
    29. Data Schema
       What data model are we going to use?
       • British Library
       • Something else?
       What vocabularies are we using?
       • Dublin Core
       • OWL
       • SKOS
       • BIBO
       • BIO
       • FOAF
       • Event?
       • Org?
       • Geo?
       • Our own vocabulary for TL-2
    30. Other Content
       Galaxy of Images
       • Image collection of plates from our digitized books
       • 18,000 images and growing
       • Richer set of metadata
       • Data needs to be massaged / imported
       • Images served from another system
    31. Other Content
       Videos
       • All are currently on YouTube
       • Will remain there for now
       • Metadata to be imported to Digital Library
       • Will eventually be served from our network
    32. Other Content
       • Collections and Exhibitions
       • Bibliographies, lists, subject guides
       • Trade Literature: Sewing machines! Scientific equipment! Seed Catalogs!
       • Smithsonian Publications (DSpace)
       • Smithsonian Libraries Blog
       • Art and Artist Vertical Files
       Image: W. Atlee Burpee & Co., Burpees New Annual for 1910, 1910
    33. Future Work
       • More planning!
       • Developing a LOD Vocabulary for TL-2
       • Continued parsing of content in TL-2
       • Continuing the development of the Index Animalium content
       • Publishing the Index Animalium on the web as LOD
       • How to leverage linked data to create… what?
       Image: Leopoldo Galluzzo, Altre scoverte fatte nella luna dal Sigr. Herschel, 1836
    34. Thank you! Joel
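
The person properties shown on the TL-2 sample slides can be written out as N-Triples, the line-per-triple format mentioned in the speaker notes. The following is a minimal sketch in Python, not the code behind the TL-2 site. The FOAF namespace is the published one; the `tl2` namespace URI is a placeholder because the slides do not give it, and the subject URI is borrowed from the 2011 RDF/XML sample. Only plain string literals are handled (no datatypes or language tags).

```python
FOAF = "http://xmlns.com/foaf/0.1/"
TL2 = "http://example.org/tl2/"  # hypothetical: the real tl2 namespace is not given

# Subject URI taken from the 2011 RDF/XML sample slide.
DARWIN = "http://localhost:8087/content/darwin-charles"


def escape_literal(value):
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(subject, properties):
    """Serialize (predicate URI, literal value) pairs as N-Triples lines."""
    lines = []
    for predicate, value in properties:
        lines.append(f'<{subject}> <{predicate}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Literal values as they appear on the sample results slide.
darwin_properties = [
    (FOAF + "name", "Darwin, Charles Robert"),
    (FOAF + "familyName", "Darwin"),
    (FOAF + "givenName", "Charles"),
    (TL2 + "birthYear", "1809"),
    (TL2 + "deathYear", "1882"),
]

ntriples = to_ntriples(DARWIN, darwin_properties)
```

Each line of `ntriples` is one subject, predicate, object statement ending in a period, which is the same triple shape drawn on the Linked Data Review slides.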