Repository Development at LC - Access 2009

Given at Access 2009 in Charlottetown, PEI. Watch video of the actual talk at http://hosting2.epresence.tv/UPEI/1/watch/72.aspx

Published in: Technology, News & Politics


Transcript

  • 1. Repository Development at LC Daniel Chudnov - 2009-10-01 - dchud at loc gov Access 2009 - Charlottetown, PEI
  • 2. who we are / what we do / what’s next
  • 3. who we are
  • 4. 30ish people: dev, QA, PM, ops; from libs, uni, industry, etc.
  • 5. OSI: Office of Strategic Initiatives
  • 6. “...capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 7. or, to be precise
  • 8. “capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 9. “capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 10. “capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 11. “capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 12. “capture the digital artifact, register and/or deposit it for the Copyright Office, pass it along to those who decide whether to include it in the Library, and allow it to be incorporated digitally in the collection, with the optimum flow-through of information for registration, cataloging, indexing, and preservation.” (search for “LC21”)
  • 13. what we do
  • 14. “capture the digital artifact”
  • 15. at scale
  • 16. world scale then web scale
  • 17. wdl.org
  • 18. partners all over the world
  • 19. content from all over the world
  • 20. users all over the world
  • 21. wdl.org/ru/
  • 22. wdl.org/zh/
  • 23. wdl.org/ar/
  • 24. launched April 2009
  • 25. lots of press
  • 26. 9,026 req/s, 1.25 Gbit/s on day one
  • 27. no crash, just bugs (yay!)
  • 28. that was new for LC
  • 29. how?
  • 30. solaris, apache, nginx, mysql, solr, django, jquery
  • 31. clean URIs, static pages
  • 32. global edge caching
  • 33. what we do
  • 34. capture the artifact pass it along cataloging, indexing
  • 35. chroniclingamerica.loc.gov
  • 36. 139,582 title records
  • 37. 1,442,462 pages
  • 38. freely available now: download whole issues - tell friends - mash it up
  • 39. 100+ TB; 16 of 50+ states/terr. and growing quickly
  • 40. how?
  • 41. solaris, apache, mysql, solr, django
  • 42. clean URIs, page caching
  • 43. capture the artifact pass it along cataloging, indexing, preservation
  • 44. preservation storage “movage”
  • 45. capture the artifact
  • 46. BagIt: packing slip for data
  • 47. data in a Bag
      .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  • 48. .
      |-- bag-info.txt
      |-- bagit.txt                  <- identifies a bag
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  • 49. .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data                       <- where the data starts
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  • 50. .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |   ...
      |-- manifest-md5.txt           <- packing slip
      `-- tagmanifest-md5.txt        <- packing slip
  • 51. 71607ad119be88c842268a76f0b6b9e9 data/sn99021999/00206538107/1884091301/0621.pdf
      c602d2ac07508059ce5f5597e239b97f data/sn99021999/00206538120/1885100601/0831.xml
      a59795bd1584532d5cbc0b1d82f75cf8 data/sn99021999/00206538016/1880061401/0593.pdf
      3c64fac7e2d49671e0d93908ae42a779 data/sn99021999/00206539616/1888101801/0905.xml
      03158a560baa7479b3805d2b45ee02cd data/sn99021999/00206538028/1880111501/0405.tif
      fa56ea18580e1446939ed62709e5b2db data/sn99021999/00206538077/1883061901/1145.pdf
      bf4fb83ff8305e8256970a3466c1a12d data/sn99021999/00206538120/1885061501/0043.pdf
      8f3649fc812de74b9d9443ee90a8ac9c data/sn99021999/00206538120/1885111101/1109.tif
      e0b83a7f9ca228271fdaecf6348e1cec data/sn99021999/00206538120/1885101201/0871.xml
      1c2f84e12792c123ba0aabedd0c0bbad data/sn99021999/00206538107/1884071401/0197.xml
      080e557fe9f68037605e5b80df4bc4ac data/sn99021999/0020653820A/1888050701/0543.tif
      532efe32c156459d9d9589caf618f502 data/sn99021999/00206538120/1885071401/0250.tif
      ce607af59a96f2656d9448f38ffda072 data/sn99021999/0020653820A/1888052801/0731.pdf
      60b626d8fd40aca1b425e86a004bb055 data/sn99021999/00206539628/1888111801/0088.xml
      a467cd62350334c7aa83cf1e9056c1c6 data/sn99021999/00206539616/1888091701/0629.jp2
      1a434f7a4d843a2c8ffe8d0824fafc3f data/sn99021999/00206538028/1880120801/0482.jp2
      22996d89b4a3334256afaddcaa0238d8 data/sn99021999/00206538016/1874102001/0259.jp2
      36f550da273ad4c592fee1761c98322a data/sn99021999/00206538016/1880052201/0518.jp2
      7f7ccec3f2afae896338498372fd476e data/sn99021999/00206539616/1888080101/0200.pdf
      c247a5d74d0e7f857c534d935661adbe data/sn99021999/00206538107/1884072601/0286.jp2
      4d497a18a154adcc8636239378ab340b data/sn99021999/00206539628/1889021101/0868.pdf
      2e8ca2558b54b5c49b2f20a355a60895 data/sn99021999/00206538065/1882092001/0136.xml
      fb71493048e5010100f18012f5060d42 data/sn99021999/00206538028/1880123001/0569.xml
      40b100432890b055a5defbfbea815d57 data/sn99021999/00206538107/1884090901/0590.xml
      46f6d61480dadc1c988b0baa4de8b6c4 data/sn99021999/00206539628/1888122801/0463.pdf
      1cb8af0648e8c9df395b63226fe7371f data/sn99021999/00206538016/1874101501/0244.pdf
      9257834023c683b02f354888b2740b8f data/sn99021999/00206539616/1888102301/0956.xml
      0d52b3b2b1c5459b7e8d500a8566b0bf data/sn99021999/00206538120/1885080801/0425.tif
  • 52. defines two things
  • 53. 1 what i think i’m sending you
  • 54. 2 whether you received it
  • 55. just like a packing slip
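The two jobs of the packing slip (what the sender believes was sent, and whether it arrived) can be sketched as a manifest check on the receiving side. This is a minimal Python sketch, not the BagIt Library from the talk: the function names `md5_of` and `verify_manifest` are hypothetical, and it assumes each manifest line is an MD5 hex digest, whitespace, then a path relative to the bag root, as in the manifest on slide 51.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir: Path, manifest_name: str = "manifest-md5.txt") -> dict:
    """Check every manifest entry against the files actually on disk.

    Returns a dict mapping each listed path to "ok", "missing", or "mismatch".
    Assumed line format (hypothetical helper, not the official tool):
    "<md5 hex digest> <path relative to the bag root>".
    """
    results = {}
    manifest_text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag_dir / rel_path
        if not target.is_file():
            results[rel_path] = "missing"      # listed but never arrived
        elif md5_of(target) == expected.lower():
            results[rel_path] = "ok"           # arrived intact
        else:
            results[rel_path] = "mismatch"     # arrived but changed
    return results
```

Calling `verify_manifest(Path("new_bag"))` on a received bag gives a per-file report; an empty set of "missing" and "mismatch" entries means the receiver holds what the sender listed.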
  • 56. works across space
  • 57. works across systems
  • 58. works across orgs
  • 59. works across time
  • 60. easy to make
  • 61. md5deep
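To illustrate why a manifest is "easy to make", here is a small Python stand-in for what a checksum tool produces. It is a sketch under stated assumptions, not md5deep and not the BagIt Library: `write_manifest` is a hypothetical name, the payload is assumed to live under `data/`, and the output format is assumed to be one "digest, two spaces, relative path" line per file.

```python
import hashlib
from pathlib import Path


def write_manifest(bag_dir: Path, manifest_name: str = "manifest-md5.txt") -> list[str]:
    """Write one "<md5>  <relative path>" line per file under bag_dir/data.

    Hypothetical helper for illustration. Each file is read fully into
    memory, which is fine for a sketch but not for very large files.
    """
    lines = []
    payload = bag_dir / "data"
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        rel_path = path.relative_to(bag_dir).as_posix()  # always forward slashes
        lines.append(f"{digest}  {rel_path}")
    (bag_dir / manifest_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

The manifest is written next to `data/`, matching the directory layout shown on the earlier slides.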
  • 62. BIL: BagIt Library
  • 63. bvar@sun9 /ingest/bvar/test $ bag create --dest new_bag test_data/*
      12:08:47,044 [main] INFO CommandLineBagDriver : Performing operation: create 2.301112941466272:2.3
      12:08:47,141 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
      12:09:09,493 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
      12:09:09,511 [main] INFO AbstractBagImpl : Writing bag
      12:09:41,507 [main] INFO CommandLineBagDriver : Operation completed.
      12:09:41,508 [main] INFO CommandLineBagDriver : Returning 0
      bvar@sun9 /ingest/bvar/push/test_bag $ bag isvalid .
      11:55:45,582 [main] INFO CommandLineBagDriver : Performing operation: isvalid
      11:55:46,378 [main] INFO ManifestImpl : Creating manifest for manifest-md5.txt
      11:55:46,458 [main] INFO ManifestImpl : Creating manifest for tagmanifest-md5.txt
      11:55:46,540 [main] INFO AbstractBagImpl : Completion check: Result is true.
      11:56:21,273 [main] INFO AbstractBagImpl : Validity check: Result is true.
      11:56:21,273 [main] INFO CommandLineBagDriver : Result is true.
      11:56:21,274 [main] INFO CommandLineBagDriver : Returning 0
      bvar@sun9 /ingest/bvar/push/test_bag $
  • 64. Bagger
  • 65. free/open source releases from LC
  • 66. sf.net/projects/loc-xferutils/ get yours today - tell friends - start trading bags
  • 67. that was new for LC
  • 68. pass it along
  • 69. transfer inventory workflow
  • 70. transfer UI - inventory - workflow
  • 71. how?
  • 72. apache, spring/mvc, hibernate, mysql
  • 73. and other automation strategies
  • 74. lots of work still to do
  • 75. lots of integration still to do
  • 76. register/deposit for Copyright
  • 77. not my area, but
  • 78. we hope to support eDeposit with these tools
  • 79. “Deposit Demand” June 2009 Federal Register Proposed Rulemaking
  • 80. stay tuned or ask my colleagues :) (ask me whom to ask)
  • 81. but, not my area
  • 82. “allow it to be... incorporated digitally in the collection”
  • 83. “allow it to be... incorporated digitally in the collection”
  • 84. how?
  • 85. traditional approach: catalog records, exhibit sites
  • 86. cost of integrating everything is high
  • 87. cost of updating everything is high
  • 88. cost of consistent web strategies is low
  • 89. for example
  • 90. Linked Data
  • 91. use URIs as names for things; use HTTP URIs; provide useful information; include links to other URIs (http://www.w3.org/DesignIssues/LinkedData.html)
  • 92. id.loc.gov
  • 93. LCSH on the web, free
  • 94. clean URIs, follow your nose, formats
  • 95. view source
  • 96. <link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />
      <link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />
      <link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />
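"Follow your nose" here means a client can read the page's `<link rel="alternate">` tags and discover the other formats on its own. The sketch below shows one way to do that with Python's standard library. The class and function names are hypothetical, the sample HTML is just the three tags from the slide, and the base URL `http://id.loc.gov/authorities/sh00009460` is an assumption (the slide shows only relative hrefs).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The three <link> tags from the slide, used as sample input.
SAMPLE = """
<link rel="alternate" type="application/rdf+xml" href="/authorities/sh00009460.rdf" />
<link rel="alternate" type="text/plain" href="/authorities/sh00009460.nt" />
<link rel="alternate" type="application/json" href="/authorities/sh00009460.json" />
"""

# Assumed address of the page carrying those tags (not stated on the slide).
BASE = "http://id.loc.gov/authorities/sh00009460"


class AlternateLinkParser(HTMLParser):
    """Collect rel="alternate" links as {media type: absolute URL}."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        if attributes.get("rel") == "alternate" and "href" in attributes:
            media_type = attributes.get("type", "")
            self.alternates[media_type] = urljoin(self.base_url, attributes["href"])


def find_alternates(html: str, base_url: str) -> dict:
    """Return the alternate representations advertised by an HTML page."""
    parser = AlternateLinkParser(base_url)
    parser.feed(html)
    return parser.alternates
```

A client that wants JSON would look up `find_alternates(SAMPLE, BASE)["application/json"]` and fetch that URL; no separate API document is needed to find it.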
  • 97. <rdf:RDF>
        <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
          <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-11-27T10:39:57-04:00</dcterms:modified>
          <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
          <owl:sameAs rdf:resource="info:lc/authorities/sh00009460"/>
          <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
          <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme"/>
          <skos:inScheme rdf:resource="http://id.loc.gov/authorities#topicalTerms"/>
          <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-10-17T00:00:00-04:00</dcterms:created>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2008004743#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2003002637#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh00009458#concept"/>
        </rdf:Description>
        <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
          <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.)</skos:prefLabel>
        </rdf:Description>
  • 98. explicit concepts, schema, meaning
      <rdf:RDF>
        <rdf:Description rdf:about="http://id.loc.gov/authorities/sh00009460#concept">
          <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-11-27T10:39:57-04:00</dcterms:modified>
          <skos:prefLabel xml:lang="en">National parks and reserves--Prince Edward Island</skos:prefLabel>
          <owl:sameAs rdf:resource="info:lc/authorities/sh00009460"/>
          <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
          <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme"/>
          <skos:inScheme rdf:resource="http://id.loc.gov/authorities#topicalTerms"/>
          <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2000-10-17T00:00:00-04:00</dcterms:created>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2002010534#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2008004743#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh2003002637#concept"/>
          <skos:narrower rdf:resource="http://id.loc.gov/authorities/sh00009458#concept"/>
        </rdf:Description>
        <rdf:Description rdf:about="http://id.loc.gov/authorities/sh2002010534#concept">
          <skos:prefLabel xml:lang="en">Prince Edward Island National Park (P.E.I.)</skos:prefLabel>
        </rdf:Description>
  • 99. a web of data...
  • 100. ...with precise meaning
  • 101. at this URI is this concept with this meaning
  • 102. a standard way to refer to a heading
  • 103. freely available now: download the whole thing - tell friends - amaze enemies
  • 104. that was new for LC
  • 105. another example
  • 106. <link rel="resourcemap" type="application/rdf+xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf" />
      <link rel="alternate" type="image/jp2" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2" />
      <link rel="alternate" type="application/pdf" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf" />
      <link rel="alternate" type="application/xml" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml" />
      <link rel="alternate" type="text/plain" href="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt" />
  • 107. <rdf:Description rdf:about="/lccn/sn83030214/1905-01-15/ed-1/seq-25#page">
        <ore:isDescribedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.rdf"/>
        <foaf:depiction rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
        <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2"/>
        <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.txt"/>
        <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf"/>
        <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/ocr.xml"/>
        <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25/thumbnail.jpg"/>
        <rdf:type rdf:resource="http://chroniclingamerica.loc.gov/terms#Page"/>
        <ore:isAggregatedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1#issue"/>
        <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1905-01-15</dcterms:issued>
        <ndnp:sequence rdf:datatype="http://www.w3.org/2001/XMLSchema#long">25</ndnp:sequence>
        <dcterms:title>New-York tribune. - 1905-01-15 - 25</dcterms:title>
      </rdf:Description>
  • 108. OAI-ORE aggregation
  • 109. this is a page
  • 110. it has these files in these formats
  • 111. it is this sequence number
  • 112. it is part of this issue
  • 113. it has this issue date
  • 114. it has this title
  • 115. all explicit concepts
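The statements on the preceding slides (this is a page, it has these files, it is part of this issue, it has this date and title) can be read back out of the resource map mechanically. The sketch below treats the RDF/XML as plain XML using only the standard library, which is a simplification: a real client would more likely use an RDF parser. The function name `describe_page` is hypothetical, and the sample is a trimmed copy of the slide's description wrapped in an `rdf:RDF` element with namespace declarations added so that it parses; those declarations use the standard RDF, OAI-ORE, and Dublin Core Terms namespace URIs and are not shown on the slide.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"

# Trimmed from the slide; the rdf:RDF wrapper and xmlns attributes are added here.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:ore="{ORE}" xmlns:dcterms="{DCTERMS}">
  <rdf:Description rdf:about="/lccn/sn83030214/1905-01-15/ed-1/seq-25#page">
    <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.jp2"/>
    <ore:aggregates rdf:resource="/lccn/sn83030214/1905-01-15/ed-1/seq-25.pdf"/>
    <ore:isAggregatedBy rdf:resource="/lccn/sn83030214/1905-01-15/ed-1#issue"/>
    <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1905-01-15</dcterms:issued>
    <dcterms:title>New-York tribune. - 1905-01-15 - 25</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""


def describe_page(rdf_xml: str) -> dict:
    """Pull the page's files, parent issue, date, and title from RDF/XML.

    Hypothetical helper: it reads only the first rdf:Description element
    and assumes the properties used in SAMPLE are present.
    """
    root = ET.fromstring(rdf_xml)
    description = root.find(f"{{{RDF}}}Description")
    resource = f"{{{RDF}}}resource"
    return {
        "about": description.get(f"{{{RDF}}}about"),
        "files": [e.get(resource) for e in description.findall(f"{{{ORE}}}aggregates")],
        "issue": description.find(f"{{{ORE}}}isAggregatedBy").get(resource),
        "issued": description.findtext(f"{{{DCTERMS}}}issued"),
        "title": description.findtext(f"{{{DCTERMS}}}title"),
    }
```

Because every property is a named, published concept, a consumer needs no site-specific scraping rules: the same few lookups work on any page that exposes the same vocabulary.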
  • 116. all exposed in the app on the web
  • 117. that was new for LC
  • 118. the web is the API
  • 119. the web is the API
  • 120. there’s an API doc...
  • 121. ...it’s just a bunch of links
  • 122. “...make resources available and useful ...” from the mission of the Library
  • 123. “allow it to be... incorporated digitally in the collection” from the LC21 report
  • 124. “...sustain and preserve a universal collection ...” from the mission of the Library
  • 125. each app consistent about meaning
  • 126. follow your nose to concept definitions
  • 127. in our apps and in yours
  • 128. distributed conceptual integration
  • 129. the web is a universal collection
  • 130. this is a way to incorporate digitally
  • 131. our digital artifacts on our web
  • 132. your digital artifacts in your web
  • 133. our digital artifacts in your web
  • 134. your digital artifacts in our web
  • 135. available & useful &c.
  • 136. summary
  • 137. content that scales on the way in
  • 138. apps that scale on the way out
  • 139. movage movage movage
  • 140. transfer inventory workflow all in active development
  • 141. the BagIt spec try it - it works
  • 142. free/open source software releases
  • 143. free data you can use
  • 144. web of data available and useful
  • 145. view source: wdl.org, chroniclingamerica.loc.gov, id.loc.gov, sf.net/projects/loc-xferutils/ (dchud at loc gov - @dchud)