CTS * at LC **
Daniel Chudnov - 2010-10-15 - dchud at loc gov
Access 2010 - Winnipeg
* Content Transfer Services
** Library of Congress
slideshare.net / dchud
work in progress
transfer
verification
 inventory
 reporting
 workflow
notification
   status
   access
hard to show


but that won’t stop me
tinyurl.com/cts2010
when i’m done
 this should
 make sense
NDNP
publishing
            breaking news*

                online

* 100 years after it happens
chroniclingamerica.loc.gov
1,442,264
       pages
last year at Access
2,692,369
       pages
this year at Access
went live
spring 2007
first two years: 1.4M
last year: 2.7M
56 TB content
117 TB in copies
how?
1.
   built a
   better
access system
faster ingest
from 1 month to 1 day
2.
workflow
CTS
does
this
chart: batches and page counts, by month
chart annotations: first batch received 2005-10; went live spring 2007; press event; the gap
chart annotations: first CTS workflow 2009-09; 3-4 month lag; 2-3 month lag; 1-2 month lag; 2010-09
ingest rate
approaches
receipt rate
this makes us
     smile
content
transfer
services
some requirements
LC started
  digitizing
in the 1980s
we have a
   lot
 of stuff
in a
    lot
of places
distributed
 computing
environment
commercial
                    MFT  *

                   license

* Managed File Transfer
buy or build?

why, both, thank you
100s of collections
dozens of
curatorial organizations
lots more
      stuff
coming every day
a
long term project
collecting
     and
making available
services for
 transfer of
   content
any content
lots of transfers
   “movage”
several services
transfer
verification
 inventory
 reporting
 workflow
notification
   status
   access
transfer across

  systems
organizations
    time
content transfer
       is
     risky
copies fail
bits go bad
drives get lost
you forget
what you did
you forget
what you had
people retire
software breaks
hardware breaks
three blizzards
      in
     DC
CTS helps
   make transfers
reliable and resilient
reliable

know when you’ve
   succeeded
BagIt

packing slip
  for data
.
|--   bag-info.txt
|--   bagit.txt
|--   data
|     |-- batch.xml
|     |-- batch_1.xml
|     |-- batch_ne_dewitt_rework...
.
|--   bag-info.txt
|--   bagit.txt                        identifies
                                         a b...
.
|--   bag-info.txt
|--   bagit.txt
|--   data                     where the
|     |-- batch.xml
      ...
.
|--   bag-info.txt
|--   bagit.txt
|--   data
|     |-- batch.xml
|     |-- batch_1.xml
|     |-- batch_ne_dewitt_rework...
71607ad119be88c842268a76f0b6b9e9   data/sn99021999/00206538107/1884091301/0621.pdf
c602d2ac07508059ce5f5597e239b97f   data...
indicates two things
1

  what i think
i’m sending you
2

whether you
 received it
just like
      a
packing slip
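The packing-slip check can be sketched in a few lines of Python. This is a minimal illustration, not the CTS or BIL implementation: the function names `md5_of` and `verify_manifest` are made up for this example, and it assumes a bag laid out as on the slides, with a `manifest-md5.txt` holding one `checksum path` pair per line.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir):
    """Compare each manifest-md5.txt entry against the file on disk.

    Hypothetical helper: returns a dict with three lists of payload
    paths, keyed "ok", "missing", and "mismatched".
    """
    bag_dir = Path(bag_dir)
    result = {"ok": [], "missing": [], "mismatched": []}
    manifest = bag_dir / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<md5 hex digest><whitespace><path relative to the bag>".
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag_dir / rel_path
        if not target.is_file():
            result["missing"].append(rel_path)
        elif md5_of(target) != expected.lower():
            result["mismatched"].append(rel_path)
        else:
            result["ok"].append(rel_path)
    return result
```

The sender's list answers "what i think i'm sending you"; empty `missing` and `mismatched` lists answer "whether you received it".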
works across
   space
works across
  systems
works across
   orgs
works across
   time
easy to make
md5deep
BIL

 BagIt
Library
Bagger

desktop GUI
BIL is free software
Bagger will be soon
sf.net/projects/loc-xferutils/
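To show why a bag is "easy to make", here is a rough Python sketch that writes the two tag files for a payload already sitting under `data/`. It is not how md5deep, BIL, or Bagger work internally; `write_bag_tags` is a made-up name, the `0.97` version string is an assumption (use whichever BagIt version you target), and `bag-info.txt` and `tagmanifest-md5.txt` are left out for brevity.

```python
import hashlib
from pathlib import Path


def write_bag_tags(bag_dir, version="0.97"):
    """Write bagit.txt and manifest-md5.txt for files already under data/.

    Hypothetical helper. Returns the manifest lines that were written.
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    # bagit.txt is the file that identifies the directory as a bag.
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file; fine for a sketch, but large
        # payload files would call for chunked hashing instead.
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-md5.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return lines
```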
see also:
    BagIt
in Wikipedia

    edsu++
reliability
 through
 bagging
resilience
  through
persistence
verify that
copies succeed
know when
 copies fail
repeat until
copies succeed
debug
    &
diagnose
record all of it
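The verify / repeat / record loop above might look roughly like this in Python. It is a sketch under assumptions, not CTS code: `copy_until_verified`, its `max_attempts` default, and the `copier` parameter are all invented for illustration, and a single file stands in for a whole bag.

```python
import hashlib
import shutil


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_until_verified(src, dst, max_attempts=3, copier=shutil.copyfile):
    """Copy src to dst, re-copying until checksums match or attempts run out.

    Hypothetical helper. Returns (succeeded, log); log holds one dict per
    attempt so that every try, good or bad, is recorded.
    """
    expected = md5_of(src)
    log = []
    for attempt in range(1, max_attempts + 1):
        entry = {"attempt": attempt, "ok": False, "error": None}
        try:
            copier(src, dst)
            # Verify the copy instead of trusting that the copy call worked.
            entry["ok"] = md5_of(dst) == expected
            if not entry["ok"]:
                entry["error"] = "checksum mismatch"
        except OSError as exc:
            entry["error"] = str(exc)
        log.append(entry)
        if entry["ok"]:
            return True, log
    return False, log
```

Keeping the failed attempts in the log is what makes "debug & diagnose" possible later.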
know what you have
 know what you did
inventory
BagIt checksums
    in a DB
content properties
project, process, type
event timeline
receipt
  verification
      QR
    copies
 accept/reject
ingest/release
  comments
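An inventory of this kind can be sketched with three tables: bags with their content properties, manifest checksums, and an event timeline. The talk says CTS uses MySQL with Hibernate; the Python/SQLite version below is only a stand-in, and every table, column, and function name in it is invented for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: not the actual CTS data model.
SCHEMA = """
CREATE TABLE bag (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    project TEXT NOT NULL
);
CREATE TABLE bag_file (
    bag_id INTEGER NOT NULL REFERENCES bag(id),
    path TEXT NOT NULL,
    md5 TEXT NOT NULL,
    PRIMARY KEY (bag_id, path)
);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    bag_id INTEGER NOT NULL REFERENCES bag(id),
    kind TEXT NOT NULL,
    succeeded INTEGER NOT NULL,
    comment TEXT,
    occurred_at TEXT NOT NULL
);
"""


def open_inventory(path=":memory:"):
    """Open a database and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_bag(conn, name, project, manifest_lines):
    """Store a bag and the checksum/path pairs from its manifest."""
    cur = conn.execute(
        "INSERT INTO bag (name, project) VALUES (?, ?)", (name, project)
    )
    bag_id = cur.lastrowid
    rows = []
    for line in manifest_lines:
        md5, path = line.split(None, 1)
        rows.append((bag_id, path.strip(), md5))
    conn.executemany(
        "INSERT INTO bag_file (bag_id, path, md5) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return bag_id


def record_event(conn, bag_id, kind, succeeded=True, comment=None):
    """Append one event (receipt, verification, copy, ...) to the timeline."""
    conn.execute(
        "INSERT INTO event (bag_id, kind, succeeded, comment, occurred_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (bag_id, kind, int(succeeded), comment,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def timeline(conn, bag_id):
    """Return the bag's events in the order they were recorded."""
    rows = conn.execute(
        "SELECT kind, succeeded, comment FROM event WHERE bag_id = ? ORDER BY id",
        (bag_id,),
    )
    return [(kind, bool(ok), comment) for kind, ok, comment in rows]
```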
life cycle of
  some set
 of content
basic
           facts


                   project
all the copies
                   details
event timeline
comments along the way
life cycle of
   NDNP
    batch
two key things
1
automated workflow
    using jBPM
this part
process definition
 manages the steps
doesn’t let us forget
2
when content partners
         call
   we can answer
   their questions
reporting

answering our
own questions
annual reports
very important
file counts
overall size
    etc.
used to be
very difficult
to determine
now
immediate
 anytime
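Once file records live in a database, counts and sizes become a single query. The Python/SQLite sketch below illustrates the idea only: the `inventory_file` table, the function names, and the sample rows are all made up, and it says nothing about how CTS actually computes its reports.

```python
import csv
import io
import sqlite3


def build_demo_inventory():
    """Create an in-memory table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory_file "
        "(project TEXT, bag TEXT, path TEXT, size_bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO inventory_file VALUES (?, ?, ?, ?)",
        [
            # Invented rows, for illustration only.
            ("NDNP", "bag_a", "data/0000.tif", 4000),
            ("NDNP", "bag_a", "data/0000.jp2", 1000),
            ("NDNP", "bag_b", "data/0001.pdf", 500),
            ("Web Archives", "bag_c", "data/example.warc", 9000),
        ],
    )
    return conn


def project_report(conn):
    """Return (project, bag count, file count, total bytes) per project."""
    rows = conn.execute(
        "SELECT project, COUNT(DISTINCT bag), COUNT(*), SUM(size_bytes) "
        "FROM inventory_file GROUP BY project ORDER BY project"
    )
    return [tuple(row) for row in rows]


def report_as_csv(report):
    """Render the same report rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["project", "bags", "files", "bytes"])
    writer.writerows(report)
    return buffer.getvalue()
```

The same rows can feed an on-screen table and a CSV export.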
mostly
NDNP

          newer
         partners
also project
reporting /
  planning
NDNP batches - one awardee
NDNP batches - all awardees
 (same data, CSV export)
provides
5000’ view
workflow
working status
 at a glance
a personalized view
overview of a whole project
overview of a system




            overview of a person
not exactly
“Facebook for bags”
     but kinda
but wait,
there’s more
browse live copies
go right to the content
many benefits
aaaand...

a RESTy web API
we can build
complex workflows
        with
    inventory
  and reporting
      in CTS
we can build
QR/workflow/auditing
  outside of CTS
  with inventory
   and reporting
   through CTS
CTS:
   java, spring, mysql
hibernate, velocity, tiles
   jquery, jBPM, jetty
NDNP:
 python, django,
mysql, solr, apache
nice clean interfaces
   nice separation
different coders,
 different styles
same benefits
from using CTS
what’s next?
many more
content collections
now:

    NDNP
 Web Archives
   NDIIPP
Copyright Cards
next:
       P&P
      G&M
      WDL
       AFC
     Twitter
Copyright EDeposit
also coming:

more simple workflows
“Receive and Copy”
fits many use cases

     receive
    bag/verify
 copy to archival
  copy to access
works for recon
works for new stuff
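The four "Receive and Copy" steps can be pictured as an ordered list that runs until something fails and remembers what happened. CTS defines its processes in jBPM; the plain-Python runner below is a stand-in for illustration, and `run_workflow` and `RECEIVE_AND_COPY` are invented names.

```python
# Step names taken from the slide; the step functions themselves are
# supplied by the caller and are purely hypothetical here.
RECEIVE_AND_COPY = ["receive", "bag/verify", "copy to archival", "copy to access"]


def run_workflow(steps, context):
    """Run named steps in order and stop at the first failure.

    steps: list of (name, function) pairs; each function takes the context
    dict and returns True on success.
    Returns (completed, history), where history is a list of
    (name, outcome) pairs for every step that was attempted.
    """
    history = []
    for name, step in steps:
        try:
            ok = bool(step(context))
            outcome = "ok" if ok else "failed"
        except Exception as exc:
            # Typical problems (permissions, storage) surface here.
            ok = False
            outcome = f"error: {exc}"
        history.append((name, outcome))
        if not ok:
            return False, history
    return True, history
```

Because the history names the step that failed, a stalled transfer can be diagnosed and resumed rather than forgotten.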
and,
get past typical problems

       permissions
   insufficient storage
      failed copies
connection
      with
high expectation
and, finally
a UI redesign
thanks!
BagIt - wikipedia

sf.net/projects/loc-xferutils/

    hooray for protovis

@dchud - dchud at loc gov
CTS at LC - Access 2010
CTS at LC, talk given at Access 2010 in Winnipeg.

Published in: Technology, News & Politics
  1. CTS * at LC ** - Daniel Chudnov - 2010-10-15 - dchud at loc gov - Access 2010 - Winnipeg (* Content Transfer Services; ** Library of Congress; follow along at slideshare.net / dchud)
  2. slideshare.net / dchud
  3. work in progress
  4. transfer, verification, inventory, reporting, workflow, notification, status, access
  5. hard to show, but that won’t stop me
  6. tinyurl.com/cts2010
  7. when i’m done this should make sense
  8. NDNP
  9. publishing breaking news* online (* 100 years after it happens)
  10. chroniclingamerica.loc.gov
  11. 1,442,264 pages last year at Access
  12. 2,692,369 pages this year at Access
  13. went live spring 2007
  14. first two years: 1.4M; last year: 2.7M
  15. 56 TB content, 117 TB in copies
  16. how?
  17. 1. built a better access system
  18. faster ingest: from 1 month to 1 day
  19. 2. workflow
  20. CTS does this
  21. chart: batches and page counts, by month
  22. chart annotations: first batch received 2005-10; went live spring 2007; press event; the gap
  23. chart annotations: first CTS workflow 2009-09; 3-4 month lag; 2-3 month lag; 1-2 month lag; 2010-09
  24. ingest rate approaches receipt rate
  25. this makes us smile
  26. content transfer services
  27. some requirements
  28. LC started digitizing in the 1980s
  29. we have a lot of stuff
  30. in a lot of places
  31. distributed computing environment
  32. commercial MFT * license (* Managed File Transfer)
  33. buy or build? why, both, thank you
  34. 100s of collections
  35. dozens of curatorial organizations
  36. lots more stuff coming every day
  37. a long term project
  38. collecting and making available
  39. services for transfer of content
  40. any content
  41. lots of transfers: “movage”
  42. several services
  43. transfer, verification, inventory, reporting, workflow, notification, status, access
  44. transfer across systems, organizations, time
  45. content transfer is risky
  46. copies fail
  47. bits go bad
  48. drives get lost
  49. you forget what you did
  50. you forget what you had
  51. people retire
  52. software breaks
  53. hardware breaks
  54. three blizzards in DC
  55. CTS helps make transfers reliable and resilient
  56. reliable: know when you’ve succeeded
  57. BagIt: packing slip for data
  58. .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf          data in a Bag
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  59. .
      |-- bag-info.txt                  identifies
      |-- bagit.txt                     a bag
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  60. .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data                          where the
      |   |-- batch.xml                 data starts
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |       |   |-- 0001.tif
      |       |   |-- 0001.xml
  61. .
      |-- bag-info.txt
      |-- bagit.txt
      |-- data
      |   |-- batch.xml
      |   |-- batch_1.xml
      |   |-- batch_ne_dewitt_rework
      |   |   |-- 00206538016_batch.xml
      |   |   |-- 00206538028_batch.xml
      |   |   `-- sn99021999
      |   `-- sn99021999
      |       |-- 00206538016
      |       |   |-- 0000.jp2
      |       |   |-- 0000.pdf
      |       |   |-- 0000.tif
      |       |   |-- 0000.xml
      |       |   |-- 0001.jp2
      |       |   |-- 0001.pdf
      |   ...
      |-- manifest-md5.txt              packing
      `-- tagmanifest-md5.txt           slip
  62. 71607ad119be88c842268a76f0b6b9e9   data/sn99021999/00206538107/1884091301/0621.pdf
      c602d2ac07508059ce5f5597e239b97f   data/sn99021999/00206538120/1885100601/0831.xml
      a59795bd1584532d5cbc0b1d82f75cf8   data/sn99021999/00206538016/1880061401/0593.pdf
      3c64fac7e2d49671e0d93908ae42a779   data/sn99021999/00206539616/1888101801/0905.xml
      03158a560baa7479b3805d2b45ee02cd   data/sn99021999/00206538028/1880111501/0405.tif
      fa56ea18580e1446939ed62709e5b2db   data/sn99021999/00206538077/1883061901/1145.pdf
      bf4fb83ff8305e8256970a3466c1a12d   data/sn99021999/00206538120/1885061501/0043.pdf
      8f3649fc812de74b9d9443ee90a8ac9c   data/sn99021999/00206538120/1885111101/1109.tif
      e0b83a7f9ca228271fdaecf6348e1cec   data/sn99021999/00206538120/1885101201/0871.xml
      1c2f84e12792c123ba0aabedd0c0bbad   data/sn99021999/00206538107/1884071401/0197.xml
      080e557fe9f68037605e5b80df4bc4ac   data/sn99021999/0020653820A/1888050701/0543.tif
      532efe32c156459d9d9589caf618f502   data/sn99021999/00206538120/1885071401/0250.tif
      ce607af59a96f2656d9448f38ffda072   data/sn99021999/0020653820A/1888052801/0731.pdf
      60b626d8fd40aca1b425e86a004bb055   data/sn99021999/00206539628/1888111801/0088.xml
      a467cd62350334c7aa83cf1e9056c1c6   data/sn99021999/00206539616/1888091701/0629.jp2
      1a434f7a4d843a2c8ffe8d0824fafc3f   data/sn99021999/00206538028/1880120801/0482.jp2
      22996d89b4a3334256afaddcaa0238d8   data/sn99021999/00206538016/1874102001/0259.jp2
      36f550da273ad4c592fee1761c98322a   data/sn99021999/00206538016/1880052201/0518.jp2
      7f7ccec3f2afae896338498372fd476e   data/sn99021999/00206539616/1888080101/0200.pdf
      c247a5d74d0e7f857c534d935661adbe   data/sn99021999/00206538107/1884072601/0286.jp2
      4d497a18a154adcc8636239378ab340b   data/sn99021999/00206539628/1889021101/0868.pdf
      2e8ca2558b54b5c49b2f20a355a60895   data/sn99021999/00206538065/1882092001/0136.xml
      fb71493048e5010100f18012f5060d42   data/sn99021999/00206538028/1880123001/0569.xml
      40b100432890b055a5defbfbea815d57   data/sn99021999/00206538107/1884090901/0590.xml
      46f6d61480dadc1c988b0baa4de8b6c4   data/sn99021999/00206539628/1888122801/0463.pdf
      1cb8af0648e8c9df395b63226fe7371f   data/sn99021999/00206538016/1874101501/0244.pdf
      9257834023c683b02f354888b2740b8f   data/sn99021999/00206539616/1888102301/0956.xml
      0d52b3b2b1c5459b7e8d500a8566b0bf   data/sn99021999/00206538120/1885080801/0425.tif
  63. indicates two things
  64. 1: what i think i’m sending you
  65. 2: whether you received it
  66. just like a packing slip
  67. works across space
  68. works across systems
  69. works across orgs
  70. works across time
  71. easy to make
  72. md5deep
  73. BIL: BagIt Library
  74. Bagger: desktop GUI
  75. BIL is free software; Bagger will be soon
  76. sf.net/projects/loc-xferutils/
  77. see also: BagIt in Wikipedia (edsu++)
  78. reliability through bagging
  79. resilience through persistence
  80. verify that copies succeed
  81. know when copies fail
  82. repeat until copies succeed
  83. debug & diagnose
  84. record all of it
  85. know what you have, know what you did
  86. inventory
  87. BagIt checksums in a DB
  88. content properties: project, process, type
  89. event timeline
  90. receipt, verification, QR, copies, accept/reject, ingest/release, comments
  91. life cycle of some set of content
  92. basic facts; project; all the copies; details
  93. event timeline
  94. comments along the way
  95. life cycle of NDNP batch
  96. two key things
  97. 1: automated workflow using jBPM
  98. this part
  99. process definition manages the steps, doesn’t let us forget
  100. 2: when content partners call we can answer their questions
  101. reporting: answering our own questions
  102. annual reports: very important
  103. file counts, overall size, etc.
  104. used to be very difficult to determine
  105. now: immediate, anytime
  106. mostly NDNP, newer partners
  107. also project reporting / planning
  108. NDNP batches - one awardee
  109. NDNP batches - all awardees (same data, CSV export)
  110. provides 5000’ view
  111. workflow
  112. working status at a glance
  113. a personalized view
  114. overview of a whole project
  115. overview of a system; overview of a person
  116. not exactly “Facebook for bags” but kinda
  117. but wait, there’s more
  118. browse live copies
  119. go right to the content
  120. many benefits
  121. aaaand... a RESTy web API
  122. we can build complex workflows with inventory and reporting in CTS
  123. we can build QR/workflow/auditing outside of CTS with inventory and reporting through CTS
  124. CTS: java, spring, mysql, hibernate, velocity, tiles, jquery, jBPM, jetty
  125. NDNP: python, django, mysql, solr, apache
  126. nice clean interfaces, nice separation
  127. different coders, different styles
  128. same benefits from using CTS
  129. what’s next?
  130. many more content collections
  131. now: NDNP, Web Archives, NDIIPP, Copyright Cards
  132. next: P&P, G&M, WDL, AFC, Twitter, Copyright EDeposit
  133. also coming: more simple workflows
  134. “Receive and Copy”
  135. fits many use cases: receive, bag/verify, copy to archival, copy to access
  136. works for recon, works for new stuff
  137. and, get past typical problems: permissions, insufficient storage, failed copies
  138. connection with high expectation
  139. and, finally: a UI redesign
  140. thanks!
  141. BagIt - wikipedia; sf.net/projects/loc-xferutils/; hooray for protovis; @dchud - dchud at loc gov
