20120822 conversion of historic newspapers to digital objects [russian state library]
Published in: News & Politics, Technology
Transcript

  • 1. conversion of newspapers to digital objects, digital data preservation, and other interesting things. Frederick Zarndt, Chair, IFLA Newspapers Section, frederick@frederickzarndt.com. Tuesday, August 21, 12
  • 2. why digitize newspapers? [slide background: uncorrected OCR text of a historic newspaper page]
  • 3. reading rooms by the numbers (monthly averages). Photo by DAVID ILIFF. License: CC-BY-SA 3.0
      Country     | Population  | Reading room visitors | Newspaper requests: microform | Newspaper requests: print
      Australia   | 22,876,000  | 5,130                 | 345                           | 240
      France      | 65,350,000  | 3,000                 | 2,000                         | 1,000
      Netherlands | 16,847,000  | NA                    | NA                            | NA
      New Zealand | 4,414,000   | NA                    | NA                            | NA
      Norway      | 4,985,000   | 600                   | 400                           | NA
      Singapore   | 5,184,000   | NA                    | 300                           | NA
      UK          | 62,262,000  | 2,000                 | 6,900                         | 4,816
      USA         | 313,292,000 | NA                    | NA                            | NA
  • 4. digitised newspapers by the numbers (digitised historical newspapers, monthly averages)
      Population  | Unique visitors | Genealogist | Other | User age
      22,876,000  | 150,000         | 50%         | 50%   | >55
      37,692,000  | 12,800          | 65%         | 35%   | >50
      5,405,000   | NA              | NA          | NA    | ?
      65,350,000  | 22,000          | NA          | NA    | ?
      16,847,000  | 50,000          | NA          | NA    | ?
      4,414,000   | 83,333          | 50%         | NA    | >50
      4,985,000   | 1,500           | NA          | NA    | ?
      5,184,000   | 12,400          | NA          | NA    | ?
      62,262,000  | NA              | NA          | NA    | ?
      313,292,000 | NA              | NA          | NA    | ?
  • 5. physical versus digital (monthly averages)
      Population  | Requests for newspapers (paper + microform) | Digitised historical newspapers (unique visitors)
      22,876,000  | 585                                         | 150,000
      37,692,000  | NA                                          | 12,800
      5,405,000   | NA                                          | NA
      65,350,000  | 3,000                                       | 22,000
      16,847,000  | NA                                          | 50,000
      4,414,000   | NA                                          | 83,333
      4,985,000   | 400                                         | 1,500
      5,184,000   | 300                                         | 12,400
      62,262,000  | 11,716                                      | NA
      313,292,000 | NA                                          | NA
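To make the contrast on this slide concrete, the sketch below divides digital unique visitors by physical requests for the rows where both figures are given. It is only an illustration: the two columns count different things (visitors versus requests), so the ratio is a rough indicator, and the function and variable names are assumptions introduced here, not part of the deck.

```python
# Monthly averages copied from the slide: population -> (physical requests, digital unique visitors).
# Rows with NA in either column are omitted.
ROWS = {
    22_876_000: (585, 150_000),
    65_350_000: (3_000, 22_000),
    4_985_000: (400, 1_500),
    5_184_000: (300, 12_400),
}


def digital_to_physical_ratio(requests: int, visitors: int) -> float:
    """Return the number of digital unique visitors per physical request."""
    return visitors / requests


for population, (requests, visitors) in ROWS.items():
    ratio = digital_to_physical_ratio(requests, visitors)
    print(f"population {population:>11,}: {ratio:6.1f} digital visitors per physical request")
```

For the first row this gives roughly 256 digital visitors for every physical request, which is the kind of gap the slide is pointing at.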
  • 6. more numbers! (digitised historical newspapers, monthly averages)
      Population  | Collection name              | ~Size [pages] | Unique visitors | Genealogist | Other | Lines corrected | User age
      22,876,000  | Trove                        | 5,000,000     | 150,000         | 50%         | 50%   | 220,000         | >55
      37,692,000  | CDNC                         | 495,000       | 12,800          | 65%         | 35%   | 31,000          | >50
      5,405,000   | Historical Newspaper Library | 2,000,000     | NA              | NA          | NA    | NA              | ?
      65,350,000  | Gallica                      | 2,200,000     | 22,000          | NA          | NA    | NA              | ?
      16,847,000  | Historische Kranten          | 5,000,000     | 50,000          | NA          | NA    | NA              | ?
      4,414,000   | Papers Past                  | 2,213,000     | 83,333          | 50%         | NA    | NA              | >50
      4,985,000   | NBDigital Aviser             | 8,100,000     | 1,500           | NA          | NA    | NA              | ?
      5,184,000   | Newspaper SG                 | 2,400,000     | 12,400          | NA          | NA    | NA              | ?
      62,262,000  | British Newspaper Archive    | 4,880,000     | NA              | NA          | NA    | NA              | ?
      313,292,000 | Chronicling America          | 4,100,000     | NA              | NA          | NA    | NA              | ?
  • 7. what is Alexa? • Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon. • Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat. • Alexa's operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive, accessible through the Wayback Machine. • Alexa continually crawls all publicly available websites to create a series of snapshots of the web. • Alexa gathers information from a variety of sources to provide key statistics about each site on the web, such as Traffic Rank, number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).
  • 8. definitions • A PageView is a request for a file whose type is defined as a page. • A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (e.g. day, week or month). A Unique Visitor counts once within the timescale. • A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request. • Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between. • World | Country Rank is a function of the average daily unique visits and the number of unique pages requested. Definitions adapted from Wikipedia, http://en.wikipedia.org/wiki/Web_analytics
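The definitions above can be sketched as a small computation over a request log. This is a minimal illustration, not how Alexa or any particular analytics product works: the log format (client id, timestamp in seconds, page) and the function name are assumptions, and a bounce is approximated as a visit containing exactly one page request.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # a Visit ends once more than 30 minutes pass between requests


def summarize(requests):
    """Summarize an iterable of (client_id, timestamp_seconds, page) tuples.

    Returns (page_views, unique_visitors, visits, bounce_rate).
    """
    by_client = defaultdict(list)
    for client_id, timestamp, page in requests:
        by_client[client_id].append((timestamp, page))

    page_views = 0
    visits = []  # each visit is the list of pages requested during it
    for hits in by_client.values():
        hits.sort()  # order each client's requests by time
        previous = None
        for timestamp, page in hits:
            page_views += 1
            if previous is None or timestamp - previous > SESSION_GAP_SECONDS:
                visits.append([])  # gap too long: start a new visit
            visits[-1].append(page)
            previous = timestamp

    bounces = sum(1 for pages in visits if len(pages) == 1)
    bounce_rate = bounces / len(visits) if visits else 0.0
    return page_views, len(by_client), len(visits), bounce_rate


# Hypothetical log: client "a" returns after a long pause, client "b" views one page.
log = [("a", 0, "/"), ("a", 600, "/search"), ("a", 4000, "/"), ("b", 100, "/")]
page_views, unique_visitors, visit_count, bounce_rate = summarize(log)
print(page_views, unique_visitors, visit_count, round(bounce_rate, 2))
```

In the sample log, client "a" produces two visits because the third request arrives more than 30 minutes after the second, so two of the three visits are single-page bounces.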
  • 9. Alexa ranking world view (Alexa 3 month trailing averages, 2-Apr-2012)
      Population  | Website                                   | World rank [Lo is good]
      313,292,000 | http://www.loc.gov/index.html/            | 3,122
      22,876,000  | http://trove.nla.gov.au/                  | 16,700
      65,350,000  | http://www.bnf.fr/                        | 17,096
      62,262,000  | http://www.bl.uk/                         | 27,079
      4,414,000   | http://www.natlib.govt.nz/                | 123,976
      62,262,000  | http://www.britishnewspaperarchive.co.uk/ | 155,259
      16,847,000  | http://www.kb.nl/                         | 155,363
      5,184,000   | http://www.nl.sg/                         | 156,610
      4,985,000   | http://www.nb.no/                         | 189,940
      5,405,000   | http://www.nationallibrary.fi/            | 3,212,803
  • 10. Alexa ranking country view (Alexa 3 month trailing averages, 2-Apr-2012)
      Population  | Website                                   | World rank [Lo is good] | Country rank [Lo is good]
      5,405,000   | http://www.nationallibrary.fi/            | 3,212,803               | 199
      22,876,000  | http://www.nla.gov.au/                    | 16,700                  | 375
      4,414,000   | http://www.natlib.govt.nz/                | 123,976                 | 515
      65,350,000  | http://www.bnf.fr/                        | 17,096                  | 727
      4,985,000   | http://www.nb.no/                         | 189,940                 | 891
      313,292,000 | http://www.loc.gov/index.html/            | 3,122                   | 1,011
      5,184,000   | http://www.nl.sg/                         | 156,610                 | 1,208
      62,262,000  | http://www.bl.uk/                         | 27,079                  | 2,245
      16,847,000  | http://www.kb.nl/                         | 155,363                 | 3,450
      62,262,000  | http://www.britishnewspaperarchive.co.uk/ | 155,259                 | 15,692
  • 11. where visitors go (Alexa 3 month trailing averages, 2-Apr-2012)
      Population  | World rank [Lo is good] | Country rank [Lo is good] | Where visitors go [sub-domain]
      5,405,000   | 3,212,803               | 199                       | NA
      22,876,000  | 16,700                  | 375                       | http://trove.nla.gov.au/ 57.2%
      4,414,000   | 123,976                 | 515                       | http://paperspast.natlib.govt.nz/ 50.9%
      65,350,000  | 17,096                  | 727                       | http://gallica.bnf.fr/ 52.0%
      4,985,000   | 189,940                 | 891                       | NA
      313,292,000 | 3,122                   | 1,011                     | http://chroniclingamerica.loc.gov/ 4.8%
      5,184,000   | 156,610                 | 1,208                     | http://newspapers.nl.sg/ 28.0%
      62,262,000  | 27,079                  | 2,245                     | http://newspapers11.bl.uk/blcs/ 2.5%
      16,847,000  | 155,363                 | 3,450                     | http://kranten.kb.nl/ 22.4%
      62,262,000  | 155,259                 | 15,692                    | NA
  • 12. lots of numbers (sorted by time on site) (Alexa 3 month trailing averages, 2-Apr-2012)
      Website                                   | Speed [Hi is good] | Bounce rate [Lo is good] | Reputation [Hi is good] | Page views per visitor [Hi is good] | Time on site [Hi is good]
      http://www.britishnewspaperarchive.co.uk/ | 51%                | 28%                      | 485                     | 13.0                                | 11m 40s
      http://www.bnf.fr/                        | 71%                | 35%                      | 13,744                  | 14.9                                | 8m 30s
      http://www.natlib.govt.nz/                | 96%                | 44%                      | 2,480                   | 5.3                                 | 6m 49s
      http://trove.nla.gov.au/                  | 42%                | 55%                      | 9,514                   | 5.4                                 | 4m 52s
      http://www.loc.gov/index.html/            | 67%                | 51%                      | 91,331                  | 5.3                                 | 3m 55s
      http://www.kb.nl/                         | 89%                | 54%                      | 3,295                   | 5.0                                 | 3m 42s
      http://www.bl.uk/                         | 54%                | 52%                      | 16,191                  | 3.8                                 | 3m 2s
      http://www.nb.no/                         | 59%                | 47%                      | 1,579                   | 3.0                                 | 2m 57s
      http://www.nationallibrary.fi/            | NA                 | 54%                      | 199                     | 3.1                                 | 2m 6s
      http://www.nl.sg/                         | 72%                | 65%                      | 802                     | 2.0                                 | 2m 4s
  • 13. even more numbers (sorted by time on site) (Alexa 3 month trailing averages, 2-Apr-2012)
      Website                                   | Speed [Hi is good] | Bounce rate [Lo is good] | Reputation [Hi is good] | Page views per visitor [Hi is good] | Time on site [Hi is good]
      http://www.ancestry.com/                  | 32%                | 24%                      | 20,055                  | 29.9                                | 23m 54s
      http://www.familysearch.org/              | 50%                | 18%                      | 9,832                   | 15.8                                | 16m 19s
      http://www.britishnewspaperarchive.co.uk/ | 51%                | 28%                      | 485                     | 13.0                                | 11m 40s
      http://www.bnf.fr/                        | 71%                | 35%                      | 13,744                  | 14.9                                | 8m 30s
      http://www.natlib.govt.nz/                | 96%                | 44%                      | 2,480                   | 5.3                                 | 6m 49s
      http://trove.nla.gov.au/                  | 42%                | 55%                      | 9,514                   | 5.4                                 | 4m 52s
      http://www.loc.gov/index.html/            | 67%                | 51%                      | 91,331                  | 5.3                                 | 3m 55s
      http://www.kb.nl/                         | 89%                | 54%                      | 3,295                   | 5.0                                 | 3m 42s
      http://www.bl.uk/                         | 54%                | 52%                      | 16,191                  | 3.8                                 | 3m 2s
      http://www.nb.no/                         | 59%                | 47%                      | 1,579                   | 3.0                                 | 2m 57s
      http://www.nationallibrary.fi/            | NA                 | 54%                      | 199                     | 3.1                                 | 2m 6s
      http://www.nl.sg/                         | 72%                | 65%                      | 802                     | 2.0                                 | 2m 4s
  • 14. why digitize newspaper collections? Digital newspapers enable broader, easier, and faster access.
  • 15. considerations in newspaper digitization
  • 16. selection criteria • importance of title • complete (no missing issues) • temporal coverage • research value • quality / fragility of original documents • quality of microfilm • etc. (other local criteria)
  • 17. page-level versus article-level newspaper digitization
      Level         | Cost | Production difficulty | Copyright management | Usability | Accessibility
      page-level    | $    | easy                  | usually simple       | low       | good
      article-level | $$$  | hard                  | usually complex      | excellent | excellent
  • 18. preservation, access, administration: Open Archival Information System (OAIS) reference model
  • 19. the digitization process [diagram; labels as extracted: ingest, image, digitization, text, preservation, production, images, magic, objects, access, access]
  • 20. standard file formats • image file formats • TIFF • JPEG2000 • JPEG • GIF • text file formats • PDF, PDF/A, PDF/A-1b, PDF/A-1a • TEI XML • HTML • plain text • NITF / NewsML • metadata • METS • MODS / PREMIS / ALTO / MIX ...
  • 21. image decisions • image production source materials • original documents: better quality, more expensive • microfiche: poorer quality, less expensive, microfiche quality varies • bit depth • black-and-white (bitonal) • greyscale • color • resolution • compression • no compression • lossless (reversible) • lossy (irreversible) • image metadata
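The bit depth and compression choices above matter because newspaper page images are large. As a rough illustration, the sketch below estimates the uncompressed size of one page at the three bit depths listed. The page dimensions (7136 × 9224 pixels) are taken from the ALTO sample later in this deck; the function name and the mapping of "color" to 24 bits per pixel are assumptions made for the example.

```python
def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Size of the raw pixel data in bytes, ignoring file headers and metadata."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: the Page element of the ALTO sample (WIDTH="7136" HEIGHT="9224").
WIDTH, HEIGHT = 7136, 9224

for label, bits in [("bitonal", 1), ("greyscale", 8), ("color", 24)]:
    size_mb = uncompressed_bytes(WIDTH, HEIGHT, bits) / 1_000_000
    print(f"{label:>9}: {size_mb:7.1f} MB uncompressed")
```

Under these assumptions a single greyscale page is about 66 MB before compression and a 24-bit color page about 197 MB, which is why the lossless versus lossy decision has a direct effect on storage cost.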
  • 22. image format comparison • JBIG (.jbig, .jbg): compression lossless; bit depth 1-bit; color management no; metadata no; 1st public release 2000? • JPEG (.jpg, .jpeg): compression lossy, DCT, RLE, Huffman; bit depth 8-bit, 12-bit, 24-bit; color management yes; metadata yes; mime type image/jpeg, public.jpeg; patent no; 1st public release 1992 • JPEG2000 (.jp2): compression many lossless and lossy compression algorithms; bit depth 8-bit, 16-bit, color to 48 bits; color management yes; metadata yes; mime type image/jp2, public.jpeg200; patent yes but part 1 is patent free; 1st public release 2000 • TIFF (.tiff, .tif): compression none, LZW, RLE, ZIP, Other; bit depth 1, 2, 4, 8, 16, 24, 32 bits; color management yes; metadata yes; mime type image/tiff, public.tiff; patent no; 1st public release 1986. Source: Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)
  • 23. digital library standards • METS XML for descriptive, structural, technical, and administrative metadata • descriptive metadata • Metadata Object Description Schema (MODS): selected metadata from MARC • Dublin Core: fundamental group of text elements for describing and cataloging • technical metadata • ALTO for OCR text • PREMIS for digital preservation • MIX and ANSI/NISO Z39.87 for images
  • 24. Metadata Encoding and Transmission Standard • METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library • METS files consist of 7 (optional) sections: header, descriptive, administrative, file map, structural map, structural link, and behavior • METS profiles describe a class of METS documents in sufficient detail to give both document authors and programmers the guidance to create and process METS documents conforming to a particular profile • current version 1.9.1 • administered by METS editorial board (international group of volunteers) • standards hosted by Library of Congress at http://www.loc.gov/standards/mets/
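As a rough illustration of the seven sections listed above, the sketch below reports which of them appear as direct children of a METS root element. The mapping from the slide's section names to METS element names (metsHdr, dmdSec, amdSec, fileSec, structMap, structLink, behaviorSec) and the tiny sample document are assumptions added for this example; a real METS file would be validated against the schema or a profile rather than inspected this way.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Slide wording -> METS element name (mapping assumed for this sketch).
SECTIONS = {
    "header": "metsHdr",
    "descriptive": "dmdSec",
    "administrative": "amdSec",
    "file map": "fileSec",
    "structural map": "structMap",
    "structural link": "structLink",
    "behavior": "behaviorSec",
}


def sections_present(mets_xml: str) -> dict:
    """Return {section name: True/False} for direct children of the METS root."""
    root = ET.fromstring(mets_xml)
    return {
        name: root.find(f"{{{METS_NS}}}{tag}") is not None
        for name, tag in SECTIONS.items()
    }


# Hypothetical, heavily reduced METS document.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:dmdSec ID="issue-1"/>
  <mets:amdSec/>
  <mets:structMap/>
</mets:mets>"""

for name, found in sections_present(SAMPLE).items():
    print(f"{name:>15}: {'present' if found else 'absent'}")
```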
  • 25. METS file structure. Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.
  • 26. Metadata Object Description Schema • MODS is an XML schema for a bibliographic element set that may be used for library applications. It is a derivative of the MARC 21 bibliographic format and includes a subset of MARC fields, using language-based tags rather than numeric ones • Subset of MARC 21 • Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist) • May be used in conjunction with METS XML • current version 3.4 • administered by Library of Congress Network Development and MARC Standards Office with help from interested users • standards hosted by Library of Congress at http://www.loc.gov/standards/mods/
  • 27. MODS metadata in METS XML
    <mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
      <mets:mdWrap MDTYPE="MODS">
        <mets:xmlData>
          <mods:mods xmlns="http://www.loc.gov/mods/v3">
            <mods:language>
              <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
            </mods:language>
            <mods:genre>newspaper issue</mods:genre>
            <mods:originInfo>
              <mods:dateIssued>18740425</mods:dateIssued>
            </mods:originInfo>
            <mods:relatedItem type="host">
              <mods:titleInfo>
                <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
              </mods:titleInfo>
              <mods:genre>newspaper</mods:genre>
              <mods:identifier>ISSN18368190</mods:identifier>
              <mods:part>
                <mods:detail type="volume">
                  <mods:number>IX</mods:number>
                </mods:detail>
              </mods:part>
              <mods:part>
                <mods:detail type="issue">
                  <mods:number>12</mods:number>
                </mods:detail>
              </mods:part>
            </mods:relatedItem>
          </mods:mods>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:dmdSec>
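A record like the one on this slide can be read with any namespace-aware XML parser. The sketch below uses Python's standard library to pull out the issue date, host title, and ISSN. The sample is a shortened copy of the slide's record; the xmlns:mets and xmlns:mods declarations on the root element are assumptions added so the fragment parses on its own, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Shortened copy of the slide's record; namespace declarations added for this sketch.
SAMPLE = """<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
             xmlns:mods="http://www.loc.gov/mods/v3"
             ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:originInfo><mods:dateIssued>18740425</mods:dateIssued></mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:identifier>ISSN18368190</mods:identifier>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>"""


def describe_issue(dmd_sec_xml: str) -> dict:
    """Extract a few descriptive fields from a METS dmdSec that wraps MODS."""
    root = ET.fromstring(dmd_sec_xml)
    mods = root.find("mets:mdWrap/mets:xmlData/mods:mods", NS)
    return {
        "id": root.get("ID"),
        "date_issued": mods.findtext("mods:originInfo/mods:dateIssued", namespaces=NS),
        "title": mods.findtext("mods:relatedItem/mods:titleInfo/mods:title", namespaces=NS),
        "issn": mods.findtext("mods:relatedItem/mods:identifier", namespaces=NS),
    }


print(describe_issue(SAMPLE))
```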
  • 28. Dublin Core metadata • Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery. • Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85 • Metadata terms last updated 14-Jun-2012 • May be used in conjunction with METS XML • Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore • Dublin Core Metadata Initiative is hosted at http://dublincore.org/
  • 29. Analyzed Layout and Text Object • ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper • commonly used in conjunction with METS XML but may be used standalone • current version 2.0 • administered by ALTO editorial board (international group of volunteers) • standards hosted by Library of Congress at http://www.loc.gov/standards/alto/
  • 30. Analyzed Layout and Text Object
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
    </sourceImageInformation>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
    <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
      <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
      <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
      <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
      <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
      <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
        <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
          <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
            <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
              <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
                <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
                <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
                <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
          <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
          <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
              <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
                <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
                <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
                <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
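The word text and word confidence (WC) values in an ALTO file like the one above can be read back with a few lines of code. The following is a minimal sketch, not part of the original slides: it assumes a non-namespaced ALTO file (as in the ALTO 1.4 excerpt), and the function names `alto_lines` and `mean_confidence` and the shortened `SAMPLE` document are illustrative only. Namespaced ALTO files would need the namespace included in the tag names.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample modelled on the slide's excerpt (assumption:
# no XML namespace, as in the ALTO 1.4 file shown).
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace ID="P1_PS00001">
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98"/>
            <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
            <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_lines(xml_text):
    """Return one list per TextLine, each holding (word, confidence) pairs."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_line in root.iter("TextLine"):
        words = [(s.get("CONTENT"), float(s.get("WC", "0")))
                 for s in text_line.iter("String")]
        lines.append(words)
    return lines


def mean_confidence(lines):
    """Average the WC values over all words; 0.0 if there are no words."""
    scores = [wc for line in lines for _, wc in line]
    return sum(scores) / len(scores) if scores else 0.0


lines = alto_lines(SAMPLE)
text = [" ".join(word for word, _ in line) for line in lines]
print(text, round(mean_confidence(lines), 3))
```

The HPOS, VPOS, WIDTH, and HEIGHT attributes on each String can be read the same way when word coordinates are needed, for example to highlight search hits on the page image.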
  • 31. Analyzed Layout and Text Object: book
  • 32. Analyzed Layout and Text Object: newspaper
  • 33. Preservation Metadata Implementation Strategies • PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use • In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group, composed of international experts in the use of metadata to support digital preservation activities • PREMIS data dictionary current version 2.2 • May be used in conjunction with METS XML • PREMIS tools are freely available • PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry • PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/
  • 34. PREMIS data in METS file
<mets:amdSec>
  <mets:techMD ID="PREMISOBJECT1">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
            <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCategory>file</premis:objectCategory>
          <premis:objectCharacteristics>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>TIFF</premis:formatName>
                <premis:formatVersion>TIFF 6.0</premis:formatVersion>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
          <premis:relationship>
            <premis:relationshipType>derivation</premis:relationshipType>
            <premis:relationshipSubType>is derivative of</premis:relationshipSubType>
            <premis:relatedObjectIdentification>
              <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType>
              <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue>
              <premis:relatedObjectSequence>0</premis:relatedObjectSequence>
            </premis:relatedObjectIdentification>
            <premis:relatedEventIdentification>
              <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType>
              <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue>
              <premis:relatedEventSequence>0</premis:relatedEventSequence>
            </premis:relatedEventIdentification>
          </premis:relationship>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
</mets:amdSec>
  • 35.
  • 36. the digitization process ingest image digitization text preservation production images magic objects access access
  • 37. digitization magic: images → digitization magic → objects
  • 38. digitization magic: images → image processing → layout analysis → OCR → metadata → build digital objects → objects
  • 39. digitization magic: image processing • crop, de-skew, split images • apply image improvement algorithms as needed (sharpening filters, local adaptive thresholding, removal of text bleed-through, etc.) • create master images • create working images
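Local adaptive thresholding, one of the image improvement steps listed above, can be illustrated with a small sketch. This is an assumption-laden toy, not the algorithm any particular digitization product uses: it compares each pixel with the mean of its neighbourhood, and the function name `adaptive_threshold` and the `radius` and `offset` parameters are made up for the example. Production systems would normally rely on an image processing library.

```python
def adaptive_threshold(image, radius=1, offset=0):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel is marked 1 (ink) when it is darker than the mean of its
    (2*radius+1)-square neighbourhood minus `offset`, otherwise 0.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


# Unevenly lit strip: bright paper (200) with a stroke (120) on the left,
# darker paper (100) with a stroke (40) on the right.
page = [[200, 120, 200, 100, 40, 100]] * 3
binary = adaptive_threshold(page, radius=1, offset=20)
print(binary[0])
```

In this sample no single global threshold separates ink from paper, because the left-hand stroke (120) is brighter than the right-hand paper (100); comparing against the local mean recovers both strokes.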
  • 40. digitization magic: layout analysis • analyze layout of text image • estimate font types and sizes • calculate coordinates of text blocks • determine layout object types (text, illustration, headline, etc.)
  • 41. digitization magic: OCR • perform optical character recognition (OCR) • calculate word and character coordinates • calculate word and character confidences • apply language dictionaries • correct OCR text (optional)
  • 42. digitization magic: metadata • populate metadata fields • verify / correct page numbers • verify / correct document structure
  • 43. digitization magic: build digital objects • create METS / ALTO XML files • create image files and image metadata • create PDF files (if required) • verify digital object • calculate file fixity checks (checksums) • perform file validation and verification • perform quality assurance
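The fixity step above can be sketched as a checksum manifest that is written when the digital object is built and re-checked later. This is a minimal illustration under stated assumptions: SHA-256 is chosen here for the example (the slides mention MD5 and SHA-1), and the function names `file_checksum`, `build_manifest`, and `audit`, as well as the file name `page-0001.xml`, are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large master images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="sha256"):
    """Map each file's relative path to its checksum."""
    root = Path(directory)
    return {p.relative_to(root).as_posix(): file_checksum(p, algorithm)
            for p in sorted(root.rglob("*")) if p.is_file()}


def audit(directory, manifest, algorithm="sha256"):
    """Return the names of files that are missing or no longer match."""
    current = build_manifest(directory, algorithm)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page-0001.xml"
    page.write_bytes(b"<alto/>")
    manifest = build_manifest(tmp)        # recorded when the object is built
    clean_audit = audit(tmp, manifest)    # nothing has changed yet
    page.write_bytes(b"<alto>")           # simulate a damaged file
    damaged_audit = audit(tmp, manifest)

print(clean_audit, damaged_audit)
```

Storing the manifest alongside the object (for example in the METS file) lets any later holder of the files repeat the audit.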
  • 44. real world digitization production workflow • automatic production steps performed by software • manual production steps performed by operators
  • 45. newspaper digitization programs around the world • National Library of Finland (http://digi.kansalliskirjasto.fi/) • British Newspaper Archives, British Library (http://www.bl.uk/welcome/newspapers) • National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/) • National Library of New Zealand (http://paperspast.natlib.govt.nz/) • National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper) • Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/) • Singapore National Library Board (http://newspapers.nl.sg/) • Bibliothèque nationale de France (http://gallica.bnf.fr/) • Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/) • National Library of Latvia (https://periodika.lndb.lv/)
  • 46. image references and recommendations • Ian Bogus et al. Minimum Digitization Capture Recommendations (draft). The Association for Library Collections and Technical Services. June 2012 (accessed 18 August 2012 at http://connect.ala.org/node/185648). • Robert Buckley and Simon Tanner. JPEG 2000 as a Preservation and Access Format for the Wellcome Trust Digital Library. Xerox Corporation and King’s College Digital Consultancy for the Wellcome Trust Library. August 2009 (accessed 1 July 2012 at http://library.wellcome.ac.uk/assets/wtx056572.pdf). • Paolo Buonora and Franco Liberati. A Format for Digital Preservation of Images: A Study on JPEG 2000 File Robustness. D-Lib Magazine. July/August 2008 (accessed 1 July 2012 at http://www.dlib.org/dlib/july08/buonora/07buonora.html). • ANSI/NISO Z39.87-2006. Data Dictionary -- Technical Metadata for Digital Still Images. National Information Standards Organization, Bethesda, Maryland USA. December 2006 (accessed 1 August 2012 at http://www.niso.org/apps/group_public/download.php/6502/Data%20Dictionary%20-%20Technical%20Metadata%20for%20Digital%20Still%20Images.pdf). • JBIG Standard (accessed 1 August 2012 at http://www.jpeg.org/jbig). • JPEG Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg). • JPEG2000 Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg2000/). • TIFF 6.0 Standard (accessed 1 August 2012 at http://partners.adobe.com/public/developer/tiff). • Many, many others....
  • 47. newspaper digitisation references • Australian Newspapers Digitisation Program https://www.nla.gov.au/ndp/ • Europeana Newspapers http://www.europeana-newspapers.eu/ • IFLA Newspapers Section http://www.ifla.org/en/newspapers • IMPACT Centre of Competence http://www.digitisation.eu/ • Koninklijke Bibliotheek Historische Kranten (the Netherlands) http://kranten.kb.nl/about • Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/
  • 48. Russian language periodicals • METS/ALTO XML with JPEG2000 images http://bit.ly/russianperiodicals • Try crowdsourcing when you visit the URL above! • Learn more about the software and crowdsourcing at http://www.dlconsulting.com.
  • 49. ?
  • 50. Part 2 Short and simple: An overview of digital preservation
  • 51. digital preservation • Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.
  • 52. preservation Open Archival Information System (OAIS) reference model
  • 53. digitization
  • 54. digitization digital preservation
  • 55. digitization ≠ digital preservation
  • 56. digitization ≠ digital preservation !
  • 57. Vint Cerf on “bit rot”
  • 58. digital preservation • long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required
  • 59. tolerance for downtime? tolerance for data loss? • 99.999% availability required? • length of downtime tolerated? • what is the value of the data? • is the data reproducible? at what cost? • what is the mean time to data loss (MTTDL)? • what is
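To make the availability question concrete, an availability percentage can be converted into the downtime it permits. The sketch below is plain arithmetic; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for the example.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted in the period at a given availability."""
    return (1 - availability_percent / 100) * period_days * 24 * 60


for target in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

On these assumptions, "five nines" leaves a little over five minutes per year, which is why the length of downtime that can actually be tolerated should be settled before a target is written into requirements.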
  • 60. availability threats • communications failure • internet attacks / vandalism • hardware failure • software failure • power failure • natural disaster • etc ...
  • 61. communication failure • redundant, multiple communications channels from independent providers
  • 62. internet attacks / vandalism • denial of service • viruses, worms • data vandalism • website vandalism
  • 63. hardware failure • hot standby redundant hardware • cold standby redundant hardware • backup and restore
  • 64. software failure • rollback to known working software (some downtime) • known working software on standby redundant hardware (little downtime) • backup and restore (significant downtime)
  • 65. power failure • uninterruptible power supply
  • 66. natural disaster • alternate data center • backup and restore
  • 67. digital data risks • standards / format obsolescence • migration to new format, media, or hardware • media obsolescence / decay • bit rot
  • 68. format obsolescence • remember … WordPerfect? MARC records? Adobe Flash?
  • 69. strategies for format obsolescence • migrate data to new formats • create a computer software museum with virtual machines • format registries • format validators • don’t worry about it!
  • 70. Jeff Rothenberg on format obsolescence “... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.” Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)
  • 71. standard model for format obsolescence • digital format registry collects information about target format • this information is used to build format identification and verification tools • holders of content use these tools to extract metadata from content in target format; metadata is stored with the content • format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats • on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry • on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
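The "format identification" step in the model above usually starts from file signatures (magic bytes). The following is a minimal sketch under stated assumptions: the `SIGNATURES` table covers only a handful of formats mentioned in this talk, and the `identify` function is a hypothetical helper, not part of any registry's tooling. Real identification tools use much larger signature sets and also validate file structure.

```python
# (signature bytes at offset 0, format label); a deliberately tiny table.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"%PDF-", "PDF"),
]


def identify(header):
    """Return a format label for the leading bytes of a file, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(identify(b"II*\x00\x08\x00\x00\x00"))   # start of a little-endian TIFF
print(identify(b"%PDF-1.4\n"))
print(identify(b"plain text"))                # unknown to this tiny table
```

The label returned here is the kind of metadata the model says should be stored with the content, so that a later obsolescence notification can be matched to the affected files.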
  • 72. David Rosenthal on format obsolescence “... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...” David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)
  • 73. alternate model for format obsolescence • store only essential data • perform only essential tasks • delay performing tasks as long as possible David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).
  • 74. importance of standards vis-a-vis format obsolescence well-defined standards ... • guide developers in creation of tools • facilitate development of a broad range of tools for any format • allow developers to maintain existing tools
  • 75. data migration risks • file format changes, for example, PDF 1.4 to PDF 1.7 • file name differences, for example, case-sensitive / case-insensitive names, new operating system • extended file attributes • file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions • soft links / hard links
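The file name risk above can be checked before a migration. The sketch below is an illustration only: the helper name `case_collisions` and the sample file names are made up. It flags names that are distinct on a case-sensitive file system but would collide on a case-insensitive one.

```python
from collections import defaultdict


def case_collisions(names):
    """Group file names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


batch = ["Page1.tif", "page1.tif", "page2.tif"]
print(case_collisions(batch))
```

Running such a check on the source system shows which files would silently overwrite each other on the target, so they can be renamed before any data is copied.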
  • 76. media obsolescence • 5 ¼” floppy disks • 8 track tapes • 3 ½” floppy disks • ZIP drives • CD-R, CD-RW, Blu-Ray • DAT tapes • microfilm • etc
  • 77. strategies for media obsolescence • migrate data to new media, for example, floppy disks to DVD • create and maintain a computer hardware museum
  • 78. media decay a report by NIST and the Library of Congress says ... • virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years • only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years; some had a life expectancy as short as 1.9 years • in practice actual lifetimes may be considerably shorter
  • 79. prevention / detection of media decay • proper storage • data file checksums (MD5, SHA-1, ...) • monitor media integrity • migrate data from old media to new media
  • 80. bit rot gradual decay of data due to … • storage media failure because of media quality • storage media failure because of improper storage • random events (bit-flip, environmental influences) • software / hardware errors
  • 81. prevention / detection of bit rot • data file fixity check (checksums) such as MD5, SHA-1, ... • monitor file integrity with frequent, corrective audits • duplicate copies, geographically distributed
  • 82. distributed decentralized preservation • the more copies, the safer the data • the more independent copies, the safer the data • the more frequently copies are audited, the safer the data Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?
  • 83. distributed decentralized preservation • n+1 copies are safer than n copies • n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media • data audited every week is safer than data audited every month
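The claim that n+1 copies are safer than n copies can be made concrete with a deliberately simple model. The sketch below assumes each copy is lost independently with the same probability between two audits; both the independence assumption and the sample probability 0.1 are illustrative, not figures from the talk. The function name `loss_probability` is hypothetical.

```python
def loss_probability(p_single, copies):
    """Probability that every copy is lost, assuming independent failures."""
    return p_single ** copies


for copies in (1, 2, 3, 4):
    print(copies, "copies:", loss_probability(0.1, copies))
```

The model also shows why independence matters: copies on identical media in the same room tend to fail together, so the real risk stays closer to that of a single copy than the formula suggests.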
  • 84. LOCKSS: Lots Of Copies Keep Stuff Safe • LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine. • It ingests content from target websites using a web crawler similar to those used by search engines. • It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences. • It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available. • It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content. • It dynamically migrates content to new formats as needed for display. From LOCKSS webpages http://www.lockss.org.
  • 85. how LOCKSS works: data copied to another LOCKSS box (diagram: my library LOCKSS box, library Y LOCKSS box, library X LOCKSS box; label: data)
  • 86. how LOCKSS works: data audited (diagram: my library LOCKSS box, library Y LOCKSS box, library X LOCKSS box; labels: audit, data)
  • 87. how LOCKSS works: data audited (diagram: my library LOCKSS box, library Y LOCKSS box, library X LOCKSS box; labels: audit fails, audit ok, audit, data)
  • 88. how LOCKSS works: data copied to another LOCKSS box (diagram: my library LOCKSS box, library Y LOCKSS box, library X LOCKSS box; label: data)
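The audit and repair cycle in the four diagrams above can be approximated by a toy majority vote over checksums. This is emphatically not the LOCKSS polling protocol, which is considerably more elaborate; it is a sketch under the assumption that each box holds the full content in memory, that the damaged copy sits at "my library" (the slides do not say which box fails), and the function name `audit_and_repair` is hypothetical.

```python
import hashlib
from collections import Counter


def audit_and_repair(copies):
    """copies maps a box name to the bytes it holds.

    Returns (repaired copies, sorted names of boxes that were repaired).
    A copy is treated as good if a strict majority of boxes agree with it.
    """
    digests = {box: hashlib.sha256(data).hexdigest() for box, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority; cannot decide which copy is good")
    good = next(data for box, data in copies.items() if digests[box] == winner)
    repaired = sorted(box for box in copies if digests[box] != winner)
    return {box: good for box in copies}, repaired


boxes = {
    "my library": b"dat?",     # damaged copy (hypothetical)
    "library X": b"data",
    "library Y": b"data",
}
fixed, repaired = audit_and_repair(boxes)
print(repaired, fixed["my library"])
```

With only two boxes that disagree there is no majority and the function refuses to guess, which is one practical reason for keeping more than two independent copies.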
  • 89. private LOCKSS networks • Alabama Digital Preservation Network (http://www.adpn.org/). • CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS (http://www.clockss.org). • MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (http://www.metaarchive.org). • Many others...
  • 90. digital preservation references • Nancy McGovern and Katherine Skinner, editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta, Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf). • David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp. 195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf). • David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/ACM2010.pdf). • Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf) • Joint Information Systems Committee (JISC) Programme on Digital Preservation at http://www.jisc.ac.uk/preservation. • Library of Congress on Digital Preservation at http://www.digitalpreservation.gov. • Stanford University’s website for LOCKSS at http://www.lockss.org.
  • 91. ?
  • 92. Part 3 The importance of communication, specifications, acceptance criteria
  • 93. the problem • Wise men learn by other men's mistakes, fools by their own. H. G. Wells
  • 94. the problem • the 2009 CHAOS Report (The Standish Group) reports that of all software projects surveyed, 44% are “challenged”, 24% failed, and only 32% succeeded
  • 95. the problem • Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month • Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects.
  • 96. the problem • in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements”
  • 97. the problem • in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure"
  • 98. the problem • in a white paper written for Project Perfect, Taimour al Neimat lists the primary causes for the failure of complex IT projects: • poor planning • unclear goals and objectives • objectives changing during the project • unrealistic time or resource estimates • lack of executive support and user involvement • failure to communicate and act as a team • inappropriate skills
  • 99. the problem • a recent tender from an (anonymous) government agency • project to convert ~ 170,000 text images to xml • value of project ~ USD $180,000 • 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc • technical requirements description? < 1 page • data acceptance criteria? “a high level of accuracy”
  • 100. the problem • a recent program established by a prominent national library • digitize more than 20 million text pages • high level image and xml requirements • value of work awarded? > USD $5,000,000 • after award of work, METS xml technical requirements expand to 43+ pages from ~3 pages • acceptance criteria? added as an afterthought and not well defined
  • 101. the problem • acceptance criteria for a digitization program at a prominent library • character accuracy > 80% • word accuracy > 75% • significant word accuracy > 65%
  • 102. the problem • typical tender evaluation criteria in priority order 1. understanding of requirements 2. reputation of service bureau 3. price
  • 103.
  • 104. the problem • communication • acceptance • requirements
  • 105. the illusion • In theory, there's no difference between theory and practice, but in practice, there is. Anonymous • The single biggest problem in communication is the illusion it has taken place. George Bernard Shaw
  • 106. the illusion: waterfall requirements
for each product release
repeat {
    gather requirements
    create architecture
    design
    implement
    test
    use -or- sell
} until (company goes out of business)
  • 107. the illusion: requirements • a recent tender from an (anonymous) government agency • project to convert ~ 170,000 text images to xml • value of project ~ USD $180,000 • 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc • technical requirements description? < 1 page • data acceptance criteria? “a high level of accuracy”
  • 108. the illusion: acceptance criteria • acceptance criteria for a digitization program at a large, well-known, and internationally recognized national library • character accuracy > 80% • word accuracy > 75% • significant word accuracy > 65%
  • 109. the illusion: why (better) communication is necessary (cartoon; Copyright United Media. Used with permission.)
  • 110. the fix • Experience is that marvelous thing that enables you to recognize a mistake when you make it again. F. P. Jones
  • 111. the fix: value of simplicity • “Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.” Antoine de Saint-Exupéry
  • 112. the fix: value of prototypes and pilot batches • “Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.” Butler Lampson • Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow.
  • 113. the fix: value of simplicity • “There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.” C.A.R. Hoare • Professor Sir Charles Anthony Richard Hoare: Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software.
  • 114. the fix: good requirements • unitary: the requirement addresses one and only one thing • complete: the requirement is fully stated in one place with no missing information • consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation • atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements • traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented
  • 115. the fix: good requirements (continued) • current: the requirement has not been made obsolete by the passage of time • feasible: the requirement can be implemented within the constraints of the project • unambiguous: the requirement is concisely stated without recourse to technical jargon or acronyms • verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis
  • 116. the fix: requirements and acceptance criteria • Wikipedia on data quality: The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria
  • 117. the fix: requirements and acceptance criteria • “a high level of accuracy”
  • 118. the fix: requirements and acceptance criteria • “article titles must be 99.5% accurate”
  • 119. the fix: requirements and acceptance criteria • “article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”
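A criterion phrased as "no more than 5 errors in 1000 article title characters" is verifiable because it can be tested mechanically. The sketch below shows one way this might be done, under stated assumptions: errors are counted as character edit distance (insertions, deletions, substitutions) against hand-keyed ground truth, which is a common convention but not something the slides specify, and the function names `edit_distance` and `issue_passes` are hypothetical.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: minimum single-character edits turning truth into ocr."""
    previous = list(range(len(ocr) + 1))
    for i, truth_char in enumerate(truth, 1):
        current = [i]
        for j, ocr_char in enumerate(ocr, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (truth_char != ocr_char)))  # substitution
        previous = current
    return previous[-1]


def issue_passes(truth_titles, ocr_titles, max_errors_per_1000=5):
    """Check one issue's article titles against the acceptance criterion."""
    total_chars = sum(len(title) for title in truth_titles)
    errors = sum(edit_distance(t, o) for t, o in zip(truth_titles, ocr_titles))
    # Integer comparison avoids floating-point rounding at the boundary.
    return errors * 1000 <= max_errors_per_1000 * total_chars


truth = ["The Queenslander."]
ocr = ["The Queenslandcr."]
print(edit_distance(truth[0], ocr[0]), issue_passes(truth, ocr))
```

In the sample, one wrong character in a 17-character title already exceeds the 0.5% budget, so the issue fails unless its other titles are clean; this is the kind of consequence that a vague phrase such as "a high level of accuracy" leaves open to dispute.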
  • 120. the illusion: waterfall requirements
for each product release
repeat {
    gather requirements
    create architecture
    design
    implement
    test
    use -or- sell
} until (company goes out of business)
  • 121. the fix: agile requirements
gather general requirements
create architecture
build prototype software
test
repeat {
    use software
    adjust prototype and/or add new feature
    test
} until (user says stop or runs out of money)
  • 122. the fix: agile data conversion
create requirements and acceptance criteria
repeat {
    digitize (small) pilot batch
    test data against acceptance criteria
    adjust requirements and acceptance criteria
} until (no more adjustments are necessary)
digitize more data
  • 123.
  • 124. the fix: why (better) communication is necessary • “projects are about communication, communication, and communication” Elenbass, B. (2000). “Staging a Project: Are You Setting Your Project Up for Success?”. Proceedings of the Project Management Institute Annual Seminars & Symposiums.
  • 125. the fix: simple principles for (good) communication • be impeccable with your word • don’t take anything personally • don’t make assumptions • always do your best • be mindful
  • 126. the fix: why (better) communication is necessary • no communication ...
  • 127. the fix: why (better) communication is necessary • no communication ... • little communication ...
  • 128. the fix: why (better) communication is necessary • no communication ... • little communication ... • poor communication ...
  • 129. the fix: why (better) communication is necessary • no communication ... • little communication ... • poor communication ... • reduced communication ...
  • 130. the fix: why (better) communication is necessary • no communication ... • little communication ... • poor communication ... • reduced communication ... • ... all result in more assumptions about intent!
  • 131. the fix: how do you communicate? • communication is at most 30% verbal! • remainder - 70% or more - is comprised of gestures, facial expressions, tone of voice, posture, odors, ... • telephone communication removes gestures, facial expressions, posture, odors, etc.; only words and tone of voice remain • written communication - email, requirements, etc - removes all modes of communication save for words
  • 132. the fix: how to communicate • simple: keep it simple, stupid (KISS principle) • repeat: say it twice in different ways • listen: repeat what you hear • respect: respect yourself and others
  • 133. conclusion • for future projects give especial attention to • good, open communication • clear requirements • clear acceptance criteria
  • 134. ? • We all admire the wisdom of people who come to us for advice. Jack Herbert • Frederick Zarndt, Chair, IFLA Newspapers Section, frederick@frederickzarndt.com
