Analyzed Layout and Text Object                          bookTuesday, August 21, 12
Analyzed Layout and Text Object                        newspaperTuesday, August 21, 12
Preservation Metadata                         Implementation Strategies           • PREMIS is a core set of implementable ...
PREMIS data in METS file                     <mets:amdSec>                         <mets:techMD ID="PREMISOBJECT1">       ...
Tuesday, August 21, 12
the digitization process                                                                  ingest                    image ...
digitization magic                                    digitization                           images                  objec...
digitization magic                                                             build                     image       layou...
digitization magic                                                             build                     image       layou...
digitization magic                                                               build                     image       lay...
digitization magic                                                                     build                     image    ...
digitization magic                                                                   build                     image      ...
digitization magic                                                                         build                     image...
real world    digitization    production     workflow   • automatic production steps     performed by software   • manual ...
newspaper digitization programs                   around the world                          National Library of Finland (h...
image references and recommendations   • Ian Bogus et al. Minimum Digitization Capture Recommendations (draft). The Associ...
newspaper digitisation references                  Australian Newspapers Digitisation Program                  https://www...
Russian language periodicals              METS/ALTO XML with JPEG2000 images                                     http://bi...
?                         2Tuesday, August 21, 12
20120822 conversion of historic newspapers to digital objects [russian state library]
20120822 conversion of historic newspapers to digital objects [russian state library]

  1. conversion of newspapers to digital objects, digital data preservation, and other interesting things
     Frederick Zarndt, Chair, IFLA Newspapers Section
     frederick@frederickzarndt.com
     Tuesday, August 21, 2012
  2. why digitize newspapers?
     [Image: scanned page of a historic newspaper; the uncorrected OCR text of the page is omitted here]
  3. reading rooms by the numbers
     Photo by DAVID ILIFF. License: CC-BY-SA 3.0
     Monthly averages

     | Country | Population | Reading room visitors | Newspaper requests: microform | Newspaper requests: print |
     |---|---|---|---|---|
     | Australia | 22,876,000 | 5,130 | 345 | 240 |
     | France | 65,350,000 | 3,000 | 2,000 | 1,000 |
     | Netherlands | 16,847,000 | NA | NA | NA |
     | New Zealand | 4,414,000 | NA | NA | NA |
     | Norway | 4,985,000 | 600 | 400 | NA |
     | Singapore | 5,184,000 | NA | 300 | NA |
     | UK | 62,262,000 | 2,000 | 6,900 | 4,816 |
     | USA | 313,292,000 | NA | NA | NA |
  4. digitised newspapers by the numbers
     Digitised historical newspapers, monthly averages

     | Population | Unique visitors | Genealogist | Other | User age |
     |---|---|---|---|---|
     | 22,876,000 | 150,000 | 50% | 50% | >55 |
     | 37,692,000 | 12,800 | 65% | 35% | >50 |
     | 5,405,000 | NA | NA | NA | ? |
     | 65,350,000 | 22,000 | NA | NA | ? |
     | 16,847,000 | 50,000 | NA | NA | ? |
     | 4,414,000 | 83,333 | 50% | NA | >50 |
     | 4,985,000 | 1,500 | NA | NA | ? |
     | 5,184,000 | 12,400 | NA | NA | ? |
     | 62,262,000 | NA | NA | NA | ? |
     | 313,292,000 | NA | NA | NA | ? |
  5. physical versus digital
     Monthly averages

     | Population | Requests for newspapers (paper + microform) | Digitised historical newspapers: unique visitors |
     |---|---|---|
     | 22,876,000 | 585 | 150,000 |
     | 37,692,000 | NA | 12,800 |
     | 5,405,000 | NA | NA |
     | 65,350,000 | 3,000 | 22,000 |
     | 16,847,000 | NA | 50,000 |
     | 4,414,000 | NA | 83,333 |
     | 4,985,000 | 400 | 1,500 |
     | 5,184,000 | 300 | 12,400 |
     | 62,262,000 | 11,716 | NA |
     | 313,292,000 | NA | NA |
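The contrast in the "physical versus digital" figures can be expressed as a simple ratio. The Python sketch below is illustrative only: it uses the four rows where both figures are reported, keys them by population because the slide text does not name the rows, and the function and variable names are assumptions. Requests and unique visitors are different units, so the ratio is a rough indicator and not a like-for-like comparison.

```python
# Monthly averages from the "physical versus digital" slide.
# Keys are population figures, since the slide text does not name the rows.
# Values are (paper + microform requests, unique online visitors).
rows = {
    22_876_000: (585, 150_000),
    65_350_000: (3_000, 22_000),
    4_985_000: (400, 1_500),
    5_184_000: (300, 12_400),
}


def digital_to_physical(requests: int, visitors: int) -> float:
    """Unique online visitors per physical newspaper request."""
    return visitors / requests


ratios = {pop: digital_to_physical(*pair) for pop, pair in rows.items()}

# Print the rows from the largest ratio to the smallest.
for pop, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pop:>11,}: {ratio:6.1f} online visitors per physical request")
```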
  6. more numbers!
     Digitised historical newspapers, monthly averages

     | Population | Collection name | ~Size [pages] | Unique visitors | Genealogist | Other | Lines corrected | User age |
     |---|---|---|---|---|---|---|---|
     | 22,876,000 | Trove | 5,000,000 | 150,000 | 50% | 50% | 220,000 | >55 |
     | 37,692,000 | CDNC | 495,000 | 12,800 | 65% | 35% | 31,000 | >50 |
     | 5,405,000 | Historical Newspaper Library | 2,000,000 | NA | NA | NA | NA | ? |
     | 65,350,000 | Gallica | 2,200,000 | 22,000 | NA | NA | NA | ? |
     | 16,847,000 | Historische Kranten | 5,000,000 | 50,000 | NA | NA | NA | ? |
     | 4,414,000 | Papers Past | 2,213,000 | 83,333 | 50% | NA | NA | >50 |
     | 4,985,000 | NBDigital Aviser | 8,100,000 | 1,500 | NA | NA | NA | ? |
     | 5,184,000 | Newspaper SG | 2,400,000 | 12,400 | NA | NA | NA | ? |
     | 62,262,000 | British Newspaper Archive | 4,880,000 | NA | NA | NA | NA | ? |
     | 313,292,000 | Chronicling America | 4,100,000 | NA | NA | NA | NA | ? |
  7. what is Alexa?
     • Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis, and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.
     • Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.
     • Alexa's operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive, accessible through the Wayback Machine.
     • Alexa continually crawls all publicly available websites to create a series of snapshots of the web.
     • Alexa gathers information from a variety of sources to provide key statistics about each site on the web, for example Traffic Rank, number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).
  8. definitions
     • A PageView is a request for a file whose type is defined as a page.
     • A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (e.g. day, week, or month). A Unique Visitor counts once within the timescale.
     • A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request.
     • Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between.
     • World | Country Rank is a function of the average daily unique visits and the number of unique pages requested.
     Definitions adapted from Wikipedia, http://en.wikipedia.org/wiki/Web_analytics
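The definitions above translate directly into a small log-processing routine. The following Python sketch is a hedged illustration, not Alexa's method: the `(client_id, timestamp, page)` record layout, the function name `summarize`, and the sample log are all assumptions. A visit ends when more than 30 minutes pass between two requests from the same client, and a bounce is counted as a visit in which only one distinct page was requested.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # maximum gap between requests within one visit


def summarize(requests):
    """Summarize page requests given as (client_id, timestamp, page) tuples."""
    # Group requests per client, ordered by time.
    by_client = {}
    for client, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        by_client.setdefault(client, []).append((ts, page))

    # Split each client's requests into visits using the 30-minute rule.
    visits = []  # each visit is the list of pages requested during it
    for hits in by_client.values():
        current, last_ts = [], None
        for ts, page in hits:
            if last_ts is not None and ts - last_ts > VISIT_TIMEOUT:
                visits.append(current)
                current = []
            current.append(page)
            last_ts = ts
        visits.append(current)

    # A bounce is a visit that never left its entry page.
    bounces = sum(1 for pages in visits if len(set(pages)) == 1)
    return {
        "page_views": sum(len(pages) for pages in visits),
        "unique_visitors": len(by_client),
        "visits": len(visits),
        "bounce_rate": bounces / len(visits) if visits else 0.0,
    }


# Hypothetical sample log: client "a" returns after two hours, client "b" views one page.
t0 = datetime(2012, 4, 2, 9, 0)
log = [
    ("a", t0, "/"),
    ("a", t0 + timedelta(minutes=10), "/search"),
    ("a", t0 + timedelta(hours=2), "/"),
    ("b", t0 + timedelta(minutes=5), "/"),
]
stats = summarize(log)
print(stats)
```

In this sample, client "a" produces two visits and client "b" one, so two of the three visits are bounces.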
  9. Alexa ranking world view
     Alexa 3 month trailing averages, 2-Apr-2012

     | Population | Website | World rank [Lo is good] |
     |---|---|---|
     | 313,292,000 | http://www.loc.gov/index.html/ | 3,122 |
     | 22,876,000 | http://trove.nla.gov.au/ | 16,700 |
     | 65,350,000 | http://www.bnf.fr/ | 17,096 |
     | 62,262,000 | http://www.bl.uk/ | 27,079 |
     | 4,414,000 | http://www.natlib.govt.nz/ | 123,976 |
     | 62,262,000 | http://www.britishnewspaperarchive.co.uk/ | 155,259 |
     | 16,847,000 | http://www.kb.nl/ | 155,363 |
     | 5,184,000 | http://www.nl.sg/ | 156,610 |
     | 4,985,000 | http://www.nb.no/ | 189,940 |
     | 5,405,000 | http://www.nationallibrary.fi/ | 3,212,803 |
  10. Alexa ranking country view
     Alexa 3 month trailing averages, 2-Apr-2012

     | Population | Website | World rank [Lo is good] | Country rank [Lo is good] |
     |---|---|---|---|
     | 5,405,000 | http://www.nationallibrary.fi/ | 3,212,803 | 199 |
     | 22,876,000 | http://www.nla.gov.au/ | 16,700 | 375 |
     | 4,414,000 | http://www.natlib.govt.nz/ | 123,976 | 515 |
     | 65,350,000 | http://www.bnf.fr/ | 17,096 | 727 |
     | 4,985,000 | http://www.nb.no/ | 189,940 | 891 |
     | 313,292,000 | http://www.loc.gov/index.html/ | 3,122 | 1,011 |
     | 5,184,000 | http://www.nl.sg/ | 156,610 | 1,208 |
     | 62,262,000 | http://www.bl.uk/ | 27,079 | 2,245 |
     | 16,847,000 | http://www.kb.nl/ | 155,363 | 3,450 |
     | 62,262,000 | http://www.britishnewspaperarchive.co.uk/ | 155,259 | 15,692 |
  11. where visitors go
     Alexa 3 month trailing averages, 2-Apr-2012

     | Population | World rank [Lo is good] | Country rank [Lo is good] | Where visitors go [sub-domain] | Share of visitors |
     |---|---|---|---|---|
     | 5,405,000 | 3,212,803 | 199 | NA | NA |
     | 22,876,000 | 16,700 | 375 | http://trove.nla.gov.au/ | 57.2% |
     | 4,414,000 | 123,976 | 515 | http://paperspast.natlib.govt.nz/ | 50.9% |
     | 65,350,000 | 17,096 | 727 | http://gallica.bnf.fr/ | 52.0% |
     | 4,985,000 | 189,940 | 891 | NA | NA |
     | 313,292,000 | 3,122 | 1,011 | http://chroniclingamerica.loc.gov/ | 4.8% |
     | 5,184,000 | 156,610 | 1,208 | http://newspapers.nl.sg/ | 28.0% |
     | 62,262,000 | 27,079 | 2,245 | http://newspapers11.bl.uk/blcs/ | 2.5% |
     | 16,847,000 | 155,363 | 3,450 | http://kranten.kb.nl/ | 22.4% |
     | 62,262,000 | 155,259 | 15,692 | NA | NA |
  12. lots of numbers (sorted by time on site)
     Alexa 3 month trailing averages, 2-Apr-2012

     | Website | Speed [Hi is good] | Bounce rate [Lo is good] | Reputation [Hi is good] | Page views per visitor [Hi is good] | Time on site [Hi is good] |
     |---|---|---|---|---|---|
     | http://www.britishnewspaperarchive.co.uk/ | 51% | 28% | 485 | 13.0 | 11m 40s |
     | http://www.bnf.fr/ | 71% | 35% | 13,744 | 14.9 | 8m 30s |
     | http://www.natlib.govt.nz/ | 96% | 44% | 2,480 | 5.3 | 6m 49s |
     | http://trove.nla.gov.au/ | 42% | 55% | 9,514 | 5.4 | 4m 52s |
     | http://www.loc.gov/index.html/ | 67% | 51% | 91,331 | 5.3 | 3m 55s |
     | http://www.kb.nl/ | 89% | 54% | 3,295 | 5.0 | 3m 42s |
     | http://www.bl.uk/ | 54% | 52% | 16,191 | 3.8 | 3m 2s |
     | http://www.nb.no/ | 59% | 47% | 1,579 | 3.0 | 2m 57s |
     | http://www.nationallibrary.fi/ | NA | 54% | 199 | 3.1 | 2m 6s |
     | http://www.nl.sg/ | 72% | 65% | 802 | 2.0 | 2m 4s |
  13. even more numbers (sorted by time on site)
     Alexa 3 month trailing averages, 2-Apr-2012

     | Website | Speed [Hi is good] | Bounce rate [Lo is good] | Reputation [Hi is good] | Page views per visitor [Hi is good] | Time on site [Hi is good] |
     |---|---|---|---|---|---|
     | http://www.ancestry.com/ | 32% | 24% | 20,055 | 29.9 | 23m 54s |
     | http://www.familysearch.org/ | 50% | 18% | 9,832 | 15.8 | 16m 19s |
     | http://www.britishnewspaperarchive.co.uk/ | 51% | 28% | 485 | 13.0 | 11m 40s |
     | http://www.bnf.fr/ | 71% | 35% | 13,744 | 14.9 | 8m 30s |
     | http://www.natlib.govt.nz/ | 96% | 44% | 2,480 | 5.3 | 6m 49s |
     | http://trove.nla.gov.au/ | 42% | 55% | 9,514 | 5.4 | 4m 52s |
     | http://www.loc.gov/index.html/ | 67% | 51% | 91,331 | 5.3 | 3m 55s |
     | http://www.kb.nl/ | 89% | 54% | 3,295 | 5.0 | 3m 42s |
     | http://www.bl.uk/ | 54% | 52% | 16,191 | 3.8 | 3m 2s |
     | http://www.nb.no/ | 59% | 47% | 1,579 | 3.0 | 2m 57s |
     | http://www.nationallibrary.fi/ | NA | 54% | 199 | 3.1 | 2m 6s |
     | http://www.nl.sg/ | 72% | 65% | 802 | 2.0 | 2m 4s |
  14. why digitize newspaper collections?
     digital newspapers enable broader, easier, and faster access
  15. considerations in newspaper digitization
  16. selection criteria
     • importance of title
     • complete (no missing issues)
     • temporal coverage
     • research value
     • quality / fragility of original documents
     • quality of microfilm
     • etc. (other local criteria)
  17. page-level versus article-level newspaper digitization

     | | cost | production difficulty | copyright management | usability | accessibility |
     |---|---|---|---|---|---|
     | page-level | $ | easy | usually simple | low | good |
     | article-level | $$$ | hard | usually complex | excellent | excellent |
  18. preservation, access, administration
     Open Archival Information System (OAIS) reference model
  19. the digitization process
     [Diagram; labels: image production, images, digitization magic, text objects, ingest, preservation, access]
  20. standard file formats
     • image file formats
        • TIFF
        • JPEG2000
        • JPEG
        • GIF
     • text file formats
        • PDF, PDF/A, PDF/A-1b, PDF/A-1a
        • TEI XML
        • HTML
        • plain text
        • NITF / NewsML
     • metadata
        • METS
        • MODS / PREMIS / ALTO / MIX ...
  21. image decisions
     • image production source materials?
        • original documents: better quality, more expensive
        • microfiche: poorer quality, less expensive, microfiche quality varies
     • bit depth
        • black-and-white (bitonal)
        • greyscale
        • color
     • resolution
     • compression
        • no compression
        • lossless (reversible)
        • lossy (irreversible)
     • image metadata
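Bit depth and compression choices drive storage cost, which a little arithmetic makes concrete. The Python sketch below estimates the uncompressed pixel data for one page at three bit depths. The page size of 7136 × 9224 pixels is borrowed from the ALTO sample shown later in this deck; the function name is an assumption, and the estimate ignores file headers, embedded metadata, and any row padding a real format may add.

```python
def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw pixel data size in bytes, rounded up to a whole byte."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


# Page dimensions in pixels, taken from the ALTO sample in this deck.
WIDTH, HEIGHT = 7136, 9224

sizes = {
    "bitonal (1-bit)": uncompressed_bytes(WIDTH, HEIGHT, 1),
    "greyscale (8-bit)": uncompressed_bytes(WIDTH, HEIGHT, 8),
    "color (24-bit)": uncompressed_bytes(WIDTH, HEIGHT, 24),
}

for label, size in sizes.items():
    print(f"{label:<18} {size / 1_000_000:7.1f} MB uncompressed")
```

Multiplying the per-page figure by a collection's page count gives a first estimate of master-file storage before compression is considered.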
  22. image format comparison

     | format | compression | bit depth | color management | metadata | mime type | patent | 1st public release |
     |---|---|---|---|---|---|---|---|
     | JBIG (.jbig, .jbg) | lossless | 1-bit | no | no | | | 2000? |
     | JPEG (.jpg, .jpeg) | lossy, DCT, RLE, Huffman | 8-bit, 12-bit, 24-bit | yes | yes | image/jpeg, public.jpeg | no | 1992 |
     | JPEG2000 (.jp2) | many lossless and lossy compression algorithms | 8-bit, 16-bit, color to 48 bits | yes | yes | image/jp2, public.jpeg200 | yes, but part 1 is patent free | 2000 |
     | TIFF (.tiff, .tif) | none, LZW, RLE, ZIP, other | 1, 2, 4, 8, 16, 24, 32 bits | yes | yes | image/tiff, public.tiff | no | 1986 |

     Source: Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)
  23. digital library standards
     • METS XML for descriptive, structural, technical, and administrative metadata
     • descriptive metadata
        • Metadata Object Description Schema (MODS): selected metadata from MARC
        • Dublin Core: fundamental group of text elements for describing and cataloging
     • technical metadata
        • ALTO for OCR text
        • PREMIS for digital preservation
        • MIX and ANSI/NISO Z39.87 for images
  24. Metadata Encoding and Transmission Standard
     • METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library
     • METS files consist of 7 (optional) sections: header, descriptive, administrative, file map, structural map, structural link, and behavior
     • METS profiles describe a class of METS documents in sufficient detail to give both document authors and programmers the guidance to create and process METS documents conforming to a particular profile
     • current version 1.9.1
     • administered by METS editorial board (international group of volunteers)
     • standards hosted by Library of Congress at http://www.loc.gov/standards/mets/
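To make the seven-section layout concrete, the Python sketch below counts which top-level sections appear in a METS document, using only the standard library. The element names and namespace URI come from the METS schema; the tiny sample document and the function name `sections_present` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Top-level METS elements for: header, descriptive, administrative,
# file map, structural map, structural link, and behavior.
SECTIONS = [
    "metsHdr", "dmdSec", "amdSec", "fileSec",
    "structMap", "structLink", "behaviorSec",
]

# Hypothetical minimal document, not taken from any real collection.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr/>
  <mets:dmdSec ID="dmd1"/>
  <mets:fileSec/>
  <mets:structMap><mets:div/></mets:structMap>
</mets:mets>"""


def sections_present(mets_xml: str) -> dict:
    """Return how many times each top-level METS section occurs."""
    root = ET.fromstring(mets_xml)
    return {name: len(root.findall(f"{{{METS_NS}}}{name}")) for name in SECTIONS}


counts = sections_present(SAMPLE)
for name, count in counts.items():
    print(f"{name:<12} {count}")
```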
  25. METS file structure
     Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.
  26. Metadata Object Description Schema
     • MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones
     • Subset of MARC 21
     • Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist)
     • May be used in conjunction with METS XML
     • current version 3.4
     • administered by Library of Congress Network Development and MARC Standards Office with help from interested users
     • standards hosted by Library of Congress at http://www.loc.gov/standards/mods/
  27. MODS metadata in METS XML

     <mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
       <mets:mdWrap MDTYPE="MODS">
         <mets:xmlData>
           <mods:mods xmlns="http://www.loc.gov/mods/v3">
             <mods:language>
               <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
             </mods:language>
             <mods:genre>newspaper issue</mods:genre>
             <mods:originInfo>
               <mods:dateIssued>18740425</mods:dateIssued>
             </mods:originInfo>
             <mods:relatedItem type="host">
               <mods:titleInfo>
                 <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
               </mods:titleInfo>
               <mods:genre>newspaper</mods:genre>
               <mods:identifier>ISSN18368190</mods:identifier>
               <mods:part>
                 <mods:detail type="volume">
                   <mods:number>IX</mods:number>
                 </mods:detail>
               </mods:part>
               <mods:part>
                 <mods:detail type="issue">
                   <mods:number>12</mods:number>
                 </mods:detail>
               </mods:part>
             </mods:relatedItem>
           </mods:mods>
         </mets:xmlData>
       </mets:mdWrap>
     </mets:dmdSec>
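A record like the one on this slide can be read with Python's standard library. The sketch below is a minimal illustration under stated assumptions: the slide's fragment does not show where the `mets:` and `mods:` prefixes are bound, so the abridged copy here declares both namespaces itself, and the function name `describe_issue` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Abridged copy of the slide's record; the xmlns declarations are added
# here (an assumption) so that the fragment parses on its own.
ISSUE = """<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS"><mets:xmlData><mods:mods>
    <mods:originInfo><mods:dateIssued>18740425</mods:dateIssued></mods:originInfo>
    <mods:relatedItem type="host">
      <mods:titleInfo>
        <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
      </mods:titleInfo>
      <mods:part><mods:detail type="volume"><mods:number>IX</mods:number></mods:detail></mods:part>
      <mods:part><mods:detail type="issue"><mods:number>12</mods:number></mods:detail></mods:part>
    </mods:relatedItem>
  </mods:mods></mets:xmlData></mets:mdWrap>
</mets:dmdSec>"""


def describe_issue(xml_text: str) -> dict:
    """Extract title, issue date, volume, and issue number from a MODS record."""
    root = ET.fromstring(xml_text)
    mods = root.find(".//mods:mods", NS)
    host = mods.find("mods:relatedItem[@type='host']", NS)
    raw_date = mods.findtext("mods:originInfo/mods:dateIssued", namespaces=NS)
    return {
        "title": host.findtext("mods:titleInfo/mods:title", namespaces=NS),
        # The sample encodes the date as YYYYMMDD.
        "date": datetime.strptime(raw_date, "%Y%m%d").date().isoformat(),
        "volume": host.findtext(
            "mods:part/mods:detail[@type='volume']/mods:number", namespaces=NS),
        "issue": host.findtext(
            "mods:part/mods:detail[@type='issue']/mods:number", namespaces=NS),
    }


info = describe_issue(ISSUE)
print(info)
```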
  28. Dublin Core metadata
     • Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery.
     • Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85
     • Metadata terms last updated 14-Jun-2012
     • May be used in conjunction with METS XML
     • Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore
     • Dublin Core Metadata Initiative is hosted at http://dublincore.org/
  29. Analyzed Layout and Text Object
     • ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper
     • commonly used in conjunction with METS XML but may be used standalone
     • current version 2.0
     • administered by ALTO editorial board (international group of volunteers)
     • standards hosted by Library of Congress at http://www.loc.gov/standards/alto/
  30. 30. Analyzed Layout and Text Object
      <?xml version="1.0" encoding="UTF-8"?>
      <alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
      <Description>
        <MeasurementUnit>pixel</MeasurementUnit>
        <sourceImageInformation>
          <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
        </sourceImageInformation>
      </Description>
      <Styles>
        <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
        <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
      </Styles>
      <Layout>
        <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
          <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
          <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
          <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
          <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
          <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
            <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
              <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
                <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
                  <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
                    <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
                    <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
                    <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
                  </TextLine>
                </TextBlock>
              </ComposedBlock>
              <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
              <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
                  <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
                    <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
                    <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
                    <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
                  </TextLine>
                </TextBlock>
              </ComposedBlock>
            </ComposedBlock>
          </PrintSpace>
        </Page>
      </Layout>
      </alto>
  31. 31. Analyzed Layout and Text Object: book
  32. 32. Analyzed Layout and Text Object: newspaper
  33. 33. Preservation Metadata: Implementation Strategies (PREMIS) • PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use • In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group, composed of international experts in the use of metadata to support digital preservation activities • PREMIS data dictionary current version 2.2 • May be used in conjunction with METS XML • PREMIS tools are freely available • The PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry • The PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/
  34. 34. PREMIS data in METS file
      <mets:amdSec>
        <mets:techMD ID="PREMISOBJECT1">
          <mets:mdWrap MDTYPE="PREMIS">
            <mets:xmlData>
              <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
                <premis:objectIdentifier>
                  <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
                  <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
                </premis:objectIdentifier>
                <premis:objectCategory>file</premis:objectCategory>
                <premis:objectCharacteristics>
                  <premis:format>
                    <premis:formatDesignation>
                      <premis:formatName>TIFF</premis:formatName>
                      <premis:formatVersion>TIFF 6.0</premis:formatVersion>
                    </premis:formatDesignation>
                  </premis:format>
                </premis:objectCharacteristics>
                <premis:relationship>
                  <premis:relationshipType>derivation</premis:relationshipType>
                  <premis:relationshipSubType>is derivative of</premis:relationshipSubType>
                  <premis:relatedObjectIdentification>
                    <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType>
                    <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue>
                    <premis:relatedObjectSequence>0</premis:relatedObjectSequence>
                  </premis:relatedObjectIdentification>
                  <premis:relatedEventIdentification>
                    <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType>
                    <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue>
                    <premis:relatedEventSequence>0</premis:relatedEventSequence>
                  </premis:relatedEventIdentification>
                </premis:relationship>
              </premis:object>
            </mets:xmlData>
          </mets:mdWrap>
        </mets:techMD>
      </mets:amdSec>
  36. 36. the digitization process [diagram; labels as extracted: image production, images, digitization magic, text, objects, ingest, preservation, access]
  37. 37. digitization magic [diagram: images → digitization magic → objects]
  38. 38. digitization magic [diagram: images → image processing → layout analysis → OCR → metadata → build digital objects → objects]
  39. 39. digitization magic: image processing • crop, de-skew, split images • apply image improvement algorithms as needed (sharpening filters, local adaptive thresholding, removal of text bleed-through, etc.) • create master images • create working images
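
To make "local adaptive thresholding" more concrete, here is a minimal pure-Python sketch, not the method used by any particular production system. It compares each pixel with the mean of its surrounding window (computed with an integral image) instead of using one global cutoff, which is why it copes with unevenly lit or yellowed pages. The function name, the `window` and `offset` parameters, and the tiny sample page are all illustrative assumptions.

```python
def adaptive_threshold(gray, window=3, offset=5):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 0 (ink) when it is darker than the mean of its
    window x window neighbourhood minus offset; otherwise 255 (paper).
    """
    height, width = len(gray), len(gray[0])

    # Integral image with one extra row and column of zeros, so that the
    # sum of any rectangle can be read off with four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    result = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            local_mean = total / ((y1 - y0) * (x1 - x0))
            out_row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(out_row)
    return result


# Hypothetical 3x5 "page": light paper with a single dark pixel of ink.
page = [
    [200, 200, 200, 200, 200],
    [200, 200, 40, 200, 200],
    [200, 200, 200, 200, 200],
]
binary = adaptive_threshold(page)
for row in binary:
    print(row)
```

Real newspaper scans would use a much larger window (tens of pixels) and an optimized image library; the logic stays the same.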
  40. 40. digitization magic: layout analysis • analyze layout of text image • estimate font types and sizes • calculate coordinates of text blocks • determine layout object types (text, illustration, headline, etc.)
  41. 41. digitization magic: OCR • perform optical character recognition (OCR) • calculate word and character coordinates • calculate word and character confidences • apply language dictionaries • correct OCR text (optional)
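
The word coordinates and confidences produced at this step end up in ALTO as attributes on `String` elements (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, `WC`), as in the sample on slide 30. The sketch below reads them back with Python's standard library and averages the word confidence per text line. The `read_lines` helper and the trimmed `ALTO_SAMPLE` are hypothetical, and the sample assumes a file without an XML namespace, as in the slide's example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt modelled on the ALTO sample from slide 30.
ALTO_SAMPLE = """<alto>
  <Layout>
    <Page ID="P1" HEIGHT="9224" WIDTH="7136">
      <PrintSpace ID="P1_PS00001">
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
            <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98"/>
            <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
            <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml):
    """Return (line id, line text, mean word confidence) for each TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iter("TextLine"):
        # Each String element is one recognized word with its confidence.
        words = [(s.get("CONTENT"), float(s.get("WC", "0")))
                 for s in line.iter("String")]
        text = " ".join(word for word, _ in words)
        mean_wc = sum(conf for _, conf in words) / len(words) if words else 0.0
        lines.append((line.get("ID"), text, mean_wc))
    return lines


lines = read_lines(ALTO_SAMPLE)
for line_id, text, mean_wc in lines:
    print(line_id, repr(text), round(mean_wc, 2))
```

A per-line or per-page average like this is one simple way to flag pages whose OCR may need manual correction.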
  42. 42. digitization magic: metadata • populate metadata fields • verify / correct page numbers • verify / correct document structure
  43. 43. digitization magic: build digital objects • create METS / ALTO XML files • create image files and image metadata • create PDF files (if required) • verify digital object • calculate file fixity checks (checksums) • perform file validation and verification • perform quality assurance
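
A fixity check in its simplest form is a checksum recorded when the file is created and recomputed later to confirm the bytes have not changed. The sketch below shows one way to do this with Python's `hashlib`. The helper names, the choice of SHA-256, and the stand-in file are assumptions for illustration; a production workflow would follow whatever algorithm its preservation policy specifies.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="sha256"):
    """Return True when the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected.lower()


# Demonstration with a stand-in file (not a real TIFF).
with tempfile.TemporaryDirectory() as workdir:
    master = os.path.join(workdir, "page-0001.tif")
    with open(master, "wb") as handle:
        handle.write(b"stand-in bytes for a master image")

    recorded = file_checksum(master)          # stored with the digital object
    intact = verify_fixity(master, recorded)  # later check, file unchanged

    with open(master, "ab") as handle:        # simulate silent corruption
        handle.write(b"\x00")
    damaged = verify_fixity(master, recorded)

print(intact, damaged)
```

The recorded digest would typically be stored alongside the object's metadata so the check can be repeated throughout the object's life.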
  44. 44. real-world digitization production workflow • automatic production steps performed by software • manual production steps performed by operators
  45. 45. newspaper digitization programs around the world • National Library of Finland (http://digi.kansalliskirjasto.fi/) • British Newspaper Archives, British Library (http://www.bl.uk/welcome/newspapers) • National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/) • National Library of New Zealand (http://paperspast.natlib.govt.nz/) • National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper) • Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/) • Singapore National Library Board (http://newspapers.nl.sg/) • Bibliothèque nationale de France (http://gallica.bnf.fr/) • Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/) • National Library of Latvia (https://periodika.lndb.lv/)
  46. 46. image references and recommendations • Ian Bogus et al. Minimum Digitization Capture Recommendations (draft). The Association for Library Collections and Technical Services. June 2012 (accessed 18 August 2012 at http://connect.ala.org/node/185648). • Robert Buckley and Simon Tanner. JPEG 2000 as a Preservation and Access Format for the Wellcome Trust Digital Library. Xerox Corporation and King’s College Digital Consultancy for the Wellcome Trust Library. August 2009 (accessed 1 July 2012 at http://library.wellcome.ac.uk/assets/wtx056572.pdf). • Paolo Buonora and Franco Liberati. A Format for Digital Preservation of Images: A Study on JPEG 2000 File Robustness. D-Lib Magazine. July/August 2008 (accessed 1 July 2012 at http://www.dlib.org/dlib/july08/buonora/07buonora.html). • ANSI/NISO Z39.87-2006. Data Dictionary -- Technical Metadata for Digital Still Images. National Information Standards Organization, Bethesda, Maryland USA. December 2006 (accessed 1 August 2012 at http://www.niso.org/apps/group_public/download.php/6502/Data%20Dictionary%20-%20Technical%20Metadata%20for%20Digital%20Still%20Images.pdf). • JBIG Standard (accessed 1 August 2012 at http://www.jpeg.org/jbig). • JPEG Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg). • JPEG2000 Standard (accessed 1 August 2012 at http://www.jpeg.org/jpeg2000/). • TIFF 6.0 Standard (accessed 1 August 2012 at http://partners.adobe.com/public/developer/tiff). • Many, many others.
  47. 47. newspaper digitisation references • Australian Newspapers Digitisation Program: https://www.nla.gov.au/ndp/ • Europeana Newspapers: http://www.europeana-newspapers.eu/ • IFLA Newspapers Section: http://www.ifla.org/en/newspapers • IMPACT Centre of Competence: http://www.digitisation.eu/ • Koninklijke Bibliotheek Historische Kranten (the Netherlands): http://kranten.kb.nl/about • Library of Congress National Digital Newspaper Program: http://www.loc.gov/ndnp/
  48. 48. Russian language periodicals • METS/ALTO XML with JPEG2000 images • http://bit.ly/russianperiodicals • Try crowdsourcing when you visit the URL above! • Learn more about the software and crowdsourcing at http://www.dlconsulting.com.
  49. 49. ? 2