British Library - Digitising Historic Newspapers


Presentation given at Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text workshop held at UKOLN, University of Bath on 24th September 2009


  2. 2. 825 million pages
  3. 3. BACKGROUND <ul><li>Funding was secured in 2004 & 2007 from JISC to provide on-line access to a mass of historic newspaper content for learning, teaching and research. </li></ul><ul><li>Deliverables </li></ul><ul><ul><li>Scanning of complete newspaper runs held by BL </li></ul></ul><ul><ul><li>3 million pages of C18 & C19 newspapers stabilised and filmed </li></ul></ul><ul><ul><li>Article zoning and page extraction </li></ul></ul><ul><ul><li>OCR of page images </li></ul></ul><ul><ul><li>Production of required metadata </li></ul></ul>
  4. 4. PROJECT AIMS <ul><li>Free access for the academic community to a content-rich online service </li></ul><ul><li>Access to out-of-copyright UK printed material </li></ul><ul><li>Access to a mix of national and provincial newspapers, the majority from new microfilm </li></ul><ul><li>Access to the entire content of each newspaper via OCR, including adverts, pictures, tables and all articles </li></ul>
  5. 5. SELECTION AND CONSULTATION <ul><li>Creation of User Panel of academics </li></ul><ul><ul><li>UK-wide coverage, breadth of century, national and regional titles </li></ul></ul><ul><li>Online questionnaire created; exercise conducted Feb-Mar 2005 </li></ul><ul><ul><li>Users asked to rank titles in order of priority </li></ul></ul><ul><ul><li>Replies endorsed UK-wide coverage and a relevant mix of national/regional titles </li></ul></ul>
  6. 6. <ul><li>48 titles </li></ul><ul><li>3 million pages </li></ul><ul><li>10 million articles? </li></ul><ul><li>Variations in quality </li></ul><ul><li>Variations in structure </li></ul><ul><ul><li>Daily vs weekly </li></ul></ul><ul><ul><li>Size </li></ul></ul><ul><ul><li>Layout </li></ul></ul>
  7. 7. PHYSICAL CHARACTERISTICS OF SOURCE MATERIAL: Bleed through, Stains, Tight binding, Holes/tears, Creases, Paper quality, Inconsistent inking, Dirt, Stamps, Printer errors, Animals, Repairs, Lamination
  8. 8. High-level view of processes (diagram labels): Original, Metadata, XML Encoding, Microfilm, Digital Images, Website/Delivery System
  9. 9. Delivery: Greyscale v Bitonal
  10. 10. OCR: Greyscale v Bitonal
  11. 11. THE OCR CHALLENGE <ul><li>Tiny text </li></ul><ul><li>Varying formats </li></ul><ul><li>Uneven printing </li></ul><ul><li>Vertical skew </li></ul><ul><li>Multiple columns </li></ul>
  12. 12. Optical Character Recognition (OCR) Exceptional Good Poor Worthless THE VIKING'S SONG Now skall to the Vikings, the Vikings so bold, So fearless in battle, so famous of old, With swarthy, tanned features, and long locks of gold; Ahoi ! my bold Vikings, ahoi ! We plunder the noble, we plunder the priest, We rob the fat abbot to furnish our feast, There's no fare so fine as the convent-fed beast; Ahoi ! my bold Vikings, ahoi I What vessels of Venice can vaunt to be lighter? What blades of Toledo can boast being brighter? What man to the Viking can match as a fighter? Ahoi I my bold Vikings, ahoi I Our sword is our father, our ship is our mother, Our shield is our sister, our breastplate our brother,- Thus, ask us our kindred, we say we've no other; Ahoi ! my bold Vikings, ahoi ! So now slack the ropes, turn the sails to the wind, And smartly the reefs of the canvas unbind, As we sweep o'er the ocean more plunder to find; Ahoi ! my bold Vikings, ahoi ! (Exrh-ads from the New York Papers.) JACKSON IONEY. It is with great pleasure that we per- ceive the true Jackson money is now ia circulation.. Half eagles of Jackson coinage are passing freely from, hand to hand this morning, and all who ^get hold of them seem to feel at once the superiority of such real money to the miserable p.laper substitute withl which, the spirit of aristocracy would still continue to cheat the people. The new eagles, half eagles, and quarters are really beautiful coins-at least so we ate assured, in relation to the eagles and quarters, and so we can attestfroux: our own examination, in relation to the halves. The Globe says, "It is de- voutly to be hoped that the Mint may be able to suppl, all the pressing de- mands -on it-and .that overy~indepen- dent citizen may obtain a low pieces to carry and preserve as a charm against the sorceries of the mammoth. I SINGULAR AND SERIOUS ACCIDENT.- -O11 WeU Iwtje e noon Mr. Charles %Vyber, of the Borougll-roadt, V Fleet Prison to visit a friend, and joined a party III room, who entered into the foolish a seincitllt of 1 pcnny-p)icce to the top of thle room, andl eatchivug tle mothli upon its descent to the lloor. Mr.) L sidered a perfect adept at this game bot time Ile" last found its way into thle throat, where it V't0tdt wards of half an hour. A Surgeon tried tv folCe but being ulnable to do so, lie contrived to mDOVC t into the stomach. Mr. Wyyber was commmpariatLvl lieved by the penny-piece being riemoved out ot the was enabiled, in the evening, to be carried to hackney-coach. la 112 B ik e my lat arrived the >Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath, ' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn- .die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine and grocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;' 4Stalled the AluidonG.: ceror' Lkndon, with sundries; : ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,; ;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman, for eathly Newpot;agd llford; -Tw Br.otherAs, lawces, fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per- wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for .:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foir ouck , + iii ballasto I _______~ ~ ~ ~~~Ai
  13. 13. Key factors affecting OCR accuracy <ul><li>Mass production environment – impossible to hand-tweak every image; a compromise between time and quality </li></ul><ul><li>Software – always improving and developing </li></ul><ul><li>Quality of text varies within a run – see images </li></ul><ul><li>Complexity of layouts and formats varies across the 48 titles </li></ul><ul><li>Microfilm source – not an issue for this project, as the microfilm is of a very good standard, but could be in future projects </li></ul>
  14. 14. Why bother with OCR? <ul><li>Calculating OCR character accuracy is time-consuming and ultimately misleading </li></ul><ul><li>Character accuracy vs Word accuracy </li></ul><ul><li>Word accuracy vs Significant Words </li></ul><ul><li>Why OCR? </li></ul><ul><li>Provides smallest level of access into the information </li></ul><ul><li>Size of project is such that detailed descriptions in the metadata are impossible </li></ul>
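
The three accuracy measures named on this slide can be compared with a minimal sketch. It is not the project's own measurement method: the alignment via Python's `difflib.SequenceMatcher`, the `STOP_WORDS` list and all function names are assumptions made for illustration. The `ocr` string is adapted from the "Good" sample on slide 12, and `truth` is an assumed correction of it.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list, used only to illustrate "significant words".
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to", "we"}


def _matched(truth, ocr):
    """Count items of `truth` that SequenceMatcher aligns with `ocr`."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def _words(text, significant_only=False):
    """Lower-case words with surrounding punctuation stripped."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if significant_only:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def char_accuracy(truth, ocr):
    """Share of ground-truth characters recovered by the OCR text."""
    return _matched(truth, ocr) / len(truth) if truth else 1.0


def word_accuracy(truth, ocr, significant_only=False):
    """Share of ground-truth words (optionally non-stop-words only) recovered."""
    truth_words = _words(truth, significant_only)
    ocr_words = _words(ocr, significant_only)
    if not truth_words:
        return 1.0
    return _matched(truth_words, ocr_words) / len(truth_words)


# Assumed ground truth versus OCR output adapted from the slide 12 sample.
truth = "the true Jackson money is now in circulation"
ocr = "the true Jackson money is now ia circulation"

print("character accuracy:", char_accuracy(truth, ocr))
print("word accuracy:", word_accuracy(truth, ocr))
print("significant-word accuracy:", word_accuracy(truth, ocr, significant_only=True))
```

On this fragment a single wrong letter barely dents the character score, costs a whole word in the word score, and does not touch the significant-word score, because the damaged word ("in") is a stop word that a searcher is unlikely to look for. That is one way to read the slide's point that character accuracy alone can mislead.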
  15. 16. They had the internet in 1816! The Morning Chronicle (London, England), Saturday, May 18, 1816; Issue 14678
  16. 17. and a DVD in 1803! The Morning Chronicle (London, England), Friday, June 10, 1803; Issue 10625
  17. 18. Why Good Quality OCR Matters January 1874
  18. 19. Three ways to access information <ul><li>By metadata: title, place of publication, dates of publication, issue number, number of pages, page quality rating, illustration indicator </li></ul><ul><li>By browsing: article images, page images, browse by issue or title </li></ul><ul><li>By OCR: actual text of page as rendered by the automatic OCR process </li></ul>
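
The three access routes can be pictured with a small data-structure sketch. The class names, field names, rating values and page text below are all hypothetical and are not taken from the project's actual schema; only the two issue citations (title, place, date, issue number) come from the slides above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Page:
    quality_rating: str      # hypothetical scale, e.g. "good" or "poor"
    has_illustration: bool
    ocr_text: str            # text as produced by the automatic OCR process


@dataclass
class Issue:
    title: str
    place_of_publication: str
    publication_date: date
    issue_number: int
    pages: List[Page] = field(default_factory=list)

    @property
    def number_of_pages(self) -> int:
        return len(self.pages)


def search_metadata(issues: List[Issue], title: Optional[str] = None,
                    year: Optional[int] = None) -> List[Issue]:
    """Access by metadata: filter on title and/or year of publication."""
    return [i for i in issues
            if (title is None or i.title == title)
            and (year is None or i.publication_date.year == year)]


def browse(issues: List[Issue], title: str) -> List[Issue]:
    """Access by browsing: all issues of one title in date order."""
    return sorted(search_metadata(issues, title=title),
                  key=lambda i: i.publication_date)


def search_ocr(issues: List[Issue], term: str) -> List[Tuple[Issue, int]]:
    """Access by OCR: (issue, 1-based page number) pairs whose text contains term."""
    needle = term.lower()
    return [(i, n) for i in issues
            for n, p in enumerate(i.pages, start=1)
            if needle in p.ocr_text.lower()]


# Issue details come from the slides; the page contents are invented placeholders.
issues = [
    Issue("The Morning Chronicle", "London, England", date(1816, 5, 18), 14678,
          [Page("good", False, "placeholder text about shipping arrivals")]),
    Issue("The Morning Chronicle", "London, England", date(1803, 6, 10), 10625,
          [Page("poor", True, "placeholder text about a coach accident")]),
]

print([i.issue_number for i in browse(issues, "The Morning Chronicle")])
print([(i.issue_number, n) for i, n in search_ocr(issues, "Shipping")])
```

Metadata search and browsing only reach as far as the issue or page, while the OCR search reaches into the words on the page, which is why the quality of the OCR text sets the limit on what users can find.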
  19. 20. Storage <ul><li>TIFF or JPEG2000 </li></ul>
  20. 21. Costs
  21. 22. Summary <ul><li>Access is determined by: </li></ul><ul><li>The available technology, e.g. OCR, document structure analysis </li></ul><ul><li>The size of the project – a mass production environment is a limitation; no hand tweaking </li></ul><ul><li>The source material – poor source material imposes limitations </li></ul><ul><li>This project has been a trailblazer, complex and challenging. </li></ul><ul><li>We have learnt a great deal about giving users better, quicker and fuller access to the content. </li></ul>
  22. 23. <ul><li>[email_address] </li></ul>Thank you