From the Printed Page to Discoverable Content: Library Camp Perth 2010



Published in: Technology


  1. From the Printed Page to Discoverable Content, the open source way. Steven Miles (@stevermiles). Tuesday, 18 January 2011
  2. About Me
  3. About Me: Web Application Developer @ State Library of Western Australia
  4. About Me: Web Application Developer @ State Library of Western Australia. Projects: S.L.U.R.P. (Digital Content Ingestion & Integration with LMS); PC Reservation (PC Reservations and Booking System); PLO (Public Libraries Online); Venues Bookings (Venues Booking & Reservation System); P.URL (Permanent URL)
  5. WARNING!!!! Lots of technical stuff!
  6. How can I make scanned content more discoverable? Digitisation: Capture (DIY Scanner: Dual Camera Setup, Single Camera Setup; Commercial Scanners: Document Scanners, MFDs; Existing Documents); Image Processing (Rotation, Cropping, Normalisation, Levels Correction, Multi page); OCR (Open source: Cuneiform, Tesseract, Ocropus, GOCR; Commercial: ABBYY FineReader, Acrobat; Page Layout Analysis: leptonica; Formats: hOCR, Text, XML). Indexing: Tagging (Manual, Automatic: Persons, Locations, Dates, Organisations); Metadata (Manual; Import: Z39.50, SRU/SRW, Pull from LMS); Engine (Zebra: XML, Z39.50; RDBMS: Postgres, MySQL). Search: Search Multiple Databases; Results (Facets, Page Previews, Ranked, Sortable, Filters); Expose (Web APIs; Other Library Systems: Z39.50, SRU/SRW); Web Accessible; Simple Keyword Searching; Encourage Exploration; Tagging; Advanced Search; Saved Searches; Social (Sharing, Integration). Presentation: Web Browser Accessible; Auto Updating; Downloadable PDFs; User Correctable Text; In Document Searching; Highlight Search Results; Potential Conversion to Other Formats
  7. Most common process of digitisation for public consumption: Scan / Capture; Generate PDF; OCR; Indexed by Content Management System; Link to Downloadable PDF (Uncorrected OCR) (Links only to Document). How can we do this better?
  8. Inspirational Resources: National Library of Australia - Australian Newspapers; Google Docs; Informit - Text Searchable Content
  9. Proposed Digitisation Process: Scan / Capture; Semi Auto Cropping and Rotation Correction; Optimise Each Page for OCR; OCR Pages, Retain Positional Information (hOCR); Post OCR Processing (Spell checking & correction of common OCR errors); Natural Language Processing (Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging); Store as XML; Generate Page Level XML Index Files; Add/Update XML Indexing Server; Fully Automated Process; Generate Searchable PDF; Generate Web Friendly Versions of each page; Full Text Search, Web Services & Z39.50; Downloadable PDF; Google Docs Style Interface; Individual Line Highlighting to Show search results
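The "Post OCR Processing" step could look something like the sketch below. It is only an illustration, not the code used in the prototype: it joins words that were hyphenated across line ends and then applies a small lookup table of known OCR slips. The `COMMON_FIXES` table and both function names are hypothetical; the sample errors ("thei.r", "ofa") are taken from the OCR output shown on slide 14.

```python
def join_hyphenated_lines(lines):
    """Join OCR text lines, merging words split by an end-of-line hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            # Drop the trailing hyphen and glue the next line straight on.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Hypothetical table of known OCR slips and their corrections.
COMMON_FIXES = {"thei.r": "their", "ofa": "of a"}


def fix_common_errors(text, fixes=COMMON_FIXES):
    """Replace whole whitespace-delimited tokens found in the fixes table."""
    return " ".join(fixes.get(token, token) for token in text.split())


if __name__ == "__main__":
    ocr_lines = [
        "IN a display of unity, Muslims and Chris-",
        "tians gathered at Dianella Uniting Church",
        "last Thursday to share thei.r experiences",
        "and pray for peace.",
    ]
    print(fix_common_errors(join_hyphenated_lines(ocr_lines)))
```

Note that this naive join would also merge a genuine hyphenated compound that happens to fall at a line end, so a real pipeline would probably check the joined word against a dictionary first.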
  10. Available Open Source Projects: OCRopus - Page Layout Analysis; Tesseract OCR - OCR; ImageMagick - Image Processing; Index Data Zebra - XML Indexing; Index Data Pazpar2 - Federated Search; Existing Web Technologies - PHP, HTML, CSS etc.
  11. DIY Book Scanner Project
  12. Basic Architecture: Discovery Layer (PHP, HTML, CSS); Federated Search using Pazpar2 (Z39.50, SRU, SRW); Full Text Search (Zebra - XML Indexer, via Z39.50); LMS & External Databases (existing, via Z39.50); XML Data Files (MARC, Dublin Core, OAI-PMH); Document Viewer / Editor (PHP, HTML, CSS); Ingest / Digitisation (PHP, HTML, CSS); OCR & NLP (Document Processing, OCR & Natural Language Processing); Downloadable Version (Automatic Generation of Searchable PDF, Text Files etc., Updated from User Alterations); External Resources; Crowdsourcing OCR Corrections & Possible translation of handwritten documents
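As a rough idea of how a discovery layer might query a full text index over SRU, here is a small URL builder. The parameter names (`version`, `operation`, `query`, `startRecord`, `maximumRecords`) come from the SRU 1.1 searchRetrieve request; the host, port, database name and the `dc.title` index are assumptions for illustration and depend entirely on how the server is configured.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, count=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,          # CQL, e.g. dc.title=peace
        "startRecord": start,        # SRU record positions start at 1
        "maximumRecords": count,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real server and database.
    print(sru_search_url("http://localhost:9999/pages", "dc.title=peace"))
```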
  13. Converting Images for OCR: Convert to Grayscale; Generate Text Image Mask; Clean Up Background Noise; Combined; OCR Version. OCRopus: Page Layout Analysis. ImageMagick: Image Manipulation
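The grayscale and mask steps are done with ImageMagick and OCRopus in the talk. Purely to show the idea, the sketch below does a luminance conversion and a fixed global threshold on a tiny in-memory pixel grid. The function names and the threshold of 128 are assumptions, and real page images would need a proper imaging library and probably adaptive thresholding.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def text_mask(gray_rows, threshold=128):
    """Mark dark pixels (likely ink) as 1 and light background as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


if __name__ == "__main__":
    # A 1x3 "image": white paper, black ink, mid-grey smudge.
    page = [[(255, 255, 255), (0, 0, 0), (200, 200, 200)]]
    print(text_mask(to_grayscale(page)))
```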
  14. Images to Text. Image for OCR Processing. Tesseract OCR to HOCR File:

          <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
          <html><head><title></title>
          <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
          <meta name='ocr-system' content='tesseract'></head>
          <body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
          <div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
          <p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230"><span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span> <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span></span></p>
          </div>
          <div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
          <p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882"><span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span> <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span> <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span></span></p></div>
          <div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404"><p class='ocr_par'> <span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934"><span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span> <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span> <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span

      Convert HOCR to XML for Storage:

          <document><metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata>
          <pages><page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
          <page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
          <paragraph><line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line></paragraph>
          <paragraph><line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line></paragraph>
          <paragraph><line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line><line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line><line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line><line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line></paragraph>
          <paragraph><line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line></paragraph>
          <paragraph><line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line><line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line><line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line></paragraph>
          <paragraph><line id="line_1_11" top="2131" left="79" width="451" height="27">“Oh People! Behold, we have created you</line></paragraph>
          <paragraph><line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line></paragraph>
          <paragraph><line id="line_1_13" top="2187" left="79" width="451" height="25">“And we have made you into nations</line></paragraph>
          <paragraph><line id="line_1_14" top="2214" left="46"

      Sample Auto Generate Tags: IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
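The hOCR to XML conversion above boils down to turning each `ocr_line` bounding box (`bbox x0 y0 x1 y1`) into `top`, `left`, `width` and `height` attributes, for example `bbox 47 1855 241 1882` becomes top 1855, left 47, width 194, height 27. A minimal sketch of that mapping is below. It is not the prototype's converter: the class and function names are made up, paragraph grouping is left out, and it only handles `span` based lines like the Tesseract output shown.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect id, bbox and text for every ocr_line span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished lines
        self.current = None  # line being collected, or None
        self.depth = 0       # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        if self.current is not None:
            self.depth += 1  # a word span nested inside the line
        elif attributes.get("class") == "ocr_line":
            match = BBOX.search(attributes.get("title") or "")
            if match:
                self.current = {
                    "id": attributes.get("id") or "",
                    "bbox": tuple(int(n) for n in match.groups()),
                    "text": [],
                }
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.current is not None:
            self.depth -= 1
            if self.depth == 0:  # the ocr_line span itself just closed
                self.lines.append(self.current)
                self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"].append(data)


def hocr_lines_to_xml(hocr):
    """Return a <page> element (as a string) with one <line> per hOCR line."""
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    page = ET.Element("page")
    for line in parser.lines:
        x0, y0, x1, y1 = line["bbox"]
        element = ET.SubElement(
            page, "line", id=line["id"], top=str(y0), left=str(x0),
            width=str(x1 - x0), height=str(y1 - y0),
        )
        # Collapse the whitespace between word spans to single spaces.
        element.text = " ".join("".join(line["text"]).split())
    return ET.tostring(page, encoding="unicode")
```

Keeping the pixel position of each line is what later makes individual line highlighting possible in the document viewer.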
  15. Demo
  16. Prototype Interface for Ingesting Pages from Book Scanner
  17. Perform Basic Image Rotation and Cropping. Rotation and Cropping can be replicated to other pages
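Replicating one page's rotation and crop to other pages could be as simple as building the same ImageMagick command for each file. The sketch below only builds the argument lists and does not run anything. `-rotate`, `-crop` (with a `WxH+X+Y` geometry) and `+repage` are standard ImageMagick options; the function names, the output file names and the choice to rotate before cropping are assumptions. The rotate and crop values are the ones stored for page 1 in the XML on slide 14.

```python
def convert_command(src, dst, rotate, crop):
    """Build an ImageMagick `convert` argument list (not executed here)."""
    return [
        "convert", src,
        "-rotate", str(rotate),  # degrees clockwise
        "-crop", crop,           # geometry string such as 2161x3247+374+274
        "+repage",               # reset the canvas offset left by the crop
        dst,
    ]


def replicate_settings(pages, rotate, crop):
    """Apply one page's rotation and crop to a list of (src, dst) pairs."""
    return [convert_command(src, dst, rotate, crop) for src, dst in pages]


if __name__ == "__main__":
    # Hypothetical file names for two pages from the same camera.
    pages = [("odd/IMG_0946.JPG", "out/1.png"), ("odd/IMG_0948.JPG", "out/3.png")]
    for command in replicate_settings(pages, 91, "2161x3247+374+274"):
        print(" ".join(command))
```

With ImageMagick installed, each list could be passed to `subprocess.run(command, check=True)`.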
  18. Prototype Search Pages. The items to the left of the results are the auto-generated facets, based on the natural language processing tags
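One way the facets might be derived from the tagger output is to pull out the bracketed `[LABEL text ]` spans, as in the tagged sample on slide 14, and group the values by label. This is a sketch under that assumption about the output format; the function name is hypothetical and it does not handle nested brackets.

```python
import re
from collections import defaultdict

# Matches spans such as "[ORG Dianella Uniting Church ]".
TAGGED_SPAN = re.compile(r"\[(\w+)\s+([^\]]+)\]")


def extract_facets(tagged_text):
    """Group tagged entity strings by label, keeping first-seen order."""
    facets = defaultdict(list)
    for label, value in TAGGED_SPAN.findall(tagged_text):
        value = value.strip()
        if value not in facets[label]:
            facets[label].append(value)
    return dict(facets)


if __name__ == "__main__":
    sample = (
        "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , "
        "tians gathered at [ORG Dianella Uniting Church ] , last Thursday"
    )
    print(extract_facets(sample))
```

The `Chris-` value in the sample shows why words split across line ends are worth rejoining before the text is tagged.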
  19. Viewing Document Pages
  20. Viewing Document Pages with Highlighted Results
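Because each stored line keeps its position on the page, highlighting a search hit can be a matter of finding the matching lines and scaling their boxes to the size of the displayed page image. The sketch below assumes lines are available as dictionaries with `text`, `top`, `left`, `width` and `height` keys (mirroring the XML attributes on slide 14); the function name and the `scale` parameter are made up for illustration.

```python
def highlight_boxes(lines, query, scale=1.0):
    """Return scaled boxes for every line whose text contains the query."""
    needle = query.lower()
    boxes = []
    for line in lines:
        if needle in line["text"].lower():
            boxes.append({
                key: round(line[key] * scale)
                for key in ("top", "left", "width", "height")
            })
    return boxes


if __name__ == "__main__":
    lines = [
        {"text": "tians gathered at Dianella Uniting Church",
         "top": 1937, "left": 45, "width": 486, "height": 26},
        {"text": "and pray for peace.",
         "top": 1993, "left": 45, "width": 212, "height": 24},
    ]
    # Boxes for a page image shown at a quarter of the scanned size.
    print(highlight_boxes(lines, "peace", scale=0.25))
```

Each returned box could then be drawn as an absolutely positioned, semi-transparent element over the page image.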
  21. Editing Document with Auto Updating of Indexer
  22. Pazpar2 can be used to build alternative interfaces for searching multiple existing catalogs
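An alternative interface talks to Pazpar2 through its HTTP web service, where each request is a `command` plus parameters: typically `init` to obtain a session, `search` to start a query across the configured targets, and `show` to fetch merged results. The helper below just builds those URLs. The host, port and session value are placeholders, and the exact path depends on how Pazpar2 is deployed.

```python
from urllib.parse import urlencode


def pazpar2_url(base_url, command, **params):
    """Build a Pazpar2 web service URL for the given command."""
    query = {"command": command}
    query.update(params)
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    base = "http://localhost:8004/search.pz2"  # hypothetical deployment
    print(pazpar2_url(base, "init"))
    # The session id would come from the XML response to the init request.
    print(pazpar2_url(base, "search", session="1", query="peace"))
    print(pazpar2_url(base, "show", session="1", start=0, num=20))
```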
  23. Questions?
  24. More Info & Credits: Tesseract-OCR; OCRopus; Do-It-Yourself Book Scanning; CHDK - Canon Hack Development Kit; Zebra - XML Indexing; PazPar2 - Federated Search; Cuneiform; EyeFi Python Server (standalone-server.html/); hOCR - HTML OCR; OpenNLP; Illinois Named Entity Tagger