8. http://vajirayana.org
Online sources of Thai books
• full library features
• TH/FR/EN
• larger collection
• fewer features
• TH books only
• text format
• incomplete books
• Office of Academic Resources, Chulalongkorn University
• Thammasat University Library
• Chiang Mai University Library
• Digital repository, Fine Arts Department
• National Library of Thailand, regional branches
• Princess Maha Chakri Sirindhorn Anthropology Centre (Public Organisation)
• Wikisource
• Ruern Thai
9. Book Digitisation
I. PDF files that contain text (the text can be highlighted)
- Copy/paste, or use a pdf2text tool
- Find/replace encoded or unrecognised symbols
- Use a VBA script to replace symbols that find/replace cannot handle
II. Scanned files or photographs
- OCR with Tesseract
- Output files in .txt or .docx
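The symbol find/replace step for text copied out of a PDF could also be scripted outside Word. The following is a minimal Python sketch, not the project's own VBA script. The table `SYMBOL_MAP` and the function names `clean_extracted_text` and `unknown_symbols` are hypothetical and only illustrate the idea of a recorded replacement table plus a report of characters that still need manual attention.

```python
import unicodedata

# Hypothetical replacement table: each key is a symbol that PDF text
# extraction may leave behind, each value is the intended plain text.
SYMBOL_MAP = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u00a0": " ",   # NO-BREAK SPACE
}


def clean_extracted_text(text, symbol_map=SYMBOL_MAP):
    """Apply every recorded replacement, then normalise to NFC."""
    for bad, good in symbol_map.items():
        text = text.replace(bad, good)
    return unicodedata.normalize("NFC", text)


def unknown_symbols(text):
    """Return private-use or unassigned characters for manual review."""
    return {ch for ch in text if unicodedata.category(ch) in ("Co", "Cn")}
```

Characters reported by `unknown_symbols` are candidates for new entries in the table once a proofreader has decided what each one should become.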
10. OCR Workflow
1. Image Preprocessing
- Convert pdf to jpg
- Split pages and clean up
2. OCR
- Tesseract 4.0
- Output files in .txt, .docx
3. Proof Correction
- Autocorrection scripts
- Human proofreading
- Format as HTML
11. Step 1: Image Preprocessing
• For better OCR results, use images that are 300 dpi, clear, black and white, with no watermark and no book border.
• Convert pdf to jpg/tif: ImageMagick convert
• ImageMagick textcleaner script (crop, sharpen, convert to black and white, rotate, clean up)
• ScanTailor (split pages and clean up)
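The pdf-to-image conversion above can be sketched as a small Python wrapper around ImageMagick. This is an illustration under stated assumptions, not the project's actual command: the 50% threshold, the output name pattern and the function names are hypothetical, and the `convert` executable is assumed to be ImageMagick 6 with Ghostscript available for reading PDFs (ImageMagick 7 uses `magick` instead).

```python
import subprocess
from pathlib import Path


def pdf_to_tif_command(pdf, out_dir, dpi=300):
    """Build an ImageMagick command that renders each PDF page to a TIFF."""
    return [
        "convert",
        "-density", str(dpi),      # rasterise the PDF at the target resolution
        str(pdf),
        "-colorspace", "Gray",     # drop colour before thresholding
        "-threshold", "50%",       # assumed cut-off for black and white
        str(Path(out_dir) / "page-%03d.tif"),  # one numbered file per page
    ]


def run(cmd):
    """Execute a prepared command and fail loudly on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to print and inspect the command before processing a whole book.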
12. OCR Engines
Feature | ABBYY FineReader | Tesseract 4.0 | OCRopus
OS | Windows, Mac OS X | Windows, Linux, Mac OS X | FreeBSD, Linux, Mac OS X
User interface | GUI (with preprocessing, language detection and output formats) | CLI | CLI
Glyph training | Limited | Requires a large dataset | Tools provided
License | Commercial, closed source | Apache License v2.0 | Apache License v2.0
Developed by | A Russia-based company | Google | German Research Centre for Artificial Intelligence
Thai language | Yes | Yes | No
13. Tesseract Open Source OCR Engine
• Originally developed at HP; since 2006 it has been developed by Google.
• Can recognise more than 100 languages (incl. Thai)
• For Thai, results from the beta version 4.0 (LSTM-based) are much better than those from the stable version 3.
• The better the image quality, the better the results.
• Can be trained to recognise other languages.
• Offers basic command-line usage and an API for developers.
• https://github.com/tesseract-ocr/tesseract
14. Step 2: Running Tesseract OCR
• Run the Tesseract command for all page images in a folder
- To fix extra spaces in the result, use the option preserve_interword_spaces=1
tesseract thatest.jpg thatest -l tha --psm 1 --oem 1 -c preserve_interword_spaces=1 txt
- To fix extra lines caused by top vowels, increase the line height with the option textord_min_linesize=3.25
tesseract IMG_5339_L.tif IMG_5339 -l tha --psm 1 --oem 1 -c textord_min_linesize=3.25 txt
- Multiple languages
tesseract 186.jpg 186 -l tha+eng --psm 1 --oem 1 -c textord_min_linesize=3.25 txt
For more Tesseract command options
tesseract --print-parameters
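Running the command for every page image in a folder could be scripted along these lines. This is a hedged Python sketch: the flags are the ones shown on this slide, while the function names, the list of image suffixes and the `dry_run` switch are assumptions added for illustration.

```python
import subprocess
from pathlib import Path

# Assumed set of page-image extensions to pick up from the folder.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def tesseract_command(image, lang="tha"):
    """Build one tesseract call using the options from the slide."""
    image = Path(image)
    out_base = image.with_suffix("")  # tesseract appends .txt itself
    return [
        "tesseract", str(image), str(out_base),
        "-l", lang, "--psm", "1", "--oem", "1",
        "-c", "preserve_interword_spaces=1",
        "-c", "textord_min_linesize=3.25",
        "txt",
    ]


def ocr_folder(folder, lang="tha", dry_run=False):
    """Run tesseract on every image in a folder, in file-name order."""
    commands = [
        tesseract_command(path, lang)
        for path in sorted(Path(folder).iterdir())
        if path.suffix.lower() in IMAGE_SUFFIXES
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_folder("pages", lang="tha+eng", dry_run=True)` only returns the prepared commands, which is a convenient way to check them before starting a long OCR run.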
15. Step 3: Proof Correction
• Auto-correction with MS Word VBA scripts: regular expressions and recorded find/replace words
• Manual proofreading against the 1st edition book; record the replaced words for future auto-correction.
• Annotate page numbers
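The auto-correction idea above (regular expressions plus a growing list of recorded replacements) can be illustrated in Python. This is a sketch only; the project itself uses MS Word VBA scripts, and the rules, the sample word pair and the function names below are hypothetical.

```python
import re

# Hypothetical recorded pairs (wrong -> right) collected while proofreading.
RECORDED_REPLACEMENTS = {
    "tbe": "the",
}

# Illustrative regular-expression rules, applied in order.
REGEX_RULES = [
    (re.compile(r"[ \t]{2,}"), " "),             # collapse runs of spaces
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),  # strip trailing spaces
    (re.compile(r"\n{3,}"), "\n\n"),             # keep at most one blank line
]


def autocorrect(text, replacements=RECORDED_REPLACEMENTS, rules=REGEX_RULES):
    """Apply the regex rules first, then the recorded word replacements."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    return text


def record_replacement(replacements, wrong, right):
    """Store a correction found during manual proofreading for later runs."""
    replacements[wrong] = right
```

Plain substring replacement can also hit a longer word that happens to contain the recorded string, so each recorded pair should be specific enough to be safe when applied to a whole book.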