8. http://vajirayana.org
Online sources of Thai books
• full library features
• TH/FR/EN
• larger collection
• fewer features
• TH books only
• text format
• incomplete books
• Office of Academic Resources, Chulalongkorn University
• Thammasat University Library
• Chiang Mai University Library
• Digital repository, Fine Arts Department
• National Library of Thailand, regional branches
• Princess Maha Chakri Sirindhorn Anthropology Centre (Public Organisation)
• Wikisource
• Ruern Thai
9. http://vajirayana.org
Book Digitisation
I. Text-based pdf files (text can be highlighted)
- copy/paste or pdf2text tool
- Find/Replace encoded or unrecognised symbols
- Use a VBA script to replace symbols that Find/Replace cannot handle
II. Scanned files or photographs
- OCR with Tesseract
- Output files in .txt or .docx
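The symbol clean-up for text-based pdf files can also be sketched in Python instead of VBA. This is a minimal illustration, not the project's actual script: the names `SYMBOL_MAP`, `replace_symbols` and `find_unmapped` are hypothetical, and the table entries are assumptions that would have to be replaced with the symbols actually seen in a given pdf.

```python
# Hypothetical symbol table: fill it in from the symbols found in each pdf.
SYMBOL_MAP = {
    "\ufb01": "fi",      # Latin "fi" ligature
    "\u00a0": " ",       # non-breaking space
    "\uf70a": "\u0e48",  # assumed: private-use glyph standing in for Thai mai ek
}


def replace_symbols(text, symbol_map=SYMBOL_MAP):
    """Replace each mis-encoded symbol with its intended character(s)."""
    return text.translate(str.maketrans(symbol_map))


def find_unmapped(text):
    """List private-use characters still present, as candidates for the table."""
    return sorted({ch for ch in text if "\ue000" <= ch <= "\uf8ff"})
```

Running `find_unmapped` on the cleaned text gives a quick list of symbols that still need an entry in the table.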
10. http://vajirayana.org
OCR Workflow
1. Image Preprocessing
- Convert pdf to jpg
- Page split and clean up
2. OCR
- Tesseract 4.0
- Output files in .txt, .docx
3. Proof Correction
- Autocorrection scripts
- Human proofreading
- Format as html
11. http://vajirayana.org
1. Image Preprocessing
• For better OCR results, use images that are 300 dpi, clear and black and white, with no watermark and no book border.
• Convert pdf to jpg/tif: ImageMagick convert
• ImageMagick textcleaner (crop, sharpening, b&w,
rotate, clean up)
• ScanTailor (split pages and clean up)
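The pdf-to-image conversion can be scripted. The sketch below wraps ImageMagick's `convert` with `-density 300` to match the 300 dpi recommendation. The function names and the output naming pattern are assumptions made for illustration, and ImageMagick 7 installations may expose the command as `magick` instead of `convert`.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, dpi=300, ext="jpg"):
    """Build an ImageMagick command that renders each pdf page as an image."""
    # %03d is expanded by ImageMagick into a zero-padded page index.
    out_pattern = Path(out_dir) / f"{Path(pdf_path).stem}-%03d.{ext}"
    # -density must come before the input file so it applies to rasterisation.
    return ["convert", "-density", str(dpi), str(pdf_path), str(out_pattern)]


def pdf_to_images(pdf_path, out_dir, dpi=300, ext="jpg"):
    """Create the output folder and run the conversion (needs ImageMagick)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(pdf_path, out_dir, dpi, ext), check=True)
```

For example, `pdf_to_images("book.pdf", "pages", ext="tif")` would write one tif file per page into `pages/`, ready for clean-up in textcleaner or ScanTailor.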
12. http://vajirayana.org
OCR Engines
| | ABBYY FineReader | Tesseract 4.0 | OCRopus |
|---|---|---|---|
| OS | Windows, Mac OS X | Windows, Linux, Mac OS X | FreeBSD, Linux, Mac OS X |
| User Interface | GUI (with preprocessing, language detection and output formats) | CLI | CLI |
| Glyph Training | Limited | Required large dataset | Tools provided |
| License | Commercial, Closed source | Apache License v2.0 | Apache License v2.0 |
| Developed by | A Russian based company | Google | German Research Centre for Artificial Intelligence |
| Thai language | Yes | Yes | No |
13. http://vajirayana.org
Tesseract Open Source OCR Engine
• Originally developed at HP; since 2006 it has been developed by Google.
• Can recognise more than 100 languages (incl. Thai)
• For Thai, results from the beta version 4.0 (LSTM-based) are much better than those from the stable version 3.
• Better image quality gives better results.
• Can be trained to recognise other languages.
• Offers basic command-line usage and an API for developers.
• https://github.com/tesseract-ocr/tesseract
14. http://vajirayana.org
2. Running Tesseract OCR
• Run the Tesseract command for every page image in a folder
- To fix extra spaces in the result, use the option preserve_interword_spaces=1
tesseract thatest.jpg thatest -l tha --psm 1 --oem 1 -c preserve_interword_spaces=1 txt
- To fix extra lines caused by top vowels, increase the line height with the option textord_min_linesize=3.25
tesseract IMG_5339_L.tif IMG_5339 -l tha --psm 1 --oem 1 -c textord_min_linesize=3.25 txt
- Multiple languages
tesseract 186.jpg 186 -l tha+eng --psm 1 --oem 1 -c textord_min_linesize=3.25 txt
For more Tesseract command options
tesseract --print-parameters
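Running the command for every page image in a folder can be automated with a small wrapper. The sketch below builds the same command line as the examples on this slide and loops over a folder. The function names and the list of image suffixes are assumptions, and the `tesseract` binary with the `tha` language data is assumed to be installed.

```python
import subprocess
from pathlib import Path

# Assumed set of page-image extensions; adjust to the files in the folder.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def build_tesseract_command(image_path, lang="tha", options=None):
    """Build a Tesseract command using the flags shown on this slide."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # Tesseract appends .txt itself
    cmd = ["tesseract", str(image_path), str(output_base),
           "-l", lang, "--psm", "1", "--oem", "1"]
    for name, value in (options or {}).items():
        cmd += ["-c", f"{name}={value}"]  # e.g. preserve_interword_spaces=1
    cmd.append("txt")  # output format config, as in the slide's commands
    return cmd


def ocr_folder(folder, lang="tha", options=None):
    """Run Tesseract on every page image in the folder, in name order."""
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            subprocess.run(build_tesseract_command(image, lang, options), check=True)
```

A call such as `ocr_folder("pages", lang="tha+eng", options={"textord_min_linesize": 3.25})` would write one .txt file next to each image.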
15. http://vajirayana.org
3. Proof Correction
• Auto-correction with MS Word VBA scripts: regular expressions and recorded find/replace words
• Manually proofread against the 1st edition of the book, and record the replaced words for future auto-correction.
• Annotate page numbers
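The auto-correction idea (regular expressions plus recorded find/replace pairs) is sketched below in Python rather than the MS Word VBA used in the project. Both the rules and the recorded pair are hypothetical examples of typical OCR clean-up, not the project's actual list.

```python
import re

# Hypothetical recorded pair, of the kind collected during manual proofreading.
RECORDED_PAIRS = {
    "กรุงเทพ ฯ": "กรุงเทพฯ",
}

# Assumed regex rules for common OCR noise in Thai text.
REGEX_RULES = [
    # Collapse runs of spaces or tabs into a single space.
    (re.compile(r"[ \t]{2,}"), " "),
    # Drop spaces placed before Thai above/below vowels and tone marks.
    (re.compile(r" +(?=[\u0e31\u0e34-\u0e3a\u0e47-\u0e4e])"), ""),
]


def autocorrect(text):
    """Apply the regex rules first, then the recorded find/replace pairs."""
    for pattern, replacement in REGEX_RULES:
        text = pattern.sub(replacement, text)
    for wrong, right in RECORDED_PAIRS.items():
        text = text.replace(wrong, right)
    return text
```

Each correction found during proofreading can be added to `RECORDED_PAIRS`, so later books benefit from earlier passes.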
16. http://vajirayana.org
Website technical details
• CMS: Drupal 7 with built-in Book Module
• html2book: automatically breaks the text into chapters based on Word heading styles
• Google Custom Search
• Formatting text: footnotes (bigfootJS), waxing and waning lunar dates (CSS), Thai currency units (+), braces spanning multiple lines ( } ) (MathJax)
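Chapter breaking based on heading styles can be pictured with a short sketch. It does not reproduce html2book, whose internals are not described here. It assumes the Word document was exported to HTML with the chapter heading style mapped to `<h1>`, and the function name `split_chapters` is hypothetical.

```python
import re

# Zero-width split point in front of every <h1 ...> tag (needs Python 3.7+).
CHAPTER_START = re.compile(r"(?=<h1\b)", re.IGNORECASE)
HEADING = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_chapters(html):
    """Return (title, html_fragment) pairs, one per <h1>-delimited chapter."""
    chapters = []
    for part in CHAPTER_START.split(html):
        if not part.strip():
            continue
        match = HEADING.match(part)
        # Strip inline tags from the heading; front matter has an empty title.
        title = re.sub(r"<[^>]+>", "", match.group(1)).strip() if match else ""
        chapters.append((title, part))
    return chapters
```

Each returned fragment could then be saved as one page of a Drupal book; that import step is outside this sketch.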