Vajirayana Digital Library Introduction

Developing the Vajirayana Digital Library
มณฑล กาญจโนฬาร
Faculty of Arts, Chulalongkorn University, October 2018 (B.E. 2561)
http://vajirayana.org | vajirayana.org@gmail.com

Project Information
• Started in 2014 (B.E. 2557) by preparing and publishing 60 important books in honour of Her Royal Highness Princess Maha Chakri Sirindhorn on the occasion of her fifth-cycle (60th) birthday, 2 April 2015 (B.E. 2558)
• Supported by the Office of Literature and History, Fine Arts Department, which selected the books and provided the source texts
• Three project contributors
• 188 titles currently published in the project
• Target audience: school and university students (under 24 years old)
• Published as text rather than page images, so the books are easy to access on any device and are searchable

Text Format and Image Format
Text format: suitable for general reading and research
• Loads quickly
• Easy to use on any device, such as a mobile phone
• Usable by visually impaired readers through a screen reader
• Easy to search; links can be created
Image format: suitable for preservation, citation, and classroom use
• Accurate and complete
• Takes little time to produce
• Cited by page number

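Published Books
• Verse: บทละครเรื่องรามเกียรติ์ (Ramakien dance-drama, reign of King Rama I), บทละครเรื่องอิเหนา (Inao dance-drama, reign of King Rama II), all works of Sunthorn Phu known to date, เสภาเรื่องขุนช้างขุนแผน (Khun Chang Khun Phaen), สมุทรโฆษคำฉันท์, พระนลคำหลวง, ประชุมเพลงยาว, กลอนสวด
• History and customs: พระราชพงศาวดารกรุงรัตนโกสินทร์ (Royal Chronicles of Rattanakosin, reigns of King Rama I to King Rama V), ประชุมพระราชนิพนธ์/ประกาศรัชกาลที่ ๔ (collected writings and proclamations of King Rama IV), ไกลบ้าน, พระราชพิธีสิบสองเดือน (Royal Ceremonies of the Twelve Months)
• Religion: ไตรภูมิกถา, มหาชาติคำหลวง
• Translated Chinese historical novels: สามก๊ก (Romance of the Three Kingdoms), เลียดก๊ก, ไซฮั่น, ซ้องกั๋ง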

Books in Preparation
• Other categories: แพทยศาสตร์สงเคราะห์ (medicine), ตำราโหร (astrology manuals), ตำรากับข้าว (cookbooks), children's books and school primers
• Regional literature: พื้นเวียงจันทน์, นายดั่นวันคาร, โคลงอุสาบารส...
• Writings of King Chulalongkorn (Rama V), writings of Prince Damrong Rajanubhab, สาส์นสมเด็จ
• Thai novels and short stories from the reign of King Rama VII

Book Selection for the Project
• The Office of Literature and History helped select the first 60 books (รามเกียรติ์, อิเหนา, ขุนช้างขุนแผน)
• Books honoured by วรรณคดีสโมสร (the Literary Society) and the "100 books Thais should read" list of สกว. (Thailand Research Fund) (โคลงกลอนของครูเทพ, หนังสือแสดงกิจจานุกิจ)
• Books mentioned in books already prepared (โคลงนิราศหริภุญชัย, จดหมายหลวงอุดมสมบัติ)
• Books recommended by users (สรรพสิทธิ์คำฉันท์, ไตรภูมิกถา, ประชุมปกรณัม, โคลงนิราศพระพิพิธสาลี)
• Books published by the Fine Arts Department (ประชุมสุภาษิตสอนหญิง, ประชุมวรรณคดีเรื่องพระพุทธบาท)
• Writings of King Rama II, all works of Sunthorn Phu

Usage Statistics
• In August 2018 (B.E. 2561) the site had 72,000 users, 43% of them aged 18-24
• Devices: 59% mobile, 38% desktop, 3% tablet

Online Sources of Thai Books
• full library features

• TH/FR/EN
• larger collection

• fewer features

• TH books only
• text format

• incomplete books
• Office of Academic Resources, Chulalongkorn University
• Thammasat University Library
• Chiang Mai University Library
• Digital Repository, Fine Arts Department
• National Library of Thailand, regional branches
• Princess Maha Chakri Sirindhorn Anthropology Centre (Public Organisation)
• Wikisource

• Ruern Thai

Book Digitisation
I. Text-based PDF files (text can be highlighted)
- Copy/paste or use a pdf2text tool
- Find/replace encoded or unrecognised symbols
- Use a VBA script to replace symbols that find/replace cannot handle
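The symbol clean-up step can be sketched in Python rather than VBA. This is only an illustration: the replacement table is a hypothetical sample (two Latin ligatures and the soft hyphen, which often survive PDF text extraction), and the function names are assumptions. A real table would be built from the symbols actually found in each extracted file.

```python
# Minimal sketch: replace symbols left behind by PDF text extraction.
# REPLACEMENTS is a hypothetical sample table, not the project's actual list.
REPLACEMENTS = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u00ad": "",    # SOFT HYPHEN
}


def replace_symbols(text, table=REPLACEMENTS):
    """Apply every literal replacement in the table to the text."""
    for symbol, replacement in table.items():
        text = text.replace(symbol, replacement)
    return text


def private_use_symbols(text):
    """Collect Private Use Area characters, which usually come from
    font-specific encodings and need a manual mapping decision."""
    return {ch for ch in text if "\ue000" <= ch <= "\uf8ff"}


print(replace_symbols("OCR work\ufb02ow, output \ufb01les"))
```

Listing the Private Use Area characters first shows which symbols still need an entry in the table before the replacement pass is run.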
II. Scanned files or photographs
- OCR with Tesseract
- Output files in .txt or .docx

OCR Workflow
1. Image Preprocessing
- Convert pdf to jpg
- Split pages and clean up
2. OCR
- Tesseract 4.0
- Output files in .txt, .docx
3. Proof Correction
- Autocorrection scripts
- Human proofreading
- Format as html

1. Image Preprocessing
• Images that give better OCR results: 300 dpi, sharp, black and white, no watermark, no page border.

• Convert pdf to jpg/tif: ImageMagick convert

• ImageMagick textcleaner (crop, sharpen, b&w, rotate, clean up)
• ScanTailor (split pages and clean up)
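The PDF-to-image conversion can be sketched as a small Python helper that builds the ImageMagick command. The function name, folder names and output naming pattern are assumptions for illustration. `-density 300` asks ImageMagick to render pages at 300 dpi; ImageMagick 7 installs the tool as `magick` rather than `convert`, and PDF input normally relies on Ghostscript being installed.

```python
import shlex
from pathlib import Path


def pdf_to_images_command(pdf, out_dir, dpi=300, ext="jpg"):
    """Build an ImageMagick command that renders each PDF page as one image.

    The %03d in the output name is expanded by ImageMagick to the page index.
    """
    pattern = Path(out_dir) / f"{Path(pdf).stem}-%03d.{ext}"
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]


# Show the command; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = pdf_to_images_command("book.pdf", "pages")
print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces intact when it is later handed to `subprocess.run`.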

OCR Engines
Engine | ABBYY FineReader | Tesseract 4.0 | OCRopus
OS | Windows, Mac OS X | Windows, Linux, Mac OS X | FreeBSD, Linux, Mac OS X
User interface | GUI (with preprocessing, language detection and output formats) | CLI | CLI
Glyph training | Limited | Requires a large dataset | Tools provided
License | Commercial, closed source | Apache License v2.0 | Apache License v2.0
Developed by | A Russia-based company | Google | German Research Centre for Artificial Intelligence
Thai language | Yes | Yes | No

Tesseract Open Source OCR Engine
• Originally developed at HP; developed by Google since 2006.

• Can recognise more than 100 languages (incl. Thai)

• For Thai, the beta version 4.0 (LSTM-based) gives much better results than the stable version 3.

• The better the image quality, the better the results.

• Can be trained to recognise other languages.

• Provides a basic command-line interface and an API for developers.

• https://github.com/tesseract-ocr/tesseract

2. Running Tesseract OCR
• Run the Tesseract command on every page image in a folder

- To fix extra spaces in the result, use the option preserve_interword_spaces=1

tesseract thatest.jpg thatest -l tha --psm 1 --oem 1 -c preserve_interword_spaces=1 txt

- To fix extra lines caused by top vowels, increase the line height with the option textord_min_linesize=3.25

tesseract IMG_5339_L.tif IMG_5339 -l tha --psm 1 --oem 1 -c textord_min_linesize=3.25 txt

- Multiple languages

tesseract 186.jpg 186 -l tha+eng --psm 1 --oem 1 -c textord_min_linesize=3.25 txt

For more Tesseract command options:

tesseract --print-parameters
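Running the command for every page image in a folder can be scripted. The following Python sketch reuses the options shown in this section. The helper names, the `pages` folder and the set of image suffixes are assumptions, and combining both `-c` options in one call is an illustrative choice, not something the slides prescribe.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def tesseract_command(image, lang="tha"):
    """Build the Tesseract call for one page image."""
    image = Path(image)
    output_base = image.with_suffix("")  # Tesseract appends .txt itself
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang, "--psm", "1", "--oem", "1",
        "-c", "preserve_interword_spaces=1",
        "-c", "textord_min_linesize=3.25",
        "txt",
    ]


def page_images(folder):
    """List the page images in a folder in file-name order."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES)


def ocr_folder(folder, lang="tha"):
    """Run Tesseract on every page image; stops at the first failure."""
    for image in page_images(folder):
        subprocess.run(tesseract_command(image, lang), check=True)
```

Calling `ocr_folder("pages", lang="tha+eng")` would write one .txt file next to each image, provided Tesseract and the language data are installed.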

3. Proof Correction
• Auto-correction with MS Word VBA scripts: regular expressions and recorded find/replace word pairs

• Manual proofreading against the first-edition book; replacement words are recorded for future auto-correction.

• Annotate page numbers
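The combination of recorded find/replace pairs and regular expressions can be sketched in Python instead of VBA. The rules below are sample assumptions chosen for illustration (two common Thai character confusions, whitespace clean-up and duplicated tone marks); they are not the project's actual correction list.

```python
import re

# Recorded literal find/replace pairs (hypothetical samples).
WORD_FIXES = {
    "\u0e40\u0e40": "\u0e41",  # two SARA E (เเ) -> one SARA AE (แ)
    "\u0e4d\u0e32": "\u0e33",  # NIKHAHIT + SARA AA -> SARA AM (ำ)
}

# Regular-expression rules, applied after the literal pairs.
REGEX_FIXES = [
    (re.compile(r"[ \t]+"), " "),                  # collapse runs of spaces/tabs
    (re.compile(r" +$", re.MULTILINE), ""),        # strip trailing spaces
    (re.compile(r"([\u0e48-\u0e4b])\1+"), r"\1"),  # drop repeated tone marks
]


def autocorrect(text):
    """Apply literal pairs first, then the regular-expression rules."""
    for wrong, right in WORD_FIXES.items():
        text = text.replace(wrong, right)
    for pattern, replacement in REGEX_FIXES:
        text = pattern.sub(replacement, text)
    return text
```

Each correction found during manual proofreading can be appended to `WORD_FIXES`, so later books benefit from earlier proofreading passes.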

Website Technical Details
• CMS: Drupal 7 with the built-in Book module

• html2book: automatically splits chapters based on Word heading styles

• Google Custom Search

• Text formatting: footnotes (bigfootJS), waxing/waning lunar dates (CSS), traditional Thai currency notation (+), braces spanning several lines ( } ) (MathJax)
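The idea of breaking a book into chapters at heading boundaries can be illustrated with a short Python sketch. This is not the html2book implementation. It assumes the Word document was exported to HTML with the top-level heading style rendered as `<h1>` elements, and the function name is hypothetical.

```python
import re

HEADING = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_chapters(html):
    """Return (title, body) pairs, starting a new chapter at every <h1>.

    Anything before the first heading is ignored in this sketch.
    """
    matches = list(HEADING.finditer(html))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        title = re.sub(r"<[^>]+>", "", match.group(1)).strip()  # drop inline tags
        body = html[match.end():end].strip()
        chapters.append((title, body))
    return chapters


sample = "<h1>One</h1><p>a</p><h1 class='x'>Two</h1><p>b</p>"
for title, body in split_chapters(sample):
    print(title, "->", body)
```

Each (title, body) pair could then become one page in a book outline. A regular expression is enough for a sketch like this, but an HTML parser is more robust for real exports.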

Observations and Problems Encountered
• Scanned files are missing pages

• Old books often spell the same word in several different ways

• Newly reprinted editions are missing text, one or two lines at a time

• MS Word does not recognise archaic words

• Search engines do not understand archaic words
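One possible way to reduce the spelling-variant problem in search, which the slides do not describe, is to normalise both the indexed text and the query to a common search key. The Python sketch below is an assumption-laden illustration: the variant table holds a single sample entry, and the obsolete letters ฃ and ฅ are folded into ข and ค.

```python
# Hypothetical variant table: archaic spelling -> modern spelling.
VARIANTS = {
    "\u0e40\u0e1b\u0e19": "\u0e40\u0e1b\u0e47\u0e19",  # เปน -> เป็น
}

# Obsolete consonants folded into their modern counterparts: ฃ -> ข, ฅ -> ค.
OBSOLETE_LETTERS = str.maketrans({"\u0e03": "\u0e02", "\u0e05": "\u0e04"})


def search_key(text):
    """Normalise text so that archaic and modern spellings compare equal."""
    text = text.translate(OBSOLETE_LETTERS)
    for old, new in VARIANTS.items():
        text = text.replace(old, new)
    return text


# The same function is applied to the stored text and to the user's query.
print(search_key("\u0e40\u0e1b\u0e19") == search_key("\u0e40\u0e1b\u0e47\u0e19"))
```

Plain substring replacement can also match inside longer words, so a real variant table would need to be checked against segmented text.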

Technical Development Work
• Faster and more accurate workflow: Tesseract model training.

• Library features: advanced search and indexing.

• UX improvement: bookmarks, text highlights and notes.

Thank you

Resources
• Tesseract OCR [https://github.com/tesseract-ocr]

• Tesseract Command Line Usage [https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage]

• ImageMagick [https://www.imagemagick.org]

• ImageMagick textcleaner [http://www.fmwconcepts.com/imagemagick/textcleaner/index.php]

• Convert pdf files: XpdfReader [http://www.xpdfreader.com/]

• ScanTailor [http://scantailor.org/]

• Footnotes: bigfoot [www.bigfootjs.com/]
