Towards a Higher Accuracy of Optical Character
Recognition of Chinese Rare Books by
Making Use of Text Models
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures
Ink Bleed and Poor Quality
Limitations (Missing and Extra Words)
(Figure: side-by-side examples of OCR output and the corresponding original page image)
Experiment: Data Collection
• Training dataset: 187 ancient medicine books from
the Scripta Sinica Database (about 40 million words)
• Testing dataset: 1 relevant ancient medicine book,
titled “ ”, with a total of 185,000 words
• The OCR results contain about 180,000 correct words
and about 5,000 incorrect words, i.e. an accuracy of
about 97.3% (180,000 / 185,000 ≈ 0.973)
Experiment: Building an N-gram Model
• Relies on the sequence of words in the training
dataset: given the preceding words, the model
outputs the candidate with the highest frequency
(see the sketch after this list).
• " "
– 2-gram: input to predict " "
– 3-gram: input to predict " "
– 4-gram: input to predict " "
– ...
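The slides elide the Chinese examples, so the following is only a minimal character-level sketch of how such a frequency-based N-gram predictor could be built; the function names and data layout are assumptions, not the authors' code.

```python
from collections import Counter, defaultdict

def build_ngram_model(corpus, n):
    """For every (n-1)-character context, count how often each
    character follows it in the training text."""
    model = defaultdict(Counter)
    for line in corpus:
        for i in range(len(line) - n + 1):
            context, nxt = line[i:i + n - 1], line[i + n - 1]
            model[context][nxt] += 1
    return model

def predict(model, context):
    """Return the highest-frequency character seen after `context`,
    or None if the context never occurred in training."""
    counts = model.get(context)
    return counts.most_common(1)[0][0] if counts else None

# Usage: a 4-gram model predicts the 4th character from the 3 before it.
model = build_ngram_model(["the cat sat on the mat"], n=4)
print(predict(model, " ca"))  # -> 't'
```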
Experiment: Building a
Backward and Forward N-gram Model
• Relies on the sequences of backward and forward
words in the training dataset, again picking the
highest-frequency output.
• The backward and forward N-grams are kept as two
separate sets; a correction is adopted only when both
directions predict the same word (see the sketch
after this list).
• " "
– Backward 4-gram: input to predict " "
– Forward 4-gram: input to predict " "
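Under our reading of the agreement rule above, the two directions might be combined as in this minimal sketch, reusing the hypothetical `build_ngram_model` and `predict` helpers from the previous sketch:

```python
def build_bf_models(corpus, n=4):
    """Forward model: predict a character from the n-1 characters
    before it. Backward model: the same construction on the reversed
    text, i.e. predict a character from the n-1 characters after it."""
    forward = build_ngram_model(corpus, n)
    backward = build_ngram_model([line[::-1] for line in corpus], n)
    return forward, backward

def bf_predict(forward, backward, before, after):
    """Adopt a prediction only when both directions agree: our
    reading of the agreement rule on this slide."""
    f = predict(forward, before)        # e.g. the 3 characters before the slot
    b = predict(backward, after[::-1])  # the 3 characters after, reversed
    return f if f is not None and f == b else None
```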
Experiment: Building an LSTM Model
• Used Word2vec to project the text into a
200-dimensional vector space
• Used an LSTM network with three layers
• Picked the word with the highest softmax score
as the prediction (see the sketch after this list)
• " "
– LSTM 2-gram: input to predict " "
– LSTM 3-gram: input to predict " "
– LSTM 4-gram: input to predict " "
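The slides specify Word2vec embeddings (200 dimensions), a three-layer LSTM, and a softmax output, but no code; below is a minimal PyTorch sketch under those assumptions (the hidden size, vocabulary size, and embedding initialization are guesses):

```python
import torch
import torch.nn as nn

class LstmPredictor(nn.Module):
    """Three-layer LSTM over 200-dim embeddings; the softmax over
    the vocabulary scores each candidate next character."""
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=256):
        super().__init__()
        # In practice the embedding weights would be initialized from a
        # pretrained Word2vec model (e.g. gensim Word2Vec, vector_size=200).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                     # x: (batch, seq_len) char ids
        h, _ = self.lstm(self.embed(x))
        logits = self.out(h[:, -1, :])        # last time step only
        return torch.softmax(logits, dim=-1)

# Usage: an "LSTM 4-gram" feeds the 3 preceding characters and takes
# the highest-probability character from the softmax output.
model = LstmPredictor(vocab_size=8000)
probs = model(torch.tensor([[17, 4, 952]]))   # hypothetical character ids
pred = probs.argmax(dim=-1)
```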
The Modification of the Correctness Rate
by the N-gram Model
• The 7-gram achieves the best correction rate
The Modification of the Correctness Rate by the
Backward and Forward N-gram Model
• The backward and forward 4-gram achieves the
best correction rate
The Modification of the Correctness Rate
by the LSTM Model
• The LSTM 6-gram achieves the best correction rate
Comparison of the 7-gram, LSTM 6-gram,
and BF 4-gram Text Models

Model        | Correct OCR → wrong | Incorrect OCR → right | Accuracy
OCR          | X                   | X                     | 97.30%
7-gram       | 0.35%               | 13.06%                | 97.49%
LSTM 6-gram  | 0.1%                | 7.33%                 | 97.5%
BF 4-gram    | 0.08%               | 9.54%                 | 97.57%

(Columns: the rate at which correct OCR results are changed to wrong,
the rate at which incorrect OCR results are changed to right, and the
overall accuracy of OCR combined with the text model.)

• The backward and forward 4-gram has the best
performance, with the lowest modification error rate
and the highest resulting accuracy
Three Text Models with
OCR Top-5 Candidate Words
• The OCR software we use is a convolutional neural
network model that computes classification
probabilities through a softmax function
• When the probability of the OCR Top-1 candidate is
below 95%, the word is judged as possibly wrong and
the mixed model is applied
• Among the OCR Top-5 candidate words, pick the one
with the highest text-model score (see the sketch
below)
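A minimal sketch of this decision rule as we read it from the slide: keep the OCR Top-1 word when its probability reaches 95%, otherwise rescore the Top-5 candidates with the text model. The data structures and the scoring callback are assumptions:

```python
def correct_char(ocr_top5, text_model_score):
    """ocr_top5: list of (char, prob) sorted by OCR probability.
    text_model_score(char): the text model's score for a candidate,
    e.g. an N-gram frequency or an LSTM softmax probability."""
    top1_char, top1_prob = ocr_top5[0]
    if top1_prob >= 0.95:                 # OCR is confident: keep Top 1
        return top1_char
    # Otherwise pick the Top-5 candidate the text model scores highest.
    return max((c for c, _ in ocr_top5), key=text_model_score)

# Usage with hypothetical candidates and a stub scoring function:
candidates = [("A", 0.62), ("B", 0.21), ("C", 0.09), ("D", 0.05), ("E", 0.03)]
print(correct_char(candidates, lambda c: {"B": 120, "A": 40}.get(c, 0)))  # -> "B"
```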
Comparison of the Three Text Models
Mixed with the OCR Probability

Model        | Correct OCR → wrong | Incorrect OCR → right | Accuracy
OCR          | X                   | X                     | 97.30%
7-gram       | 0.012%              | 9%                    | 97.63%
LSTM 6-gram  | 0.13%               | 16%                   | 97.71%
BF 4-gram    | 0.009%              | 5.92%                 | 97.55%

(Columns as in the previous table.)

• The LSTM 6-gram mixed with the OCR probability
has the best performance
Conclusion: Using a Text Model
• The N-gram, backward and forward N-gram, and
LSTM N-gram text models can all increase the
accuracy of OCR
• The backward and forward 4-gram model has the
lowest modification error rate and the highest
resulting accuracy
Conclusion: Mixing Text Models with
the OCR Probability
• Combining the OCR Top-5 candidate words and the
Top-1 probability with a text model achieves better
results than using a text model alone
• Mixing the LSTM 6-gram with the OCR probability
gives the highest accuracy
Thank you for listening
