Architecture
The Language Identifier is passed a buffer of mystery text, of arbitrary size, from the
user environment, and it returns a buffer of information indicating the natural language
in which the text is written.
In a production environment, a Language Identifier, or several of them, can run
under one or more schedulers, just like ASK and TransMatic. Users who want a text
translated, but who do not know its source language, can then send the text for
language identification rather than translation.
The Language Identifier is based on mathematical source-language models that have
been developed in the field of cryptanalysis to help break ciphers. The prototype
has shown that these mathematical models can be successfully modified, reinterpreted
and reapplied to the problem of natural-language identification.
Automatic language identification is possible because alphabetically written natural
languages are highly non-random and consistent in the letters and letter sequences
that they use; and equally important, different languages differ consistently in the
letters and letter sequences used. In other words, each language uses a unique or
very characteristic alphabet, and letters in the alphabet appear with surprisingly
consistent frequencies in any statistically significant text; in addition, the frequencies of
occurrence of sequences of two, three, four, five, and more letters are characteristically
stable within, and diverse among,
natural languages.
Where text is correctly spelled, even inside a computer, the presence or absence
of characteristic letters (or their integer mappings) is often quite sufficient for
language identification. When such accentuation is lacking, the relative frequencies of
individual letters are often revealing.
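The characteristic-letter idea above can be sketched in a few lines of Python; the letter sets and language labels here are illustrative assumptions, not the system's actual tables:

```python
# Toy sketch of identification by characteristic letters; the
# letter sets below are illustrative, not an exhaustive table.
CHARACTERISTIC = {
    "German": set("äöüß"),
    "Spanish": set("ñ¿¡"),
    "French": set("àâçèêëîôûœ"),
}

def identify_by_letters(text):
    # Return the first language whose characteristic letters
    # appear in the text, or None if none do.
    for language, letters in CHARACTERISTIC.items():
        if letters & set(text.lower()):
            return language
    return None
```

When no characteristic letters are present, the fallback described above is to compare relative letter frequencies instead.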
The input image taken from the user shall be passed through the image
acquisition module, its text derived from the image, and the output given to the
language identification module for further processing.
The OCR module shall have a preprocessing submodule and a segmentation
submodule. The LI module shall have a preprocessing submodule, a trigram segmenter,
and a corpus matcher that shall allow pattern matching between the trigrams
of the given text and the language-corpus trigrams to find the winning trigram.
Module Implementation
When an input image is uploaded by the user, the application shall first perform some preprocessing on the image to make it
easier to identify the text in it.
Gaussian Blur:
It calculates the modified value of each pixel in the image using the Gaussian function

G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))     (2)

where x is the distance of the point from the origin along the x axis, y is the distance from the origin along the y axis, and σ
is the standard deviation of the Gaussian distribution. This function smooths the image and suppresses noise.
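This smoothing step can be sketched in Python with NumPy; the kernel size and σ below are illustrative choices, and a production pipeline would typically call a library routine such as cv2.GaussianBlur:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # Sample G(x, y) on a (size x size) grid centred on the origin.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()  # normalise so overall brightness is preserved

def gaussian_blur(img, size=5, sigma=1.0):
    # Naive 2-D convolution with edge padding, for illustration only.
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i+size, j:j+size] * k).sum()
    return out
```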
Binarization:
In layman’s terms, binarization means converting a coloured image into an image that consists of only black and white
pixels (black pixel value = 0 and white pixel value = 255). As a basic rule, this can be done by fixing a threshold (commonly
threshold = 127, as it is exactly half of the 0–255 pixel range). If a pixel value is greater than the threshold, it is treated
as a white pixel; otherwise it is treated as a black pixel.
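The fixed-threshold rule above amounts to a one-line NumPy operation; the default of 127 is the value from the text:

```python
import numpy as np

def binarize(gray, threshold=127):
    # Pixels above the threshold become white (255), the rest black (0).
    return np.where(gray > threshold, 255, 0).astype(np.uint8)
```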
Adaptive Thresholding: This method computes a threshold for each small region of the image from the characteristics of its
local neighbourhood; i.e., there is no single fixed threshold for the whole image, but every small region of the image has its
own threshold depending on its locality, which also gives smooth transitions.
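One common variant of this idea is mean-C adaptive thresholding, sketched here with an integral image for fast local means; the block size and constant c are illustrative choices:

```python
import numpy as np

def adaptive_threshold(gray, block=15, c=5):
    # Threshold each pixel against the mean of its (block x block)
    # neighbourhood minus a small constant c (mean-C adaptive method).
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    # Integral image: any window sum can then be read in O(1).
    ii = padded.cumsum(axis=0).cumsum(axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))
    h, w = gray.shape
    s = (ii[block:block+h, block:block+w] - ii[:h, block:block+w]
         - ii[block:block+h, :w] + ii[:h, :w])
    mean = s / (block * block)
    return np.where(gray > mean - c, 255, 0).astype(np.uint8)
```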
2) Skew Correction:
While scanning a document, it might sometimes be slightly skewed (the image aligned at a small angle to the
horizontal). When extracting information from the scanned image, detecting and correcting this skew is crucial.
In this method, we first take the binary image, then:
● project it horizontally (taking the sum of pixels along each row of the image matrix) to get a histogram of pixels
along the height of the image, i.e. the count of foreground pixels for every row;
● rotate the image at various angles (in small steps called delta) and compute the difference between the
histogram peaks (variance can also be used as the metric); the angle that yields the maximum difference
between peaks (or the maximum variance) is the skew angle of the image;
● after finding the skew angle, correct the skew by rotating the image through an angle equal to
the skew angle in the direction opposite to the skew.
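The projection-profile search described above can be sketched as follows; the nearest-neighbour rotation and the ±5° search range are simplifying assumptions, and a real pipeline would use a library rotation routine:

```python
import numpy as np

def rotate_nn(img, deg):
    # Nearest-neighbour rotation about the image centre (a library
    # routine such as scipy.ndimage.rotate would normally be used).
    h, w = img.shape
    t = np.deg2rad(deg)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find its source pixel.
    sx = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    sy = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx = np.clip(np.rint(sx), 0, w - 1).astype(int)
    sy = np.clip(np.rint(sy), 0, h - 1).astype(int)
    return img[sy, sx]

def skew_angle(binary, delta=1.0, limit=5.0):
    # Score each candidate angle by the variance of the horizontal
    # projection profile; text lines give the sharpest peaks when level.
    angles = np.arange(-limit, limit + delta, delta)
    scores = [rotate_nn(binary, a).sum(axis=1).var() for a in angles]
    return angles[int(np.argmax(scores))]
```

Correcting the skew is then a rotation by the negated detected angle.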
3) Noise Removal:
The main objective of the noise removal stage is to smoothen the image by removing small dots/patches which
have higher intensity than the rest of the image. Noise removal can be performed on both coloured and binary
images.
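One common technique consistent with this description is a median filter, sketched here (the 3×3 neighbourhood is an illustrative choice):

```python
import numpy as np

def median_filter(img, size=3):
    # Replace each pixel with the median of its neighbourhood;
    # this removes isolated high-intensity specks (salt noise).
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i+size, j:j+size])
    return out
```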
Word Level Segmentation: At this level of segmentation, we are provided with an image containing a single line (segmented in
the previous step) which consists of a sequence of words. The objective of Word Level Segmentation is to segment the image
into words.
If we vertically project the binary image:
● Columns that contain text have a high number of foreground pixels, which correspond to higher peaks in the histogram.
● Columns that fall in the gaps between words have a high number of background pixels, which correspond to lower
peaks in the histogram.
● Columns that correspond to lower peaks in the histogram can be selected as the segmenting lines to separate the words.
For segmenting words, lower peaks should be selected such that they span a certain width (threshold).
This is because we will also find lower peaks that correspond to the gaps between disconnected characters within a word, which we
are not interested in. Since the gaps between words are greater than the gaps between characters within a word,
the threshold should be chosen so that it neglects the thin gaps between characters within words.
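The gap-width rule above can be sketched as follows; the gap threshold of 3 columns is an illustrative value:

```python
import numpy as np

def segment_words(line_img, gap_threshold=3):
    # Vertical projection: count foreground pixels per column, then
    # split the line at background runs wider than gap_threshold columns.
    profile = (line_img > 0).sum(axis=0)
    words, start, gap = [], None, 0
    for col, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = col
            end, gap = col, 0
        elif start is not None:
            gap += 1
            if gap > gap_threshold:  # gap wide enough to be a word break
                words.append(line_img[:, start:end + 1])
                start = None
    if start is not None:
        words.append(line_img[:, start:end + 1])
    return words
```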
For our language identification problem, we will be using character 3-grams/trigrams (i.e. sets of 3 consecutive characters).
We see an example of how sentences can be vectorised using trigrams. Firstly, we get all the trigrams from the sentences.
To reduce the feature space, we take a subset of these trigrams. We use this subset to vectorise the sentences.
The process for creating our trigram feature matrix is similar. The steps taken are:
● Using the training set, we select the 200 most common trigrams from each language
● Create a list of unique trigrams from these trigrams. The languages share some common trigrams, and so we end up
with 663 unique trigrams
● Create a feature matrix, by counting the number of times each trigram occurs in each sentence
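The steps above can be sketched as follows (the helper names are illustrative):

```python
from collections import Counter

def trigrams(text):
    # All sets of 3 consecutive characters in the sentence.
    return [text[i:i+3] for i in range(len(text) - 2)]

def top_trigrams(sentences, n=200):
    # The n most common trigrams across a language's training sentences.
    counts = Counter()
    for s in sentences:
        counts.update(trigrams(s))
    return [t for t, _ in counts.most_common(n)]

def feature_vector(sentence, vocab):
    # Count how often each vocabulary trigram occurs in the sentence.
    counts = Counter(trigrams(sentence))
    return [counts[t] for t in vocab]
```

The full feature matrix is then one such vector per sentence, with the vocabulary being the union of the per-language top-200 lists.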
We now have the datasets in a form ready to be used to train our Neural Network. A softmax activation function is used in the
model’s output layer. This means we have to transform our list of target variables into a list of one-hot encodings.
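A minimal sketch of that one-hot transformation (the function name is illustrative):

```python
import numpy as np

def one_hot(labels, classes):
    # Map each target language to a one-hot vector over the class list.
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes)), dtype=int)
    for row, lab in enumerate(labels):
        out[row, index[lab]] = 1
    return out
```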
Before choosing the final model structure, I did a bit of hyperparameter tuning. I
varied the number of nodes in the hidden layers, the number of epochs and the
batch-size. The hyperparameter combination that achieved the highest accuracy
on the validation set was chosen for the final model.
The final model has 3 hidden layers with 500, 500 and 250 nodes respectively.
The output layer has 6 nodes, one for each language. The hidden layers all have
ReLU activation functions and, as mentioned, the output layer has a softmax
activation function. We train this model using 4 epochs and a batch size of 100.
Using our training set and one-hot encoded target variable list, we train this DNN;
in the end, we achieve a training accuracy of 99.70%.
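A sketch of the described network in Keras (assuming the 663 trigram features from earlier as the input dimension; the optimizer choice is an assumption, as the text does not specify one):

```python
from tensorflow import keras

def build_model(n_features=663, n_languages=6):
    # 3 hidden ReLU layers (500, 500, 250) and a softmax output layer.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(500, activation="relu"),
        keras.layers.Dense(500, activation="relu"),
        keras.layers.Dense(250, activation="relu"),
        keras.layers.Dense(n_languages, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training as described in the text:
# model.fit(X_train, y_onehot, epochs=4, batch_size=100)
```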
23. References
Erik Sterneberg. Language Identification of Person Names Using Cascaded SVMs. Bachelor’s Thesis, Uppsala
University, Uppsala, 2012. Marija Stupar, Tereza Juri´c, and Nikola Ljubeˇsi´c.
Language Identification of Web Data for Building Linguistic Corpora. In Proceedings of the 3rd International Conference
on The Future of Information Sciences (INFuture 2011), pages 365–372, Zagreb, Croatia, 2011. Izumi Suzuki, Yoshiki
Mikami, Ario Ohsato, and Yoshihide Chubachi.
A Language and Character Set Determination Method Based on n-gram Statistics. ACM Transactions on Asian
Language Information Processing (TALIP), 1(3):269–278, September 2002. Hidayet Tak¸cı and Ekin Ekinci. Minimal
Feature Set in Language Identification and Finding Suitable Classification Method with it.
Procedia Technology, 1:444–448, January 2012. Hidayet Tak¸cı and Tunga G¨ung¨or. A High Performance Centroid-
based Classification Approach for Language Identification. Pattern Recognition Letters, 33(16):2077–2084, December
2012.
Liling Tan, Marcos Zampieri, Nikola Ljubeˇsi´c, and J¨org Tiedemann. Merging Comparable Data Sources for the
Discrimination of Similar Languages: The DSL Corpus Collection. In Proceedings of the 7th Workshop on Building and
Using Comparable Corpora (BUCC), Reykjavik, Iceland, 2014. William John Teahan.
24. 1. Grother, Patrick J. "NIST special database 19." Handprinted forms and characters database,
National Institute of Standards and Technology (1995).
2. LeCun, Yann, Corinna Cortes, and Christopher JC Burges. "The MNIST database of handwritten digits, 1998." http://yann. lecun. com/exdb/mnist 10 (1998): 34.
3. Mouchere, Harold, et al. "Crohme2011: Competition on recognition of online handwritten
mathematical expressions." 2011 international conference on document analysis and recognition. IEEE, 2011.
4. Acharya, Shailesh, Ashok Kumar Pant, and Prashnna Kumar Gyawali. "Deep learning based
large scale handwritten Devanagari character recognition." 2015 9th International Conference
on Software, Knowledge, Information Management and Applications (SKIMA). IEEE, 2015.
5. Zhou, Shusen, Qingcai Chen, and Xiaolong Wang. "HIT-OR3C: an opening recognition corpus
for Chinese characters." Proceedings of the 9th IAPR International Workshop on Document
Analysis Systems. ACM, 2010.
6. Wang, K. and Belongie, S., 2010, September. Word spotting in the wild. In European Conference on Computer Vision (pp. 591-604). Springer, Berlin, Heidelberg.
7. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. and Ng, A.Y., 2011. Reading digits in
natural images with unsupervised feature learning