ASQRM II
Architecture
The Language Identifier is passed a buffer of mystery text, of arbitrary size, from the user environment,
and it returns a buffer of information indicating the natural language in which the
text is written.
In a production environment, a Language Identifier, or several of them, can be
enslaved to one or more schedulers, just like ASK and TransMatic. Users who
want a text translated, but who don't know the source language, can then send the
text for language identification rather than translation.
The Language Identifier is based on mathematical source-language models that have
been developed in the field of cryptanalysis for help in breaking ciphers. The prototype
has shown that these mathematical models can be successfully modified, reinterpreted
and reapplied to the problem of natural-language identification.
Automatic language identification is possible because alphabetically written natural
languages are highly non-random and consistent in the letters and letter sequences
that they use; and equally important, different languages differ consistently in the
letters and letter sequences used. In other words, each language uses a unique or
very characteristic alphabet, and letters in the alphabet appear with surprisingly
consistent frequencies in any statistically significant text; in addition, the frequencies of
occurrence of sequences of two, three, four,
five and more letters are characteristically stable within, and diverse among,
natural languages.
Where text is correctly spelled, even inside a computer, the presence or absence
of characteristic letters (or the integer mappings) is often quite sufficient for
language identification. When accentuation is lacking, the relative frequencies of
individual letters are often revealing.
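As a rough illustration of the letter and letter-sequence statistics described above, the sketch below computes relative n-gram frequencies for two short samples. The function name `ngram_frequencies` and both sample sentences are invented for this example; they are not part of the original system, and samples this short are far from statistically significant.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int) -> dict[str, float]:
    """Relative frequency of each letter n-gram in `text`.

    Non-letters are replaced by a single space, so n-grams may span
    word boundaries but never contain digits or punctuation.
    """
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " ".join(cleaned.split())  # collapse runs of spaces
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


# Toy samples, invented for this sketch only.
english = "the quick brown fox jumps over the lazy dog and then the cat"
german = "der schnelle braune fuchs springt über den faulen hund und die katze"

for name, sample in (("english", english), ("german", german)):
    letters = ngram_frequencies(sample, 1)
    top = sorted(letters, key=letters.get, reverse=True)[:3]
    print(name, "most frequent symbols:", top)
    print(name, "contains the characteristic letter 'ü':", "ü" in letters)
```

The presence of a characteristic letter such as "ü" is the cheap test; the frequency table is the fallback when such letters are missing.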
The input image supplied by the user shall be passed through the image
acquisition model, which derives the text from the image, and the output shall be given to the
language identification model for further processing.
The OCR module shall have a preprocessing submodule and a segmentation
submodule. The LI module shall have a preprocessing submodule, a trigram segmenter
and a corpus matcher that shall allow for pattern matching between the trigrams
of the given text and the language corpus trigrams to find the winning trigram.
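The LI module described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the names `trigrams`, `build_profile` and `identify`, the profile size of 50, the overlap-count score and the two toy corpora are all hypothetical.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Preprocess lightly, then count overlapping character trigrams."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def build_profile(corpus: str, size: int = 50) -> set[str]:
    """Keep the `size` most common trigrams of a reference corpus."""
    return {gram for gram, _ in trigrams(corpus).most_common(size)}


def identify(text: str, profiles: dict[str, set[str]]) -> str:
    """Return the language whose profile covers the most trigrams of `text`."""
    counts = trigrams(text)
    scores = {
        lang: sum(c for gram, c in counts.items() if gram in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Toy corpora, invented for this sketch only.
profiles = {
    "english": build_profile("the cat and the dog are in the house with the children"),
    "spanish": build_profile("el gato y el perro están en la casa con los niños"),
}
print(identify("the children are with the dog", profiles))
print(identify("los niños están con el perro", profiles))
```

A simple overlap count is used here for readability; a real corpus matcher would be built from much larger corpora.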
Data Flow Diagrams:
Use Case Diagram:
Activity Diagram:
Class Diagram:
Module Implementation
When the user uploads an input image, the application shall first perform some preprocessing on the image to make it
easier to identify the text in it.
Gaussian Blur:
It calculates the modified value of each pixel in the image using the Gaussian function,
G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} (2)
where x is the distance of the point from the origin along the x axis, y is its distance from the origin along the y axis, and σ
is the standard deviation of the Gaussian distribution. This function smooths the image and reduces noise.
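A minimal pure-Python sketch of this step is shown below. It samples the two-dimensional Gaussian on a small grid and convolves the image with it. The function names, the kernel radius and the border clamping are assumptions made for illustration; because the kernel is rescaled to sum to 1, the constant factor in front of the exponential cancels out.

```python
import math


def gaussian_kernel(radius: int, sigma: float) -> list[list[float]]:
    """Sample G(x, y) on a (2*radius+1) x (2*radius+1) grid, normalised to sum to 1."""
    raw = [
        [
            math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
            for x in range(-radius, radius + 1)
        ]
        for y in range(-radius, radius + 1)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def gaussian_blur(image, radius: int = 1, sigma: float = 1.0):
    """Convolve a grayscale image (list of rows) with the kernel, clamping at borders."""
    kernel = gaussian_kernel(radius, sigma)
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    rr = min(max(r + dy, 0), h - 1)  # clamp row index
                    cc = min(max(c + dx, 0), w - 1)  # clamp column index
                    acc += image[rr][cc] * kernel[dy + radius][dx + radius]
            out[r][c] = acc
    return out


# A single bright pixel is spread over its neighbours.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 255.0
blurred = gaussian_blur(spike)
print(round(blurred[2][2], 2), round(blurred[2][1], 2))
```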
Binarization:
In layman’s terms, binarization means converting a coloured image into an image that consists of only black and white
pixels (black pixel value = 0 and white pixel value = 255). As a basic rule, this can be done by fixing a threshold (normally
threshold = 127, roughly the midpoint of the pixel range 0–255). If the pixel value is greater than the threshold, it is considered
a white pixel; otherwise it is considered a black pixel.
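The fixed-threshold rule can be sketched in a few lines. The grayscale conversion uses the common luminance weights 0.299, 0.587 and 0.114, which is an assumption on our part since the slides do not say how colour is reduced to a single channel; the function names are also hypothetical.

```python
def to_gray(rgb):
    """Convert rows of (R, G, B) tuples to grayscale using luminance weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row] for row in rgb]


def binarize(gray, threshold: int = 127):
    """Map pixels above `threshold` to 255 (white) and all others to 0 (black)."""
    return [[255 if px > threshold else 0 for px in row] for row in gray]


colour = [[(255, 255, 255), (200, 30, 30), (0, 0, 0)]]
print(binarize(to_gray(colour)))
```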
Adaptive Thresholding: This method computes a threshold for each small region of the image from the characteristics of its
local neighbourhood, i.e., there is no single fixed threshold for the whole image; every small region has its own
threshold, which also gives smooth transitions between regions.
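One common way to realise this idea is mean-based adaptive thresholding, sketched below: each pixel is compared with the mean of its neighbourhood minus a small offset. The block size, the offset and the function name are illustrative assumptions, not values taken from the project.

```python
def adaptive_threshold(gray, block: int = 3, offset: float = 0.0):
    """Threshold each pixel against the mean of its block x block neighbourhood."""
    h, w = len(gray), len(gray[0])
    half = block // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [
                gray[rr][cc]
                for rr in range(max(r - half, 0), min(r + half + 1, h))
                for cc in range(max(c - half, 0), min(c + half + 1, w))
            ]
            local_mean = sum(window) / len(window)
            out[r][c] = 255 if gray[r][c] > local_mean - offset else 0
    return out


# Unevenly lit strip: the background brightens from left to right and
# there are two darker "ink" pixels, at columns 2 and 6.
row = [60, 80, 40, 120, 140, 160, 120, 200, 220]
strip = [row[:] for _ in range(3)]
print(adaptive_threshold(strip, block=3, offset=15)[1])
```

On this strip a fixed threshold of 127 would turn the whole dark left side black, whereas the local threshold isolates only the two ink pixels.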
2) Skew Correction:
A scanned document is sometimes slightly skewed (the image is aligned at an angle to the horizontal).
Detecting and correcting the skew is crucial when extracting information from the scanned image.
In this method, we first take the binary image, then:
● Project it horizontally (taking the sum of pixels along the rows of the image matrix) to get a histogram of pixels
along the height of the image, i.e., the count of foreground pixels in every row.
● Rotate the image through a range of angles (in small steps called delta) and calculate the difference
between the peaks of the histogram at each angle (variance can also be used as the metric). The angle at which
the difference between peaks (or the variance) is largest is the skew
angle of the image.
● After finding the skew angle, correct the skew by rotating the image through an angle equal to
the skew angle in the opposite direction.
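The three steps above can be sketched as follows, using variance of the row histogram as the metric. To keep the example self-contained it rotates the foreground coordinates rather than calling an image library; the function names, the search range of ±10 degrees, the step of 0.5 degrees and the synthetic test page are all assumptions for illustration.

```python
import math


def profile_variance(points, angle_deg, height, cy):
    """Variance of the row histogram after rotating the foreground points."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    hist = [0] * height
    for dx, dy in points:
        row = round(cy + dx * sin_t + dy * cos_t)  # rotated y-coordinate
        if 0 <= row < height:
            hist[row] += 1
    mean = sum(hist) / height
    return sum((v - mean) ** 2 for v in hist) / height


def estimate_correction_angle(binary, limit=10.0, delta=0.5):
    """Search [-limit, limit] degrees in steps of `delta` for the sharpest profile."""
    h, w = len(binary), len(binary[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    # Foreground pixels are nonzero; coordinates are relative to the centre.
    points = [(x - cx, y - cy) for y in range(h) for x in range(w) if binary[y][x]]
    steps = int(round(2 * limit / delta))
    angles = [-limit + i * delta for i in range(steps + 1)]
    return max(angles, key=lambda a: profile_variance(points, a, h, cy))


def rotate(binary, angle_deg):
    """Rotate about the centre with nearest-neighbour sampling (inverse mapping)."""
    h, w = len(binary), len(binary[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            sx = round(cx + dx * cos_t + dy * sin_t)
            sy = round(cy - dx * sin_t + dy * cos_t)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = binary[sy][sx]
    return out


# Synthetic page: three one-pixel "text lines" skewed by 5 degrees.
SIZE = 60
page = [[0] * SIZE for _ in range(SIZE)]
for y0 in (15, 30, 45):
    for x in range(SIZE):
        y = round(y0 + (x - (SIZE - 1) / 2) * math.tan(math.radians(5)))
        page[y][x] = 1

angle = estimate_correction_angle(page)
deskewed = rotate(page, angle)
print("rotation that straightens the lines:", angle, "degrees")
```

The returned value is the correcting rotation, so its sign is the opposite of the skew. A smaller delta gives a finer estimate at the cost of more candidate angles.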
3) Noise Removal:
The main objective of the noise removal stage is to smooth the image by removing small dots or patches that
have higher intensity than the rest of the image. Noise removal can be performed on both coloured and binary
images.
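The slides do not name a specific noise-removal filter. As one plausible choice, the sketch below uses a median filter, which replaces each pixel by the median of its neighbourhood and so removes isolated dots while keeping larger strokes. The function name and the 3 x 3 window are assumptions.

```python
def median_filter(image, size: int = 3):
    """Replace each pixel by the median of its size x size neighbourhood."""
    h, w = len(image), len(image[0])
    half = size // 2
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            window = sorted(
                image[rr][cc]
                for rr in range(max(r - half, 0), min(r + half + 1, h))
                for cc in range(max(c - half, 0), min(c + half + 1, w))
            )
            # Upper median: the result is always a value present in the window,
            # so a binary image stays binary.
            out[r][c] = window[len(window) // 2]
    return out


# A 7 x 7 image with a solid 3 x 3 stroke and one isolated speck.
noisy = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        noisy[r][c] = 255
noisy[0][6] = 255  # isolated high-intensity dot

cleaned = median_filter(noisy)
print(cleaned[0][6], cleaned[3][3])
```

The filter also rounds off the corners of the stroke, which is the usual trade-off of median filtering with small shapes.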
Word Level Segmentation: At this level of segmentation, we are given an image containing a single line (segmented in
the previous step) that consists of a sequence of words. The objective of word level segmentation is to segment the image
into words.
If we project the binary image vertically:
● Columns that contain text have a high number of foreground pixels, which correspond to higher peaks in the histogram.
● Columns that fall in the gaps between words have a high number of background pixels, which correspond to lower
peaks in the histogram.
● Columns that correspond to lower peaks in the histogram can be selected as the segmenting lines that separate the words.
When segmenting words, lower peaks should be selected only if they span at least a certain width (the threshold).
This is because some lower peaks correspond to the gaps between disconnected characters within a word, which we
are not interested in. Since the gaps between words are wider than the gaps between characters within a word,
the threshold should be chosen so that the thin gaps between characters within a word are ignored.
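A minimal sketch of this procedure is given below. It assumes a binary line image in which foreground pixels are 1 and background pixels are 0, and it treats a run of empty columns as a word boundary only when it is at least `min_gap` columns wide. The function name and the `min_gap` parameter are hypothetical.

```python
def segment_words(binary, min_gap: int = 3):
    """Return (start, end) column ranges of words; `end` is exclusive."""
    width = len(binary[0])
    # Vertical projection: number of foreground pixels in each column.
    profile = [sum(row[c] for row in binary) for c in range(width)]

    words, start, gap = [], None, 0
    for c, ink in enumerate(profile):
        if ink:
            if start is None:
                start = c  # a new word begins here
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # gap is wide enough to separate words
                words.append((start, c - gap + 1))
                start, gap = None, 0
    if start is not None:  # close the last word at the image edge
        words.append((start, width - gap))
    return words


# One text line: two "words", the first with a thin gap between characters.
line = [[int(ch) for ch in "11011000110"] for _ in range(4)]
print(segment_words(line, min_gap=3))  # [(0, 5), (8, 10)]
print(segment_words(line, min_gap=1))  # [(0, 2), (3, 5), (8, 10)]
```

With the larger threshold the one-column gap inside the first word is ignored; with a threshold of 1 the word is wrongly split into two.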
For our language identification problem, we use character 3-grams, or trigrams (i.e., sequences of 3 consecutive characters).
We see an example of how sentences can be vectorised using trigrams. First, we get all the trigrams from the sentences.
To reduce the feature space, we take a subset of these trigrams. We use this subset to vectorise the sentences.
The process for creating our trigram feature matrix is similar. The steps taken are:
● Using the training set, select the 200 most common trigrams from each language.
● Create a list of unique trigrams from these trigrams. The languages share some trigrams, so we end up
with 663 unique trigrams.
● Create a feature matrix by counting the number of times each trigram occurs in each sentence.
We now have the datasets in a form ready to be used to train our neural network. A softmax activation function is used in the
model’s output layer. This means we have to transform our list of target variables into a list of one-hot encodings.
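The feature-building steps above can be sketched on a toy data set as follows. The function names, the four sample sentences and the reduced `top_k=5` are invented for illustration; the slides use the 200 most common trigrams per language.

```python
from collections import Counter


def char_trigrams(sentence):
    """All overlapping sequences of 3 consecutive characters."""
    return [sentence[i:i + 3] for i in range(len(sentence) - 2)]


def build_vocabulary(sentences, labels, top_k=200):
    """Union of the `top_k` most common trigrams of each language, sorted."""
    per_language = {}
    for sentence, label in zip(sentences, labels):
        per_language.setdefault(label, Counter()).update(char_trigrams(sentence))
    vocab = set()
    for counts in per_language.values():
        vocab.update(gram for gram, _ in counts.most_common(top_k))
    return sorted(vocab)


def feature_matrix(sentences, vocabulary):
    """One row per sentence, one column per vocabulary trigram (raw counts)."""
    index = {gram: i for i, gram in enumerate(vocabulary)}
    matrix = []
    for sentence in sentences:
        row = [0] * len(vocabulary)
        for gram in char_trigrams(sentence):
            if gram in index:
                row[index[gram]] += 1
        matrix.append(row)
    return matrix


def one_hot(labels):
    """Return the sorted class list and one one-hot vector per label."""
    classes = sorted(set(labels))
    return classes, [[1 if label == cls else 0 for cls in classes] for label in labels]


# Toy training set, invented for this sketch only.
sentences = ["the cat sat", "the dog ran", "le chat dort", "le chien court"]
labels = ["eng", "eng", "fra", "fra"]

vocab = build_vocabulary(sentences, labels, top_k=5)
X = feature_matrix(sentences, vocab)
classes, Y = one_hot(labels)
print(len(vocab), "features;", classes, Y[0], Y[2])
```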
Before choosing the final model structure, I did a bit of hyperparameter tuning. I
varied the number of nodes in the hidden layers, the number of epochs and the
batch size. The hyperparameter combination that achieved the highest accuracy
on the validation set was chosen for the final model.
The final model has 3 hidden layers with 500, 500 and 250 nodes respectively.
The output layer has 6 nodes, one for each language. The hidden layers all have
ReLU activation functions and, as mentioned, the output layer has a softmax
activation function. We train this model using 4 epochs and a batch size of 100.
Using our training set and one-hot encoded target variable list, we train this DNN.
In the end, we achieve a training accuracy of 99.70%.
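To make the model structure concrete, the sketch below defines the two activation functions and counts the trainable parameters of a fully connected network with these layer sizes. The input width of 663 is an assumption (one input per unique trigram feature), and the helper names are hypothetical; this is not the training code itself.

```python
import math


def relu(values):
    """ReLU: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed input width of 663, then the hidden layers and the 6-way output.
LAYERS = [663, 500, 500, 250, 6]
print(count_parameters(LAYERS))  # 709256

# The output layer turns 6 raw scores into one probability per language.
probabilities = softmax([2.0, 0.5, 0.1, -1.0, 0.0, 0.3])
print(max(range(6), key=probabilities.__getitem__))  # 0
```

Almost half of the parameters sit in the first layer, which is why reducing the trigram feature space matters for model size.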
Demo
Input Image:
Preprocessed flipped image

More Related Content

Similar to LSDI 2.pptx

Image Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrImage Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrMelanie Smith
 
Optical Character Recognition
Optical Character RecognitionOptical Character Recognition
Optical Character RecognitionNitin Vishwari
 
OCR for Gujarati Numeral using Neural Network
OCR for Gujarati Numeral using Neural NetworkOCR for Gujarati Numeral using Neural Network
OCR for Gujarati Numeral using Neural Networkijsrd.com
 
Inpainting scheme for text in video a survey
Inpainting scheme for text in video   a surveyInpainting scheme for text in video   a survey
Inpainting scheme for text in video a surveyeSAT Journals
 
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...CSCJournals
 
IRJET- Document Layout analysis using Inverse Support Vector Machine (I-SV...
IRJET- 	  Document Layout analysis using Inverse Support Vector Machine (I-SV...IRJET- 	  Document Layout analysis using Inverse Support Vector Machine (I-SV...
IRJET- Document Layout analysis using Inverse Support Vector Machine (I-SV...IRJET Journal
 
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...IRJET Journal
 
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfHandwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfSachin414679
 
Image compression using negative format
Image compression using negative formatImage compression using negative format
Image compression using negative formateSAT Journals
 
Image compression using negative format
Image compression using negative formatImage compression using negative format
Image compression using negative formateSAT Publishing House
 
Design and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontDesign and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontIRJET Journal
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEIRJET Journal
 
IRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text DetectionIRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text DetectionIRJET Journal
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learningijtsrd
 
Wordoku Puzzle Solver - Image Processing Project
Wordoku Puzzle Solver - Image Processing ProjectWordoku Puzzle Solver - Image Processing Project
Wordoku Puzzle Solver - Image Processing ProjectSurya Chandra
 
Handwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioHandwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioIRJET Journal
 
Indian sign language recognition system
Indian sign language recognition systemIndian sign language recognition system
Indian sign language recognition systemIRJET Journal
 

Similar to LSDI 2.pptx (20)

Image Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrImage Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd Aswdr
 
Optical Character Recognition
Optical Character RecognitionOptical Character Recognition
Optical Character Recognition
 
OCR for Gujarati Numeral using Neural Network
OCR for Gujarati Numeral using Neural NetworkOCR for Gujarati Numeral using Neural Network
OCR for Gujarati Numeral using Neural Network
 
Inpainting scheme for text in video a survey
Inpainting scheme for text in video   a surveyInpainting scheme for text in video   a survey
Inpainting scheme for text in video a survey
 
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
 
IRJET- Document Layout analysis using Inverse Support Vector Machine (I-SV...
IRJET- 	  Document Layout analysis using Inverse Support Vector Machine (I-SV...IRJET- 	  Document Layout analysis using Inverse Support Vector Machine (I-SV...
IRJET- Document Layout analysis using Inverse Support Vector Machine (I-SV...
 
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...
Document Layout analysis using Inverse Support Vector Machine (I-SVM) for Hin...
 
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdfHandwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
Handwriting_Recognition_using_KNN_classificatiob_algorithm_ijariie6729 (1).pdf
 
Image compression using negative format
Image compression using negative formatImage compression using negative format
Image compression using negative format
 
Image compression using negative format
Image compression using negative formatImage compression using negative format
Image compression using negative format
 
Assignment-1-NF.docx
Assignment-1-NF.docxAssignment-1-NF.docx
Assignment-1-NF.docx
 
C04741319
C04741319C04741319
C04741319
 
Design and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontDesign and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English Font
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
IRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text DetectionIRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text Detection
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
 
Wordoku Puzzle Solver - Image Processing Project
Wordoku Puzzle Solver - Image Processing ProjectWordoku Puzzle Solver - Image Processing Project
Wordoku Puzzle Solver - Image Processing Project
 
Handwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioHandwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with Audio
 
Indian sign language recognition system
Indian sign language recognition systemIndian sign language recognition system
Indian sign language recognition system
 

Recently uploaded

Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...akbard9823
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts servicevipmodelshub1
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Roomdivyansh0kumar0
 
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With RoomVIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Roomdivyansh0kumar0
 
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girladitipandeya
 
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130  Available With RoomVIP Kolkata Call Girl Kestopur 👉 8250192130  Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Roomdivyansh0kumar0
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Denver Web Design brochure for public viewing
Denver Web Design brochure for public viewingDenver Web Design brochure for public viewing
Denver Web Design brochure for public viewingbigorange77
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Roomgirls4nights
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Roomishabajaj13
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 

Recently uploaded (20)

Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With RoomVIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
 
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
 
sasti delhi Call Girls in munirka 🔝 9953056974 🔝 escort Service-
sasti delhi Call Girls in munirka 🔝 9953056974 🔝 escort Service-sasti delhi Call Girls in munirka 🔝 9953056974 🔝 escort Service-
sasti delhi Call Girls in munirka 🔝 9953056974 🔝 escort Service-
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
 
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130  Available With RoomVIP Kolkata Call Girl Kestopur 👉 8250192130  Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
 
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICECall Girls In South Ex 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
Call Girls In South Ex 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SERVICE
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Denver Web Design brochure for public viewing
Denver Web Design brochure for public viewingDenver Web Design brochure for public viewing
Denver Web Design brochure for public viewing
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 

LSDI 2.pptx

  • 2. Architecture It is passed a buffer of mystery text, of arbitrary size, from the user environment, and it returns a buffer of information indicating the natural language in which the text is written. In a production environment, a Language Identifier, or several of them, can be enslaved to one or more schedulers, just like ASK and TransMatic. Users who want a text translated, but who don't know the source language, can then send the text for language identification rather than translation.
  • 3. Architecture The Language Identifier is based on mathematical source-language models that have been developed in the eld of cryptanalysis for help in breaking ciphers. The prototype has shown that these mathematical models can be successfully modified, reinterpreted and reapplied to the problem of natural-language identification. Automatic language identification is possible because alphabetically written natural languages are highly non-random and consistent in the letters and letter sequences that they use; and equally important, different languages differ consistently in the letters and letter sequences used. In other words, each language uses a unique or very characteristic alphabet, and letters in the alphabet appear with surprisingly consistent frequencies in any statistically significant text; in addition, the frequencies of occurrence of sequences of two, three, four,
  • 4. Architecture five and more letters are characteristically stable within, and diverse among, natural languages. Where text is correctly spelled, even inside a computer, the presence or absence of characteristic letters (or the integer mappings) is often quite sufficient for language identification. When accentuation is lacking, the relative frequencies of individual letters are often revealing.
  • 5. Architecture The input image taken from the user shall be passed through the image acquisition model, its text derived from the image and the output given to the language identification model for further processing. The OCR module shall have a preprocessing submodule and a segmentation module. The LI module shall have a preprocessing submodule, trigram segmenter and corpus matcher that shall allow for the pattern matching between the ttrigrams of the given text and the language corpus trigrams to find the winning trigram.
  • 14. Module Implementation When any input image is uploaded by the user, the application shall first perform so pre processing on the image to make it easier to identify the text in it. Gaussian Blur: It calculates the modified pixel value for each pixel in the image using the Gaussian function, G(x, y) = 1 √ 2πσ2 e − x 2+y 2 2σ2 (2) where x shows how far the point is from the origin in the x axis, y shows how far the point is from the origin on y axis, and σ is the distributed Gaussian deviation. This function smooths the image and eliminates noise. Binarization: In layman’s terms Binarization means converting a coloured image into an image which consists of only black and white pixels (Black pixel value=0 and White pixel value=255). As a basic rule, this can be done by fixing a threshold (normally threshold=127, as it is exactly half of the pixel range 0–255). If the pixel value is greater than the threshold, it is considered as a white pixel, else considered as a black pixel. Adaptive Thresholding: This method gives a threshold for a small part of the image depending on the characteristics of its locality and neighbours i.e there is no single fixed threshold for the whole image but every small part of the image has a different threshold depending upon the locality and also gives smooth transition.
  • 15. Module Implementation 2) Skew Correction: A scanned document may be slightly skewed (the image is aligned at an angle to the horizontal). Detecting and correcting the skew is crucial when extracting information from the scanned image. In this method, we first take the binary image, then ● project it horizontally (sum the pixels along each row of the image matrix) to get a histogram of pixels along the height of the image, i.e. the count of foreground pixels in every row. ● Rotate the image through a range of angles (in small steps called Delta) and calculate the difference between the histogram peaks at each angle (the variance can also be used as the metric). The angle at which the difference between peaks (or the variance) is largest is the skew angle of the image. ● After finding the skew angle, correct the skew by rotating the image through an equal angle in the opposite direction. 3) Noise Removal: The main objective of the noise removal stage is to smooth the image by removing small dots or patches that have a higher intensity than the rest of the image. Noise removal can be performed on both coloured and binary images.
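The projection-profile search for the skew angle can be sketched as follows. This is a simplified illustration under several assumptions: the page is given as a list of foreground pixel coordinates `(x, y)` rather than an image matrix, the coordinates are rotated instead of the image itself, and the sum of squared row counts is used as a stand-in for the variance of the histogram (both grow as the foreground pixels concentrate into fewer rows). The names `profile_score` and `estimate_skew`, the ±10 degree search range and the synthetic test page are hypothetical.

```python
import math

def profile_score(points, angle_deg):
    """Sharpness of the horizontal projection after rotating by angle_deg.

    Returns the sum of squared row counts: high when the foreground
    pixels fall into few rows, i.e. when the text lines are level.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        row = round(y * cos_t - x * sin_t)  # row index of the rotated point
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())

def estimate_skew(points, limit=10.0, delta=1.0):
    """Try every angle in [-limit, limit] in steps of delta; keep the sharpest."""
    steps = int(round(2 * limit / delta))
    candidates = [-limit + i * delta for i in range(steps + 1)]
    return max(candidates, key=lambda angle: profile_score(points, angle))

# Synthetic page: five text lines, each drawn with a 5 degree slope.
slope = math.tan(math.radians(5))
page = [(x, round(20 * k + x * slope)) for k in range(5) for x in range(200)]
skew = estimate_skew(page)
```

In this sketch the returned angle is the rotation that levels the lines under the convention used in `profile_score`; a smaller `delta` gives a finer estimate at the cost of more candidate angles to score.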
  • 16. Module Implementation Word Level Segmentation: At this level of segmentation, we are given an image containing a single line (segmented in the previous step), which consists of a sequence of words. The objective of word level segmentation is to segment the image into words. If we project the binary image vertically, ● Columns that contain text have a high number of foreground pixels, which correspond to peaks in the histogram. ● Columns that fall in the gaps between words have a high number of background pixels, which correspond to valleys in the histogram. ● Columns that correspond to valleys in the histogram can be selected as the segmenting lines that separate the words. When segmenting words, a valley should be selected only if it spans at least a certain width (the threshold). This is because some valleys correspond to the gaps between disconnected characters within a word, which we are not interested in. Since the gaps between words are wider than the gaps between characters within a word, the threshold should be chosen so that the thin gaps between characters are ignored.
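A minimal sketch of this vertical-projection rule is shown below. It assumes the line image is a list of rows in which 1 marks a foreground (text) pixel and 0 a background pixel; the function name `segment_words` and the default `min_gap` of 3 columns are illustrative choices, not values from the slides.

```python
def segment_words(binary_line, min_gap=3):
    """Split a single-line binary image into word column spans.

    Returns (start, end) column pairs with end exclusive. A run of empty
    columns only separates two words if it is at least min_gap wide;
    narrower runs are treated as gaps between characters of one word.
    """
    width = len(binary_line[0])
    # Vertical projection: number of foreground pixels in each column.
    profile = [sum(row[col] for row in binary_line) for col in range(width)]

    words, start, gap = [], None, 0
    for col, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = col          # first column of a new word
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:       # the valley is wide enough: close the word
                words.append((start, col - gap + 1))
                start, gap = None, 0
    if start is not None:            # word running up to the right edge
        words.append((start, width - gap))
    return words
```

For example, `segment_words([[1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0]])` keeps the one-column gap inside the first word and splits at the three-column gap, giving two spans.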
  • 17. Module Implementation For our language identification problem, we use character 3-grams/trigrams (i.e. sequences of 3 consecutive characters). Sentences can be vectorised using trigrams as follows. First, we get all the trigrams from the sentences. To reduce the feature space, we take a subset of these trigrams. We then use this subset to vectorise the sentences. The process for creating our trigram feature matrix is similar. The steps taken are: ● Using the training set, select the 200 most common trigrams from each language. ● Create a list of unique trigrams from these trigrams. The languages share some trigrams, so we end up with 663 unique trigrams. ● Create a feature matrix by counting the number of times each trigram occurs in each sentence. We now have the datasets in a form ready to be used to train our neural network. A softmax activation function is used in the model's output layer, which means we have to transform our list of target variables into a list of one-hot encodings.
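The feature-building steps above can be sketched with the standard library alone. The helper names (`trigrams`, `build_vocabulary`, `vectorise`, `one_hot`) and the dictionary layout of the training data are assumptions made for illustration; only the value 200 for the number of trigrams kept per language comes from the slides.

```python
from collections import Counter

def trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def build_vocabulary(sentences_by_language, top_n=200):
    """Union of the top_n most common trigrams of each language."""
    vocabulary = set()
    for sentences in sentences_by_language.values():
        counts = Counter(t for sentence in sentences for t in trigrams(sentence))
        vocabulary.update(t for t, _ in counts.most_common(top_n))
    return sorted(vocabulary)

def vectorise(sentence, vocabulary):
    """One feature-matrix row: how often each vocabulary trigram occurs."""
    counts = Counter(trigrams(sentence))
    return [counts[t] for t in vocabulary]

def one_hot(label, labels):
    """One-hot encoding of a language label, as needed for a softmax output."""
    return [1 if label == candidate else 0 for candidate in labels]
```

Because the per-language lists overlap, the vocabulary is smaller than 200 times the number of languages, which is how six languages can yield 663 unique trigrams rather than 1,200.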
  • 18. Module Implementation Before choosing the final model structure, we did some hyperparameter tuning, varying the number of nodes in the hidden layers, the number of epochs and the batch size. The hyperparameter combination that achieved the highest accuracy on the validation set was chosen for the final model. The final model has 3 hidden layers with 500, 500 and 250 nodes respectively. The output layer has 6 nodes, one for each language. The hidden layers all have ReLU activation functions and, as mentioned, the output layer has a softmax activation function. We train this model for 4 epochs with a batch size of 100, using our training set and one-hot encoded target variable list to train the DNN. In the end, we achieve a training accuracy of 99.70%.
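To make the layer structure concrete, here is a minimal forward-pass sketch in plain Python. It is not the training code: the weights are random and untrained, the initialisation range and all function names are assumptions, and the input width of 663 is taken from the number of unique trigrams mentioned earlier. It only shows how a 663-feature vector flows through the 500, 500 and 250 node ReLU layers into a 6-way softmax.

```python
import math
import random

# Input trigram counts, three hidden layers, one output node per language.
LAYER_SIZES = [663, 500, 500, 250, 6]

def init_layers(sizes, seed=0):
    """Random (untrained) weights and zero biases for each dense layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, features):
    """ReLU on every hidden layer, softmax on the final layer."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        activations = softmax(z) if index == len(layers) - 1 else relu(z)
    return activations

def parameter_count(sizes):
    """Weights plus biases over all dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

Calling `forward(init_layers(LAYER_SIZES), row)` on one feature-matrix row returns six probabilities that sum to 1, and `parameter_count(LAYER_SIZES)` shows that a network of this shape has 709,256 trainable parameters.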
  • 20. Demo
  • 21. Demo