SlideShare a Scribd company logo
1 of 4
Download to read offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 214
Document Analyser Using Deep Learning
Pranali Dhawas1, Prajwal Kolhe2, Faizan Khan3, Lakshya Chauragade4, Apoorva Dhimole5
1 Assistant Professor, Department Of Artificial Intelligence, G H Raisoni College Of Engineering, Maharashtra,
India.
2345B.Tech Artificial Intelligence, G.H. Raisoni College of Engineering, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Many companies and massive organizationshavenumerousdocumentsinbulkandrequiredtokeepthem indifferent
clusters. In recent years, this job has becoming time consuming as no of document and article has increased. The Analysis of the
document is one of the subjective research technique which is used to reviewbytheanalystestimatetheidea. Withthehelpofsome
visual technique we get the most of the layout and text format in extremely in proper output format. With thehelpofthe modelwe
can sort most of the architecture document in large scale. With the help of Layout model and new strategies to interact with
various layout in any format in a single model framework. We used our own data collected through out colleagues inuniversity to
train it in specific manner that it identify and then classify marksheets and Certificates. Our model use only some of the modern
techniques for the visual language task and also for the text images to find the some similar combinationandtaskwhichmake the
simple way to detect the interaction between the different stages.
1. INTRODUCTION
In today’s world there is a huge amount of text data, document digitization is a technique which is being used in various fields
and domains that deals with huge amounts of archives. Document analyzer focuses on categorizing document on basis of text,
image and layout of document, typically documents can be classified on differentlyon numerouscontext.Whileattemptingthe
task of analysing text documents document classification is the important procedure to follow. But while doing document
classification we have to deal with some challenges like high variability among the same document or class andlowvariability
between different classes or documents. Previous studies have addressed the structural similarity between classes or
documents. Studies has also focused on extracting characteristics to make each class separate. Researchers have developed
various deep learning approaches to improve the performance of their document classifiers. In 2014 researcher trained and
proposed a simple four layer CNN from starting. Then, we usetheImagenettoimprove thelearning rateofthenetwork towork
in effective way on the model. Most recent research have shown the usage of OCR to extract the particular information and
features to with multiple algorithms to classify, NLP was also used to boost the performance of these architecture. Our
architecture uses image features, text extraction, and layout of document to overcome pervious defects and drawbacks with
significant accuracy.
2. Literature Survey
In Recent time the Nlp and CV became more popular techniques in this area and also make progress in different fields.
• The devlin et al Explained the new language and rendering the model, which is used to see the the representation of
bidirectional form the various data by the jointly the condition.
• Barcelona Supercomputing Center – Upgrading the accuracy and edit classification image direct to the parallel system.
• Linkbert - Training the language model by Using Document link.
3. Methodology
Proposed solution in this paper “Document Analyzer” attempts to predict the class of a document by analysing the ssssimage
content. To solve this challenge we have addressed three ways – Image classification problem, text classification problemand
layout identification. We propose a multi-modal Transformer model to join the document in particular text, layout definition
and visual data inside the pretraining stage, the cross-modal learns the interaction in a single framework. The class of
document has been predicted according to various features like, the design of the document, header and footer of document,
body or the matter of the document which gets extracted by the OCR techniques and how document is being formatted, all of
these features help to find the exact class of the given document. But some type of documents also have common features for
example government certificate have seal of the govt. and/or logo, which can help classify the documents.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 215
Chart -1: Working Flow
The LayoutLM architecture is divided into 3 section text embedding visual embedding and layoutembedding. In this of the
embedding the common work is use to tokenize the ocr and the sequence of the text and giving some segment. The subword is
one algorithm used in the Natural language processing. the sentences is used to check the character in the above mention input
and to see the most common combination of the various character in the sentence.
Image Processing -This module deals with basic image preprocessing steps. This intends to do basic preprocessing like image
rescaling, image resizing and compression, morphological image preprocessing like Erosion andDilation. The output of this
module will be similar pre-processed images.
ScriptRecognition OR OCR -Optical character recognition is a technique to recognize textfromaimage.Aimofthismoduleisto
extract the text this is also called as textraction. This module will give the output as the unlabeled textdata.
Natural Language Preprocessing - The output from the last module which is text data will be pre-processed in this step. Aim of
this moduleis to clean the text data from previous module, we have used several techniques like removal of stop words,
stemming, lemmatization, this will further help in feature extraction.
Feature Extraction -Feature extraction is the process in which we try to gain the features from the raw text data intonumerical
format.
Encoding -Encoding is also called as vectorization. This process deals with converting the text data into numericalvectors.
Converting text data into high dimensional vector format. We can not feed the text data to the machine learning model so we
convert the text data into numerical format.
Classification -This module will help to classify the document on the basis of extracted features. We will use classification
algorithmslike NB classifier,Logistic Regression, RandomForest, etc.thiswillgivetheoutputclassofthedocumentbyassigning
one of the class from our available classes.
4. Dataset
In order to train and evaluate document classifier, we have collected above mentioned documents from thestudents from
different backgrounds and departments of G. H. Raisoni College of Engineering, Nagpur, Maharashtra.Eachclassofdocuments
is used in intention to classify the correct details from document image while using OCR technique. Documents are collected
from different state students. As these documents are from differentstates – thelayoutsofthedocumentsaredifferentforeach
state in India. For the Document Analyzer we have used various classes to classify the document such as SSC Marksheet, HSC
Marksheet, Cast Validity and Income Certificate.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 216
Fig-1: Sample Input Image.
5. Traning Accuracy
Fig-2: Traning Accuracy And Traning Loss of the Model.
6. Software and Hardware
1.Category: Machine Learning, Deep Learning.
2.Programming Language: Python.
3.Tools & Libraries: CNN, CUDA.
4.IDE: Jupyter, Google Colab.
5.Prerequisites: Python, Machine Learning, Deep Learning, Neural Network
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 217
7.Output
As we have uploaded document as shown in the above input phase for the prediction of the document type. We got a desire
prediction as shown in below snapshot.
Fig-3: The Output of the Model
8. CONCLUSIONS
We are able to Analyze Document using Pre-Data Training to the model. The model is able to provide a specific Document and
information efficient summary. We gave a multi-model pretraining way for visual document understanding tasks.
The Document analyze the categorizing document on basis of the text, image and layout of document, typically document can
be classified on different numerous context also. We have compared the different types of document, Original as well as
photocopy document while training period and detect the output accordingly.
REFERENCES
[1] Adam W. Harley, A. U. (2015). Evaluation of Deep Convolutional Nets for Document ImageClassification and
Retrieval. Toronto, Ontario: Ryerson
[2] N. Chen and D. Blostein. A survey of document image classification: Problem statement, classifierarchitecture and
performance evaluation. IJDAR, 10(1):1–16, 2007.
[3] K. Collins-Thompson and R. Nickolov. A clustering-based algorithm for automatic documentseparation.InSIGIR,pages1–
8, 2002.K.
[4] Image and Text fusion for UPMC Food-101 using BERT and CNNs Ignazio Gallo, Gianmarco Ria,Nicola Landro, and
Riccardo La Grassa
[5] Aggregated Residual Transformations for Deep Neural Networks Saining Xie, Ross Girshick, PiotrDollár, Zhuowen Tu,
Kaiming He
[6] LinkBERT Pretraining Language Models with Document Links Michihiro Yasunaga Jure LeskovecPercy Liang Stanford
University {myasu,jure,pliang}@cs.stanford.edu
[7] C. P. Bean, “Fine particles, thin films and exchange anisotropy,” in Magnetism, vol. III, G.T. RadoandH.Suhl,Eds.NewYork:
Academic, 1963, pp. 271–350.
[8] Digitization of Soiled Historical Chinese Bamboo Scrolls. In Proceedings of the 13th IAPRInternational
[9] Workshop on Document Analysis Systems, DAS, Vienna, Austria, 24– 27 April 2018; pp. 55–60.[CrossRef]
[10] Mohammed, H.; Marthot-Santaniello, I.; Margner, V. GRK-Papyri: A Dataset of Greek Handwritingon
[11] Papyri for the Task of Writer Identification. In Proceedings of the 2019 International Conference onDocument
[12] Analysis and Recognition, ICDAR 2019, Sydney, Australia, 20–25 September 2019; pp. 726–731.[CrossRef]

More Related Content

Similar to Document Analyser Using Deep Learning

A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...
A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...
A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...IRJET Journal
 
Handwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioHandwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioIRJET Journal
 
International journal of_software_engine
International journal of_software_engineInternational journal of_software_engine
International journal of_software_engineronan messi
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET Journal
 
Text Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to TextText Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to TextIRJET Journal
 
Automatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersAutomatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersIRJET Journal
 
Named Entity Recognition using Bi-LSTM and Tenserflow Model
Named Entity Recognition using Bi-LSTM and Tenserflow ModelNamed Entity Recognition using Bi-LSTM and Tenserflow Model
Named Entity Recognition using Bi-LSTM and Tenserflow ModelIRJET Journal
 
Design and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontDesign and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontIRJET Journal
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep LearningIRJET Journal
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET Journal
 
IRJET- Image to Text Conversion using Tesseract
IRJET-  	  Image to Text Conversion using TesseractIRJET-  	  Image to Text Conversion using Tesseract
IRJET- Image to Text Conversion using TesseractIRJET Journal
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Editor IJARCET
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Editor IJARCET
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...IAESIJAI
 
Reviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringReviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringIRJET Journal
 
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...IRJET- Neural Network based Script Recognition using Wavelet Features: An App...
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...IRJET Journal
 
Named Entity Recognition (NER) Using Automatic Summarization of Resumes
Named Entity Recognition (NER) Using Automatic Summarization of ResumesNamed Entity Recognition (NER) Using Automatic Summarization of Resumes
Named Entity Recognition (NER) Using Automatic Summarization of ResumesIRJET Journal
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
 
Review of HR Recruitment Shortlisting
Review of HR Recruitment ShortlistingReview of HR Recruitment Shortlisting
Review of HR Recruitment ShortlistingIRJET Journal
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET Journal
 

Similar to Document Analyser Using Deep Learning (20)

A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...
A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...
A WEB BASED APPLICATION FOR RESUME PARSER USING NATURAL LANGUAGE PROCESSING T...
 
Handwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with AudioHandwritten Text Recognition and Translation with Audio
Handwritten Text Recognition and Translation with Audio
 
International journal of_software_engine
International journal of_software_engineInternational journal of_software_engine
International journal of_software_engine
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema Generator
 
Text Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to TextText Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to Text
 
Automatic Grading of Handwritten Answers
Automatic Grading of Handwritten AnswersAutomatic Grading of Handwritten Answers
Automatic Grading of Handwritten Answers
 
Named Entity Recognition using Bi-LSTM and Tenserflow Model
Named Entity Recognition using Bi-LSTM and Tenserflow ModelNamed Entity Recognition using Bi-LSTM and Tenserflow Model
Named Entity Recognition using Bi-LSTM and Tenserflow Model
 
Design and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English FontDesign and Description of Feature Extraction Algorithm for Old English Font
Design and Description of Feature Extraction Algorithm for Old English Font
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep Learning
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
IRJET- Image to Text Conversion using Tesseract
IRJET-  	  Image to Text Conversion using TesseractIRJET-  	  Image to Text Conversion using Tesseract
IRJET- Image to Text Conversion using Tesseract
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...
 
Reviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringReviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clustering
 
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...IRJET- Neural Network based Script Recognition using Wavelet Features: An App...
IRJET- Neural Network based Script Recognition using Wavelet Features: An App...
 
Named Entity Recognition (NER) Using Automatic Summarization of Resumes
Named Entity Recognition (NER) Using Automatic Summarization of ResumesNamed Entity Recognition (NER) Using Automatic Summarization of Resumes
Named Entity Recognition (NER) Using Automatic Summarization of Resumes
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extraction
 
Review of HR Recruitment Shortlisting
Review of HR Recruitment ShortlistingReview of HR Recruitment Shortlisting
Review of HR Recruitment Shortlisting
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 

More from IRJET Journal

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...IRJET Journal
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTUREIRJET Journal
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...IRJET Journal
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsIRJET Journal
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...IRJET Journal
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...IRJET Journal
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...IRJET Journal
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...IRJET Journal
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASIRJET Journal
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...IRJET Journal
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProIRJET Journal
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...IRJET Journal
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemIRJET Journal
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesIRJET Journal
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web applicationIRJET Journal
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...IRJET Journal
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.IRJET Journal
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...IRJET Journal
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignIRJET Journal
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...IRJET Journal
 

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 

Document Analyser Using Deep Learning

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 214 Document Analyser Using Deep Learning Pranali Dhawas1, Prajwal Kolhe2, Faizan Khan3, Lakshya Chauragade4, Apoorva Dhimole5 1 Assistant Professor, Department Of Artificial Intelligence, G H Raisoni College Of Engineering, Maharashtra, India. 2345B.Tech Artificial Intelligence, G.H. Raisoni College of Engineering, Maharashtra, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Many companies and massive organizationshavenumerousdocumentsinbulkandrequiredtokeepthem indifferent clusters. In recent years, this job has becoming time consuming as no of document and article has increased. The Analysis of the document is one of the subjective research technique which is used to reviewbytheanalystestimatetheidea. Withthehelpofsome visual technique we get the most of the layout and text format in extremely in proper output format. With thehelpofthe modelwe can sort most of the architecture document in large scale. With the help of Layout model and new strategies to interact with various layout in any format in a single model framework. We used our own data collected through out colleagues inuniversity to train it in specific manner that it identify and then classify marksheets and Certificates. Our model use only some of the modern techniques for the visual language task and also for the text images to find the some similar combinationandtaskwhichmake the simple way to detect the interaction between the different stages. 1. INTRODUCTION In today’s world there is a huge amount of text data, document digitization is a technique which is being used in various fields and domains that deals with huge amounts of archives. Document analyzer focuses on categorizing document on basis of text, image and layout of document, typically documents can be classified on differentlyon numerouscontext.Whileattemptingthe task of analysing text documents document classification is the important procedure to follow. But while doing document classification we have to deal with some challenges like high variability among the same document or class andlowvariability between different classes or documents. Previous studies have addressed the structural similarity between classes or documents. Studies has also focused on extracting characteristics to make each class separate. Researchers have developed various deep learning approaches to improve the performance of their document classifiers. In 2014 researcher trained and proposed a simple four layer CNN from starting. Then, we usetheImagenettoimprove thelearning rateofthenetwork towork in effective way on the model. Most recent research have shown the usage of OCR to extract the particular information and features to with multiple algorithms to classify, NLP was also used to boost the performance of these architecture. Our architecture uses image features, text extraction, and layout of document to overcome pervious defects and drawbacks with significant accuracy. 2. Literature Survey In Recent time the Nlp and CV became more popular techniques in this area and also make progress in different fields. • The devlin et al Explained the new language and rendering the model, which is used to see the the representation of bidirectional form the various data by the jointly the condition. • Barcelona Supercomputing Center – Upgrading the accuracy and edit classification image direct to the parallel system. • Linkbert - Training the language model by Using Document link. 3. Methodology Proposed solution in this paper “Document Analyzer” attempts to predict the class of a document by analysing the ssssimage content. To solve this challenge we have addressed three ways – Image classification problem, text classification problemand layout identification. We propose a multi-modal Transformer model to join the document in particular text, layout definition and visual data inside the pretraining stage, the cross-modal learns the interaction in a single framework. The class of document has been predicted according to various features like, the design of the document, header and footer of document, body or the matter of the document which gets extracted by the OCR techniques and how document is being formatted, all of these features help to find the exact class of the given document. But some type of documents also have common features for example government certificate have seal of the govt. and/or logo, which can help classify the documents.
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 215 Chart -1: Working Flow The LayoutLM architecture is divided into 3 section text embedding visual embedding and layoutembedding. In this of the embedding the common work is use to tokenize the ocr and the sequence of the text and giving some segment. The subword is one algorithm used in the Natural language processing. the sentences is used to check the character in the above mention input and to see the most common combination of the various character in the sentence. Image Processing -This module deals with basic image preprocessing steps. This intends to do basic preprocessing like image rescaling, image resizing and compression, morphological image preprocessing like Erosion andDilation. The output of this module will be similar pre-processed images. ScriptRecognition OR OCR -Optical character recognition is a technique to recognize textfromaimage.Aimofthismoduleisto extract the text this is also called as textraction. This module will give the output as the unlabeled textdata. Natural Language Preprocessing - The output from the last module which is text data will be pre-processed in this step. Aim of this moduleis to clean the text data from previous module, we have used several techniques like removal of stop words, stemming, lemmatization, this will further help in feature extraction. Feature Extraction -Feature extraction is the process in which we try to gain the features from the raw text data intonumerical format. Encoding -Encoding is also called as vectorization. This process deals with converting the text data into numericalvectors. Converting text data into high dimensional vector format. We can not feed the text data to the machine learning model so we convert the text data into numerical format. Classification -This module will help to classify the document on the basis of extracted features. We will use classification algorithmslike NB classifier,Logistic Regression, RandomForest, etc.thiswillgivetheoutputclassofthedocumentbyassigning one of the class from our available classes. 4. Dataset In order to train and evaluate document classifier, we have collected above mentioned documents from thestudents from different backgrounds and departments of G. H. Raisoni College of Engineering, Nagpur, Maharashtra.Eachclassofdocuments is used in intention to classify the correct details from document image while using OCR technique. Documents are collected from different state students. As these documents are from differentstates – thelayoutsofthedocumentsaredifferentforeach state in India. For the Document Analyzer we have used various classes to classify the document such as SSC Marksheet, HSC Marksheet, Cast Validity and Income Certificate.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 216 Fig-1: Sample Input Image. 5. Traning Accuracy Fig-2: Traning Accuracy And Traning Loss of the Model. 6. Software and Hardware 1.Category: Machine Learning, Deep Learning. 2.Programming Language: Python. 3.Tools & Libraries: CNN, CUDA. 4.IDE: Jupyter, Google Colab. 5.Prerequisites: Python, Machine Learning, Deep Learning, Neural Network
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 01 | Jan 2023 www.irjet.net p-ISSN: 2395-0072 © 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 217 7.Output As we have uploaded document as shown in the above input phase for the prediction of the document type. We got a desire prediction as shown in below snapshot. Fig-3: The Output of the Model 8. CONCLUSIONS We are able to Analyze Document using Pre-Data Training to the model. The model is able to provide a specific Document and information efficient summary. We gave a multi-model pretraining way for visual document understanding tasks. The Document analyze the categorizing document on basis of the text, image and layout of document, typically document can be classified on different numerous context also. We have compared the different types of document, Original as well as photocopy document while training period and detect the output accordingly. REFERENCES [1] Adam W. Harley, A. U. (2015). Evaluation of Deep Convolutional Nets for Document ImageClassification and Retrieval. Toronto, Ontario: Ryerson [2] N. Chen and D. Blostein. A survey of document image classification: Problem statement, classifierarchitecture and performance evaluation. IJDAR, 10(1):1–16, 2007. [3] K. Collins-Thompson and R. Nickolov. A clustering-based algorithm for automatic documentseparation.InSIGIR,pages1– 8, 2002.K. [4] Image and Text fusion for UPMC Food-101 using BERT and CNNs Ignazio Gallo, Gianmarco Ria,Nicola Landro, and Riccardo La Grassa [5] Aggregated Residual Transformations for Deep Neural Networks Saining Xie, Ross Girshick, PiotrDollár, Zhuowen Tu, Kaiming He [6] LinkBERT Pretraining Language Models with Document Links Michihiro Yasunaga Jure LeskovecPercy Liang Stanford University {myasu,jure,pliang}@cs.stanford.edu [7] C. P. Bean, “Fine particles, thin films and exchange anisotropy,” in Magnetism, vol. III, G.T. RadoandH.Suhl,Eds.NewYork: Academic, 1963, pp. 271–350. [8] Digitization of Soiled Historical Chinese Bamboo Scrolls. In Proceedings of the 13th IAPRInternational [9] Workshop on Document Analysis Systems, DAS, Vienna, Austria, 24– 27 April 2018; pp. 55–60.[CrossRef] [10] Mohammed, H.; Marthot-Santaniello, I.; Margner, V. GRK-Papyri: A Dataset of Greek Handwritingon [11] Papyri for the Task of Writer Identification. In Proceedings of the 2019 International Conference onDocument [12] Analysis and Recognition, ICDAR 2019, Sydney, Australia, 20–25 September 2019; pp. 726–731.[CrossRef]