A simple introduction to Natural Language Processing, with examples and a flowchart of how it works, covering Natural Language Understanding and Natural Language Generation activities.
Presentation on Optical Character Recognition: a technology based on the recognition of letters that works with both soft and hard copies, supports all soft-copy formats, and covers existing languages.
Character Recognition Using Neural Network Without Feature Extraction for Kan... (Editor, IJMTER)
Handwriting recognition has been one of the most active and challenging research areas in the field of pattern recognition. It has numerous applications, which include reading aids for the blind, bank cheques, and the conversion of any handwritten document into structured text form [1]. There is not yet a sufficient body of work on Indian-language character recognition, especially for the Kannada script, one of the 15 major scripts in India [2]. In this paper an attempt is made to recognize handwritten Kannada characters using feed-forward neural networks. A handwritten Kannada character is resized to 60x40 pixels, and the resized character is used for training the neural network. Once training is complete, the same character is given as input to the network with different numbers of neurons in the hidden layer, and the recognition accuracy rates for different Kannada characters are calculated and compared. The results show that the proposed system yields good recognition accuracy rates, comparable to those of other handwritten character recognition systems.
Natural Language Processing is a subfield of Artificial Intelligence and linguistics devoted to making computers understand the statements or words written by humans. In this seminar we discuss its issues, how it works, and more.
I have presented a PowerPoint presentation on the basics of optical character recognition, focusing on how OCR is used in the scanning process, whether it can be used for document scanning, and its uses.
A Study on Optical Character Recognition Techniques (ijcsitcejournal)
Optical Character Recognition (OCR) is the process that enables a system to identify, without human intervention, the scripts or alphabets in which users' communication is written. Optical character recognition has grown to be one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. In this survey we study the various OCR techniques, and we analyze and examine the theoretical and numerical models of optical character recognition. OCR and Magnetic Character Recognition (MCR) techniques are generally used for the recognition of patterns or alphabets. In general the characters come in the form of pixel images and may be either handwritten or printed, of any size, shape or orientation. In MCR, by contrast, the characters are printed with magnetic ink, and the reading machine categorizes each character on the basis of the unique magnetic field that it creates. Both MCR and OCR find use in banking and various commercial applications. Earlier research on optical character recognition has shown that handwritten text places no restriction on the writing technique. Handwritten text is difficult to recognize due to diverse human handwriting styles and variation in the angle, size and shape of the letters. An assortment of OCR approaches is discussed here, along with their performance.
The lecture presents the open-source project Tesseract, a free OCR engine written in C++, discusses its strong and weak sides, and explains how to train it for a new language. The demonstration materials are available at the author's blog: http://www.nakov.com/blog
A Vietnamese Language Model Based on Recurrent Neural Network (Viet-Trung TRAN)
Language modeling plays a critical role in many natural language processing (NLP) tasks such as text prediction, machine translation and speech recognition. Traditional statistical language models (e.g. n-gram models) can only offer words that have been seen before and cannot capture long word contexts. Neural language models provide a promising solution to surpass this shortcoming of statistical language models. This paper investigates Recurrent Neural Network (RNN) language models for Vietnamese, at the character and syllable levels. Experiments were conducted on a large dataset of 24M syllables, constructed from 1,500 movie subtitles. The experimental results show that our RNN-based language models yield reasonable performance on the movie subtitle dataset; concretely, our models outperform n-gram language models in terms of perplexity score.
Imago OCR: Open-source toolkit for chemical structure image recognition (Mikhail Rybalkin)
http://ggasoftware.com/opensource/imago
Presentation at the symposium "Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents" at the 244th ACS National Meeting & Exposition.
Introduction to .NET
.NET Architecture and factors
Code conversion in .NET
C# Language
Text-to-speech (TTS) converter
Steps for TTS Converter process
Architecture of TTS converter
Other features
Applications
Advantages
Limitations and future scope
Snapshots of the project
3. What is OCR?
Optical Character Recognition, usually abbreviated to OCR, is the conversion of scanned photos/images of typewritten or printed text into machine-encoded, computer-readable text.
Introduction to OCR
4. Types of OCR
The main types of OCR are briefly discussed below:
Handwritten Text OCR
Text produced by a person writing with a pen or pencil on a paper medium, which is then scanned into digital format using a scanner, is called handwritten text.
Machine Printed Text OCR
Machine-printed text is found commonly in daily use. It is produced by offset processes such as laser and inkjet printing, among others. This project comes under this category.
Introduction - Types of OCR
5. TAMIL (Tamil OCR using Multidimensional Interactive Learning model) is optical character recognition software that converts machine-printed text into editable text. It mainly targets the Tamil language and uses the Tesseract OCR engine.
The Tesseract OCR engine has been trained for the Tamil language, so this software can convert an image into Tamil text; it uses the Tamil tessdata files to map each character in the image to text. When recognizing a limited range of fonts (a single font, for instance), a single training page might be enough. Tesseract was originally designed to recognize English text only; efforts have been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.
TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE LEARNING MODEL
Abstract
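Once the Tamil trained data (tam.traineddata) is installed in Tesseract's tessdata directory, recognition is a single command-line invocation. A minimal sketch of how a wrapper might build that invocation (the file names are illustrative, not from the project):

```python
import subprocess

def tamil_ocr_command(image_path, output_base):
    """Build the Tesseract CLI invocation for Tamil text recognition.

    Tesseract writes the recognized text to output_base + ".txt";
    "-l tam" selects the Tamil trained data (tam.traineddata).
    """
    return ["tesseract", image_path, output_base, "-l", "tam"]

cmd = tamil_ocr_command("scanned_page.png", "scanned_page")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # run only when Tesseract and tam.traineddata are installed
```

The actual run is left commented out, since it requires a Tesseract installation with the Tamil language pack.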
6. Tesseract-OCR is the most widely used open-source OCR engine in the world. By default it supports English plus a few more languages, and it is a command-line tool. The Tesseract OCR engine is flexible in that it can be trained for any language. In this project an application is developed to train the OCR for the Tamil language; more training images and effort are needed to produce the training data. Since Tesseract-OCR is a command-line tool, it is rarely used by beginners, which limits its adoption. Here a GUI was developed for Tesseract-OCR, which makes the OCR easy for people to use.
Tamil OCR GUI is multidimensional and interactive. It uses the eSpeak TTS engine to convert Tamil text into voice, so it is of much benefit to blind people.
Abstract
TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE LEARNING MODEL
8. Requirements
MILE Lab Tamil language Text-To-Speech (TTS) Engine
Google Transliterations API
Google Speech Service
Tamil and English Spell Check Dictionaries
Microsoft Speech API (SAPI) 5.4
Tessdata with Tamil Trained Data file
Web browser Component
VC++ Runtime
Microsoft Office Interoperability Word Service
Tamil Fonts and Tamil Keyboard software
Windows operating system (from Windows XP to Windows 8)
9. Requirements
User Software Requirements:
VC++ Runtime 2010
.NET Framework 4.0
Hardware Requirements:
Minimum 260 MB of RAM, preferably 600 MB
700 MB of free hard disk space
Modem for internet connection
10. Software Requirements - Tesseract OCR Engine
The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++; since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.
Tesseract does not come with a GUI and is instead run from the command-line interface.
11. Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by the developers.
Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so inputting multi-column text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.
Software Requirements - Tesseract OCR Engine
12. Software Requirements - Tesseract OCR Engine
The initial versions of Tesseract could only recognize English-language text. Starting with version 2, Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.
13. Each box corresponds to a component that represents a single character. A box has an x- and y-coordinate, a width and a height, and the character that has been recognized in that region.
We then have to manually edit any mistakes made by Tesseract; perhaps all the characters were wrong because the font being used isn't very common. To make this job a lot easier, several tools called "Tesseract box editors" are available; the tool I used is jTessBoxEditor. jTessBoxEditor is written in Java and thus platform-independent, and it has merge and split options, which can be handy. After the mistakes are corrected, we can create the training file.
Software Requirements - Box Editor
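The box file that these editors manipulate is plain text: one line per character, giving the character followed by the left, bottom, right, and top pixel coordinates and a page number. A small sketch of a parser for this format (the sample coordinates are invented):

```python
def parse_box_line(line):
    """Parse one line of a Tesseract box file.

    Format: <char> <left> <bottom> <right> <top> <page>
    Coordinates are pixels measured from the bottom-left of the image.
    """
    parts = line.split()
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return {"char": char, "left": left, "bottom": bottom,
            "right": right, "top": top, "page": page,
            "width": right - left, "height": top - bottom}

box = parse_box_line("s 19 26 34 41 0")
print(box["char"], box["width"], box["height"])  # s 15 15
```

Width and height are derived here for convenience; the file itself stores only the corner coordinates.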
14. jTessBoxEditor is a box editor and trainer for Tesseract OCR, providing editing of box data in both the Tesseract 2.0x and 3.0x formats and full automation of Tesseract training. It can read images in common image formats, including multi-page TIFF. The program requires Java Runtime Environment 6.0 or later.
Software Requirements - jTessBoxEditor
15. Text-To-Speech Engine
Introduction to TTS
A text-to-speech, or speech synthesis, program is used to read a text out loud to a user; eSpeak can be used for this purpose. A TTS engine uses natural human voices to avoid sounding like a robot while reading a text to the user. There are also a few free TTS engines available online where the user can choose the voice they prefer the text to be read with. The one I used is the MILE Lab Tamil language Text-To-Speech engine.
Software Requirements - TTS
16. eSpeak
eSpeak is a compact open-source software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size, and it has also been used by Google Translate.
eSpeak is derived from the "Speak" speech synthesizer for British English on Acorn RISC OS computers, originally written in 1995 by Jonathan Duddington. A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages.
In my project eSpeak is used as the offline Tamil text-to-speech engine. It supports multi-language speech.
Software Requirements - eSpeak
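Assuming an eSpeak build that includes a Tamil voice ("ta"), a hedged sketch of how an application might construct the eSpeak invocation for offline Tamil speech (the function and defaults are illustrative, not the project's actual code):

```python
import subprocess

def espeak_command(text, voice="ta", speed=150, wav_out=None):
    """Build an eSpeak invocation: -v selects the voice, -s the speaking
    speed in words per minute, and -w writes a WAV file instead of
    playing the audio directly."""
    cmd = ["espeak", "-v", voice, "-s", str(speed)]
    if wav_out:
        cmd += ["-w", wav_out]
    cmd.append(text)
    return cmd

print(espeak_command("வணக்கம்", wav_out="hello.wav"))
# subprocess.run(...) only when eSpeak and the Tamil voice are installed
```

The actual call is left commented out since it requires an eSpeak installation.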
17. Microsoft Speech API (SAPI)
The Microsoft text-to-speech voices are speech synthesizers provided for use with applications that use the Microsoft Speech API (SAPI) or the Microsoft Speech Server platform. There are client and server versions of the Microsoft text-to-speech voices. SAPI is used to provide the interactive speech feature in the project.
Software Requirements - Microsoft Speech API (SAPI)
18. Hunspell
Hunspell is a spell checker and morphological analyser designed for languages with rich morphology and complex word compounding or character encoding; it was originally designed for the Hungarian language.
Hunspell is based on MySpell and is backward-compatible with MySpell dictionaries. While MySpell uses a single-byte character encoding, Hunspell can use Unicode UTF-8-encoded dictionaries.
The Hunspell spell checker needs two files:
1. A dictionary file with the .dic extension
2. An affix file with the .aff extension
Software Requirements - Hunspell
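For illustration only, both files are plain text with a simple layout. A minimal, hypothetical pair of files for a Tamil dictionary (file names and entries are invented): the first line of a .dic file is the entry count, and the .aff file must at least declare the character encoding.

```
ta_IN.dic:
    2
    வணக்கம்
    தமிழ்

ta_IN.aff:
    SET UTF-8
```

Real dictionaries attach affix-rule flags to entries; the fragment above shows only the bare structure.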
19. Ghostscript is a suite of software based on an interpreter for
Adobe Systems' PostScript and Portable Document Format
(PDF)page description languages. Its main purposes are the
rasterization or rendering of such page description language
files, for the display or printing of document pages, and the
conversion between PostScript and PDF files.
Ghostscript is open source; in this project it is used to convert
PDF files to images, and the PDF OCR feature relies on it.
Software Requirements - GhostScript
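A minimal sketch of the PDF-to-image step, assuming the `gs` binary is on the PATH and using standard Ghostscript flags (the output pattern and DPI are illustrative defaults):

```python
import subprocess

def pdf_to_images(pdf_path, out_pattern="page-%03d.png", dpi=300):
    """Build a Ghostscript command that rasterizes each PDF page
    to a PNG image at the given resolution."""
    cmd = ["gs", "-dBATCH", "-dNOPAUSE",      # run non-interactively
           "-sDEVICE=png16m",                 # 24-bit PNG output device
           f"-r{dpi}",                        # rasterization resolution
           f"-sOutputFile={out_pattern}",     # one file per page
           pdf_path]
    return cmd

# To run: subprocess.run(pdf_to_images("input.pdf"), check=True)
```

Each resulting page image can then be fed to the OCR engine like any other input image.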
20. Existing System
The existing Tesseract-OCR supports English by default and also
supports languages such as Dutch, Spanish, Italian, French, and
German. All of these languages are already trained for Tesseract,
but languages like Tamil and Malayalam are not well trained. There
is no GUI available for Tesseract in Tamil, and training Tesseract
is a large task that even intermediate users find complex. Since
the Tesseract OCR engine is a command-line tool, OCR sees much
less use.
Analysis
21. Some OCR GUIs have been built using the Tesseract OCR engine, but
they do not have much support for the Tamil language.
Some GUI tools are listed below.
VietOCR
Tesseract-OCR QT4 gui
Lime OCR
Few Online Services:
CustomOCR
Free OCR
i2OCR (supports the Tamil language, but with very low accuracy)
Analysis-Existing System
22. The proposed system
The proposed system is GUI-based; its GUI is named "Tamil OCR GUI".
It is trained for the Tamil language using the Tesseract training
procedures, and it will be beneficial both for blind people and for
sighted users.
The inputs to the OCR-Engines are:
Sample Tamil Training Images
Data Files
Tamil Dictionary
Final Tamil trained data
The programming language VB.NET is used because it satisfies all the
requirements: extensibility, simplicity, interoperability, portability,
powerful data structures, and Unicode support. The GUI was developed
in VB.NET as a multithreaded application.
Analysis- proposed system
23. The GUI in this project consists of a web browser and a dedicated
OCR GUI; the browser captures web-page images and extracts all of
their Tamil and English text using the Tamil OCR component.
Proposed System has support of following language:
1. Tamil
2. English
Users can switch between these languages and can easily OCR Tamil
and English images using the proposed system.
Analysis- proposed system
24. The Tesseract OCR engine processes a page as follows:
In the first stage, outlines of the text are gathered, by nesting, into blobs. These blobs are
organized into text lines, which are broken into words according to the kind of character
spacing. The lines and regions are then analysed for fixed-pitch or proportional text:
fixed-pitch text is chopped immediately into character cells, while proportional text is
broken into words using definite spaces and fuzzy spaces. Recognition then proceeds as a
two-pass process.
In the first pass, the engine attempts to recognize each word in turn; each word that is
recognized satisfactorily is passed to an adaptive classifier as training data. The adaptive
classifier then gets a chance to recognize text more accurately. The second pass is run over
the page for the words that were not recognized well enough in the first pass.
Finally, a last phase resolves fuzzy spaces and checks alternative hypotheses for the
x-height to locate small-cap text. We follow this architecture of the Tesseract OCR engine
to recognize Tamil characters. To train Tesseract for Tamil script recognition, I followed
the complete training procedure specified.
process of Tesseract OCR Engine:
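The recognition run itself reduces to a single Tesseract command line; a hedged wrapper might look like this (assumes `tesseract` is on PATH and that `tam.traineddata` is installed in tessdata):

```python
import subprocess

def tesseract_cmd(image_path, out_base, lang="tam"):
    """Build the command line for one Tesseract OCR run.

    Output text is written to <out_base>.txt; lang selects the
    trained data set ("tam+eng" mixes Tamil and English).
    """
    return ["tesseract", image_path, out_base, "-l", lang]

# To run: subprocess.run(tesseract_cmd("page.tif", "page", lang="tam+eng"), check=True)
```

Keeping the command construction separate from execution makes it easy for a GUI to show the user what will be run.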
25. Tamil OCR Component:
Tesseract OCR engine trained for the Tamil language
Image Extractor
Image Reader
Output Manager
Output formatter
Text Speaker
Analysis-DFD
27. DATA FILES REQUIRED
To train the Tamil language (lang = tam), eight data files have to be
created in the tessdata subdirectory. The eight files are:
1. tessdata/tam.freq-dawg
2. tessdata/tam.word-dawg
3. tessdata/tam.user-words
4. tessdata/tam.inttemp
5. tessdata/tam.normproto
6. tessdata/tam.pffmtable
7. tessdata/tam.unicharset
8. tessdata/tam.DangAmbigs
Analysis- Training Procedure:
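A sketch of the Tesseract 3.x command sequence that produces several of these files (font names, file names, and the word-list arguments are illustrative, not the project's exact invocations):

```python
# Each entry is one command in the training pipeline; the comment
# names the data file(s) it produces.
steps = [
    ["unicharset_extractor", "tam.font.exp0.box"],           # -> unicharset
    ["mftraining", "-F", "font_properties",
     "-U", "unicharset", "tam.font.exp0.tr"],                # -> inttemp, pffmtable
    ["cntraining", "tam.font.exp0.tr"],                      # -> normproto
    ["wordlist2dawg", "tam.wordlist",
     "tam.word-dawg", "tam.unicharset"],                     # -> word-dawg
    ["combine_tessdata", "tam."],                            # -> tam.traineddata
]

for cmd in steps:
    print(" ".join(cmd))
```

The outputs are renamed with the `tam.` prefix before the final `combine_tessdata` step packs them into a single trained-data file.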
28. Analysis- Web OCR Component
Fig: Architecture of the Web OCR Component — the webpage (web browser)
feeds the Web GUI page; images flow from the Image Extractor through
the Image Reader to the Tesseract Tamil-trained OCR engine (backed by
the Tamil trained dataset), then to the Output Manager and Output
Formatter.
29. 1. The Web OCR component extracts all pictures from the web page and
saves them in a single folder called 'OCR Image' (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the
Tesseract engine.
3. The Tesseract engine processes the images and extracts only the Tamil
and English text.
4. The output of the Tesseract engine is taken by the Output Manager,
which keeps the text and images together.
5. The Output Formatter puts the images and their corresponding extracted
text into a Word document.
6. If the user moves over any image, steps 1 to 4 are carried out and the
Text Speaker component reads the text out to the user.
Analysis- Web OCR Component
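Steps 2-4 above can be sketched as a small loop (the `ocr` callable stands in for the Tesseract wrapper; all names here are illustrative):

```python
def web_ocr_pipeline(image_names, ocr):
    """Image Reader feeds each grabbed image to an OCR callable and
    the Output Manager keeps the text paired with its source image."""
    output = {}
    for name in image_names:
        output[name] = ocr(name)  # Tesseract engine extracts the text
    return output

# e.g. web_ocr_pipeline(["a.png", "b.png"], ocr=lambda n: "text of " + n)
```

Injecting the OCR step as a callable keeps the pipeline testable without the Tesseract binary installed.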
30. The Web OCR feature needs a web browser; it extracts all the images
from a given page and runs OCR on them. Web OCR has the
following tools:
Image Grabber
Link Grabber
Web Page to Image Converter
Web Page to PDF Converter
Screen Shotter
All of the above tools are needed for Web OCR. The Image Grabber
extracts all images and saves them to a user-specified location;
all extracted images are saved to a folder.
Analysis- Web OCR Component
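A minimal sketch of the Image Grabber idea using Python's standard html.parser (the real tool also downloads each file, which is omitted here):

```python
from html.parser import HTMLParser

class ImageGrabber(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

grabber = ImageGrabber()
grabber.feed('<p><img src="a.png"><img src="b.jpg" alt=""></p>')
print(grabber.images)   # → ['a.png', 'b.jpg']
```

Each collected URL would then be fetched and saved into the user's chosen folder before OCR begins.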
32. One component of the project is Web OCR. Since a better GUI is
needed, it is good to use a web browser, because that is the
software where most users and image files are found. This browser
module has low priority in this project, so not all of the features
mentioned below may be implemented; a simple browser with the
required features is enough. Developing the browser requires the
following technologies and concepts:
VB.NET
Application programming interfaces (APIs)
Multithreading
Voice library (voice recognition)
Analysis- Web OCR Component
33. Analysis- Tamil OCR GUI
Fig: Flow of the Tamil OCR GUI — Input Image → Preprocessing →
Tesseract engine (Tamil trained dataset) → Output Text →
Post-processing (spell checker backed by the Tamil dictionary and
Tamil data files) → Accurate Output Text.
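The post-processing spell-check step in the flow above can be sketched as a dictionary lookup (illustrative only; the project delegates real checking to Hunspell):

```python
def flag_misspellings(words, dictionary):
    """Return the OCR output words that are not in the dictionary,
    so the spell checker can offer corrections for them."""
    known = {w.lower() for w in dictionary}
    return [w for w in words if w.lower() not in known]

print(flag_misspellings(["cat", "catt"], ["cat", "dog"]))   # → ['catt']
```

Flagged words are exactly the candidates the corrector must repair before the text is presented as "Accurate Output Text".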
34. Analysis- Tamil OCR GUI
Input Image → Pre-process → Tesseract OCR engine (Tamil and English
trained dataset) → Post-process → Output Text, via the Tamil OCR GUI.
Output Formats:
1. PDF
2. MS Word
3. RTF
4. XML
5. WAV
6. MP3
7. HTML
8. Text
9. Single Web Page
Fig: Block Diagram of Full Image OCR
35. Analysis- Tamil OCR GUI
Selected Image Region → Pre-process → Tesseract OCR engine (Tamil and
English trained dataset) → Post-process → Output Text, via the Tamil
OCR GUI.
Fig: Block Diagram of Region Image OCR
38. Analysis- Pre-processing
Input Image → Preprocess Manager: convert to black and white, increase
DPI (max 300), remove background, remove inner images → Preprocessed
Image.
Fig: Architecture of Pre-processor
Pre-processing is an optional step in Tamil OCR. It improves image
quality so that the OCR process is more accurate.
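The "convert to black and white" step can be sketched as simple thresholding (pure-Python and illustrative; the project applies this to real bitmaps):

```python
def to_black_and_white(pixels, threshold=128):
    """Binarize a grayscale image given as a 2-D list of values
    0-255: dark pixels become 0 (black), the rest 255 (white)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

print(to_black_and_white([[30, 200], [140, 90]]))
# → [[0, 255], [255, 0]]
```

Binarization removes background shading, which is one of the main ways pre-processing raises OCR accuracy.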
39. The system is divided into functional modules that are
independent of one another. This project has the following
modules:
1) Tamil Training to Tesseract
2) Web OCR Component
3) OCR GUI
Design - Modularity
40. Introduction
The system is designed in two phase:
Preliminary System Design
Detailed System Design
Coupling is the measure of relative interdependence among modules. In the
Tamil OCR project, coupling is very loose: any module can be added without
affecting other modules, or with only little modification to the interface
calls. Coupling depends on the interface complexity between modules, the
point at which entry or reference is made to a module, and what data pass
across the interface.
Cohesion is the measure of the relative functional strength of an individual
module. In the Tamil OCR project, cohesion is very high: each module performs
a single task within a software procedure, requiring little interaction with
procedures performed in other parts of the program.
Design
41. 1.Training Tesseract-OCR Task
Tesseract 3.02 is fully trainable. Tesseract 2.0 accepted only TIFF
images, but the latest version includes Leptonica and automatically
converts any image to TIFF format.
This page describes the training process, provides some guidelines on
its applicability to various languages, and explains what to expect
from the results. Tesseract was originally designed to recognize
English text only; efforts have been made to modify the engine and its
training system so that they can deal with other languages and UTF-8
characters.
Design - Modularity
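Training starts by generating a box file for each training image; in Tesseract 3.x this is done with the makebox config (the file and font names here are illustrative):

```python
# Command that writes bounding boxes for each character in the
# training image, instead of recognized text, producing
# tam.font.exp0.box for manual correction.
cmd = ["tesseract", "tam.font.exp0.tif", "tam.font.exp0",
       "batch.nochop", "makebox"]
print(" ".join(cmd))
```

The generated .box file is then edited so each box is labeled with the correct Tamil character before the later training steps run.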
42. 2.Web Tamil OCR-Component Development
1. The Web Tamil OCR component extracts all pictures from the web page and
saves them in a single folder called 'OCR Image' (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the
Tesseract engine.
3. The Tesseract engine processes the images and extracts only the Tamil
and English text.
4. The output of the Tesseract engine is taken by the Output Manager,
which keeps the text and images together.
5. The Output Formatter puts the images and their corresponding extracted
text into a Word document.
6. If the user moves over any image, steps 1 to 4 are carried out and the
Text Speaker component reads the text out to the user.
Design - Modularity
43. 3 .Web browser
The main component of the project is Tamil OCR. Since a better GUI is
needed, it is good to use a web browser, because that is the software
most users already have. This browser module has low priority in this
project. Tesseract-OCR is a command-line tool; the project GUI,
developed in .NET, lets users work through a graphical user interface
instead. The project targets the development of a browser component
that extracts Tamil characters from images.
The user interface for the OCR GUI contains an Open button to select
the image file (TIFF format), a Recognize button that converts the
image to editable text, a Font button that selects the font type for
the output, and a Preferences window that holds all preferences. The
web browser uses the Tamil OCR component to extract all image text in
Tamil and English.
Design - Modularity
44. Contd. (Web browser)
The software interfaces used in the project are built in VB.NET,
which makes it easy to create programs with a graphical user
interface. The project does not use any particular protocol, so
there may be no need for any specified communication interfaces.
Design - Modularity
45. 4 .OCR GUI
The OCR GUI allows the user to perform various OCR functions more
easily in Tamil. This module provides the following features to the user:
1. Whole Image OCR
2. Image Region OCR
3. Snap Shot OCR
4. Snipping Screen OCR
5. Batch OCR
6. Various Image Processing functionality
7. Good Tamil Text Editor
8. Easy Tamil transliteration typing (phonetic-style)
9. Good Tamil Text to Speech
Design - Modularity
46. Contd. (OCR GUI)
Design - Modularity
10. Image Convertor
11. PDF OCR
12. Various Output Format
13. Various Input Acceptance
14. Easy Typing in Other languages
15. Email Sender
16. Spell checking
17. Snipping and Saving
18. Combined OCR document
19. OCR report
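Feature 8 above, phonetic transliteration typing, can be sketched as greedy longest-match substitution (the mapping shown is a tiny illustrative sample, not the project's actual transliteration scheme):

```python
def transliterate(text, table):
    """Replace phonetic sequences with Tamil letters, preferring the
    longest matching key at each position; unmapped characters pass
    through unchanged."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(text[i])   # no mapping: keep the character as-is
            i += 1
    return "".join(out)

print(transliterate("kama", {"ka": "க", "ma": "ம"}))   # → கம
```

Longest-match-first matters because many schemes map both "k" and "ka" (or "a" and "aa") to different letters.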
79. Tesseract-OCR for Tamil language
Increases the number of Tesseract users
The GUI encourages more people to migrate towards open source
It makes it easy for people to work with Tesseract without prior
knowledge of it
The web browser can make use of the Tamil OCR component
It is especially beneficial for blind people
Advantages of Project
80. The applications of Tesseract-OCR include:
Newspaper industries
Book publishers
Digital expertise industries
Web browsers
Web service creation
Web applications
The general public
Advantages of Project
81. FUTURE ENHANCEMENT
Recognition accuracy can be improved with more training
AI applied to make an intelligent snapshot reader
Image content search
A more accurate and powerful spell checker and corrector
A more powerful web OCR
A web service
An online web application
Handwriting recognition
Browser extensions
More powerful image preprocessing
Future Enhancement
82. The expected result of the project is that Tesseract-OCR is
trained for the Tamil language; further training can then be applied
so that accuracy increases.
This project can increase the number of Tesseract OCR users among
Tamil speakers and encourage much of the Tamil press to migrate
towards free and open-source software.
The GUI for Tesseract will make it easy for everyone who has no
knowledge of Tesseract to use it properly.
Conclusion
83. The Tamil OCR project was a great opportunity for me to experience how software is
built, to learn about research, and to bring my two years of academics into practice. The
four-month project period was very useful for learning industry-specific practices and
standards. I am confident that this experience will be a great boost and help for my career
ahead.
The period under Dr. K. S. Kuppusamy was a great learning experience with very helpful
project supervisors. Their evaluation and guidance helped me learn and understand new
techniques and methodologies. Overall, it was a great experience and will benefit me
greatly in the future.
The biggest advantage of Tesseract OCR is its availability as open code; anybody with the
interest to study its working procedure, and the skill to improve it, can train it for a new
language. In this document, I presented the step-by-step procedure to train the Tesseract
engine for printed Tamil text. First we trained Tesseract on a particular font of English
that had not been supported earlier, by performing a series of tests. We then trained
Tesseract to recognize the Tamil character set and observed the results. As editing the
box file manually is a cumbersome task (the language has a large character set), we tried
to generate the box file automatically. We were also able to detect the vowels and
consonants of the Tamil character set; however, Tesseract still needs to be trained in the
future for the dependent modifiers and other characters that occur in Tamil documents.
Conclusion
84. 1. The Tesseract-OCR Google project, developed by Ray Smith, can be
accessed at http://code.google.com/p/tesseract-OCR/
2. Training Tesseract-OCR for Indic languages:
http://code.google.com/p/tesseract-OCR/wiki/TrainingTesseract
3. The Tesseract Indic home page, developed and maintained by
Debayan Banerjee: http://code.google.com/p/tesseractindic/
REFERENCES