Image recognition
A MINOR PROJECT REPORT
On
IMAGE RECOGNITION
Submitted in partial fulfilment of the requirement for the award of the degree of
B. TECH
in
COMPUTER SCIENCE AND ENGINEERING
Submitted By
BHASKAR TRIPATHI :RA1611003040016
JOEL JOSE : RA1611003040128
DEPT. OF COMPUTER SCIENCE & ENGINEERING
SRM Institute of Science & Technology
Vadapalani Campus, Chennai
OCTOBER 2018
BONAFIDE CERTIFICATE
Certified that this project report “IMAGE RECOGNITION” is the bonafide work of “BHASKAR
TRIPATHI ASHWINI KUMAR and JOEL JOSE”, who carried out the project work under my
supervision.
SIGNATURE OF THE GUIDE
Dr. P. Mohamed Fathimal, B.E., M.E., Ph.D.
Assistant Professor
Department of Computer Science and Engineering
SRM Institute of Science & Technology
Vadapalani Campus

SIGNATURE OF THE HOD
Dr. S. Prasanna Devi, B.E., M.E., Ph.D., PGDHRM, PDF (IISc)
Professor
Department of Computer Science and Engineering
SRM Institute of Science & Technology
Vadapalani Campus
ACKNOWLEDGEMENT
It is our privilege to express our sincerest regards to our project coordinator, Dr. P. Mohamed Fathimal, for
her valuable inputs, able guidance, encouragement, whole-hearted cooperation and constructive criticism
throughout the duration of our project.
We deeply express our sincere thanks to our Head of Department, Dr. S. Prasanna Devi, for encouraging
and allowing us to present the project on the topic “IMAGE RECOGNITION” at our department
premises for the partial fulfillment of the requirements leading to the award of the B.Tech degree.
We take this opportunity to thank all our faculty members, our Dean, Dr. K. Duraivelu, and the Management, who
have directly or indirectly helped our project. Last but not least, we express our thanks to our friends
for their cooperation and support.
TABLE OF CONTENTS
ABSTRACT
CHAPTER 1
Introduction
1.1 Introduction of Project
1.2 Overview
CHAPTER 2
About Project
2.1 Purpose
2.2 Project Scope
2.3 Existing System
2.4 Drawback of Existing System
2.5 Proposed System
2.6 Benefits of Proposed System
2.7 System Specifications
CHAPTER 3
3.1 Tools And Technology
3.2 Architecture Of Proposed System
3.3 Problem Statement
3.4 Modules and their Functionalities
CHAPTER 4
System Study
4.1 Data Flow Diagram
4.2 UML Diagram
CHAPTER 5
Modules
5.1 Main Modules
5.2 Face Detection
5.3 Feature Extraction
5.4 Recognition
CHAPTER 6
System Testing
CHAPTER 7
Source Code
7.1 Text Recognition
7.2 Face Recognition
7.3 Landmark Detection
7.4 Label Detection
CHAPTER 8
Screenshots
CHAPTER 9
Conclusion
REFERENCES
LIST OF FIGURES
Figures
3.2 Architecture of proposed system
4.1 Proposed model for real-time classification
4.2.1 Use-Case Diagram for Document Editing
4.2.2 Class Diagram
4.2.3 Sequence Diagram for Processing
4.2.3.a Sequence Diagram for Training
4.2.4 Sequence Diagram for Recognition
4.2.5 Sequence Diagram for Editing
6.1 Normalized confusion matrix of our mini-Xception network
6.2 Results of the real-time emotion classification provided in our public repository
Screenshots
ABSTRACT
In this paper we propose and implement a general convolutional neural network (CNN)
building framework for designing real-time CNNs. We validate our models by creating a
real-time vision system which accomplishes the tasks of face detection, gender classification and
emotion classification simultaneously in one blended step using our proposed CNN
architecture. After presenting the details of the training procedure setup, we proceed to
evaluate the models on standard benchmark datasets. We report accuracies of 96% on the IMDB gender
dataset and 66% on the FER-2013 emotion dataset. Along with this, we also incorporate the
recently introduced real-time guided back-propagation visualization technique. Guided back-
propagation uncovers the dynamics of the weight changes and evaluates the learned features.
We argue that the careful implementation of modern CNN architectures, the use of current
regularization methods and the visualization of previously hidden features are necessary in
order to reduce the gap between slow-performing and real-time architectures. All our code,
demos and pretrained architectures have been released under an open-source license in our
public repository.
CHAPTER-1
INTRODUCTION
In today's world, there is a growing demand for software systems that can recognize images, for example when
information is scanned through the Google Vision API. Image recognition, in the context of machine vision, is the
ability of software to identify objects, places, people, writing and actions in images. Computers can use
machine vision technologies in combination with a camera and artificial intelligence software to achieve
image recognition. While human and animal brains recognize objects with ease, computers have difficulty
with the task. Image recognition software is used to overcome this problem. Current and future
applications of image recognition include smart photo libraries, targeted advertising, the interactivity of media,
accessibility for the visually impaired and enhanced research capabilities. Google, Facebook, Microsoft
and Apple are some of the major companies which use this technology. Facebook can now perform face
recognition at 98% accuracy, which is comparable to the ability of humans, and can identify a friend's
face with only a few tagged pictures. The efficacy of this technology depends on the ability to classify images.
Classification is pattern matching with data. Images are data in the form of 2-dimensional matrices. In fact,
image recognition is classifying data into one category out of many. One common and important example
is optical character recognition (OCR). OCR converts images of typed or handwritten text into machine-
encoded text. The major steps in the image recognition process are gathering and organizing data, building a
predictive model and using it to recognize images.
Furthermore, the human accuracy for classifying an image of a face into one of 7 different emotions is 65% ± 5%.
One can observe the difficulty of this task by trying to manually classify the FER-2013 dataset images
within the following classes: {“angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, “neutral”}.
Gender classification was first perceived as an issue in psychophysical studies; it focuses on the efforts of
understanding human visual processing and identifying key features used to categorize between male and
female individuals [1]. Research has shown that the disparity between facial masculinity and femininity can
be utilized to improve performances of face recognition applications in biometrics, human–computer
interactions, surveillance, and computer vision. However, in a real-world environment, the challenge is how to
deal with the facial image being affected by the variance in factors such as illumination, pose, facial
expression, occlusion, background information, and noise dependent on the type of classifier chosen, which is
in turn dependent on the feature extraction method applied.
It is difficult to find a classifier that combines best with the chosen feature extractor such that an optimal
classification performance is achieved. Any changes to the problem domain require a complete redesign of the
system. The convolutional neural network (CNN) is a neural network variant that consists of a number of
convolutional layers alternating with subsampling layers and ends with one or more fully connected layers in
the standard multilayer perceptron (MLP). A significant advantage of the CNN over conventional approaches
in pattern recognition is its ability to simultaneously extract features, reduce data dimensionality, and classify
in one network structure. Such a structure, as illustrated in Figure 1, can boost recognition accuracy efficiently
and cost-effectively.
This is then also the challenge in the development of a robust face-based gender classification system that has
high classification accuracy and real-time performance. The conventional approach applied in face recognition,
including face-based gender recognition, typically involves the stages of image acquisition and processing,
dimensionality reduction, feature extraction, and classification, in that order. Prior knowledge of the
application domain is required to determine the best feature extractor to design.
CHAPTER-2
ABOUT PROJECT
2.1 PURPOSE
The main purpose of the Image Recognition system based on a grid infrastructure is to perform image analysis and
document processing of electronic document formats converted from paper formats more effectively and
efficiently. This improves the accuracy of recognizing the characters during document processing compared
to various existing character recognition methods. The primary objective is to speed up the process
of character recognition in document processing. As a result, the system can process a huge number of
documents in less time and hence saves time. This application can be used as a quick search engine
which does not need any lengthy typing for searching and thus quickens our work. Since our image
recognition is based on a grid infrastructure, it aims to recognize multiple images and characters that belong to
different universal languages with different properties, font properties and alignments.
2.2 PROJECT SCOPE
The scope of our project, Image Recognition on a grid infrastructure, is to provide an efficient and
enhanced software tool for users to perform document image analysis and document processing
by reading and recognizing the characters in research, academic, governmental and business
organizations that have a large pool of documented, scanned images. Irrespective of the size
of documents and the type of characters in documents, the product recognizes, searches
and processes them faster according to the needs of the environment.
2.3 EXISTING SYSTEM
There is a growing demand from users to convert images and printed documents in order to
identify the content within them and process it to understand what it means. Hence the Google Lens system
was invented to convert the images and data available on paper into computer-processable documents and
images. Google Lens is an image recognition mobile app developed by Google. First announced
during Google I/O 2017, it is designed to bring up relevant information using visual analysis. When directing
the phone's camera at an object, Google Lens will attempt to identify the object or read labels and text and
show relevant search results and information. For example, when pointing the device's camera at a Wi-Fi label
containing the network name and password, it will automatically connect to the Wi-Fi source that has been
scanned. Lens is also integrated with the Google Photos and Google Assistant apps. The service is similar
to Google Goggles, a previous app that functioned similarly but with less capability. Lens uses more
advanced deep learning routines, similar to other apps like Bixby Vision (for Samsung devices released 2016
and after) and Image Analysis Toolset (available on Google Play); artificial neural networks are used to detect
and identify objects and landmarks and to improve optical character recognition (OCR) accuracy.
2.4 DRAWBACK OF EXISTING SYSTEM
The drawbacks of Google Lens include limited device support, although it is not clear
which devices are not supported or why. It requires Android Marshmallow (6.0) or newer. It is also not
available in India.
2.5 PROPOSED SYSTEM
Our proposed system is based on a grid infrastructure: an image and character recognition
system that supports recognition of images and characters. This feature is what we call grid
infrastructure, which eliminates the problem of heterogeneous character recognition and supports
multiple functionalities to be performed on the documents and images.
2.6 BENEFITS OF PROPOSED SYSTEM
The benefit of the proposed system that overcomes the drawback of the existing system is that it supports a mobile
application. Image recognition with the Google Vision API and Google Lens identifies famous
personalities, animals, and the actions being performed within the image. Text understanding and text retrieval are
used to extract images and text from the surroundings and identify them with the help of Google Lens and the
Vision API.
2.7 SYSTEM SPECIFICATIONS
Hardware requirements:
1. System : Snapdragon 410
2. Hard disk : 8 GB
3. Floppy drive : Not required
4. Monitor : Any
5. RAM : 512 MB
Software requirements:
● Operating system : Android
● Coding language : Java & XML
● Database : Not required
● API level : 21
● SDK version : 3.2.1
CHAPTER 3
3.1 TOOLS AND TECHNOLOGY
3.1.1 Android Studio
Android Studio is the official integrated development environment (IDE) for Google's Android operating
system, built on JetBrains' IntelliJ IDEA software and designed specifically for Android development. It is
available for download on Windows, macOS and Linux based operating systems. It is a replacement for
the Eclipse Android Development Tools (ADT) as the primary IDE for native Android application
development.
Android Studio was announced on May 16, 2013 at the Google I/O conference. It was in early access preview
stage starting from version 0.1 in May 2013, then entered beta stage starting from version 0.8, which was
released in June 2014. The first stable build was released in December 2014, starting from version 1.0. The
current stable version is 3.2.1, which was released in October 2018.
Features
Gradle-based build support
Android-specific refactoring and quick fixes
Lint tools to catch performance, usability, version compatibility and other problems
ProGuard integration and app-signing capabilities
Template-based wizards to create common Android designs and components
A rich layout editor that allows users to drag-and-drop UI components, with an option to preview layouts on
multiple screen configurations
Support for building Android Wear apps
Built-in support for Google Cloud Platform, enabling integration with Firebase Cloud Messaging (earlier
Google Cloud Messaging)
Android Virtual Device (Emulator) to run and debug apps in Android Studio
Android Studio supports the same programming languages as IntelliJ IDEA and CLion, e.g. Java and C++,
and Android Studio 3.0 or later supports Kotlin, all Java 7 language features, and a
subset of Java 8 language features that varies by platform version. External projects backport some Java 9
features.
3.1.2 Google Vision API
The Google Vision API allows developers to easily integrate vision detection features within applications,
including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of
explicit content. Cloud AutoML Vision enables you to create a custom machine learning model for image
labeling.
How to use the Google Vision API?
Computers can, in a sense, see, hear, feel, smell, and taste. One of the ways your code can
“see” is with the Google Vision API. The Google Vision API connects your code to Google's image recognition
capabilities. You can think of Google Image Search as a kind of API/REST interface to images.google.com,
but the Vision API does much more than show you similar images.
Google Vision can detect whether you're a cat or a human, as well as the parts of your face. It tries to detect
whether you're posed or doing something that would not be okay for Google Safe Search. It even tries
to detect whether you're happy or sad.
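For illustration only, the sketch below performs the same face detection request using the Python client library (google-cloud-vision, v2 or later); the project itself uses the Java client, whose full source is given in Chapter 7. The file name is a placeholder, and the client assumes the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a valid service-account key.

from google.cloud import vision

def describe_faces(path):
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    # Same FACE_DETECTION feature used in the Java code of Chapter 7
    response = client.face_detection(image=image)
    for face in response.face_annotations:
        # Likelihood values range from VERY_UNLIKELY to VERY_LIKELY
        print("joy:", face.joy_likelihood,
              "sorrow:", face.sorrow_likelihood,
              "anger:", face.anger_likelihood)

describe_faces("group_photo.jpg")   # placeholder image path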
3.2 ARCHITECTURE OF PROPOSED SYSTEM
The architecture of the optical character recognition system on a grid infrastructure consists of
three main components. They are:-
Scanner
OCR Hardware or Software
Output Interface
3.3 PROBLEM STATEMENT
The problem here is for software systems to recognize characters in a computer system when
information is scanned from paper documents, since a large number of newspapers
and books related to different subjects exist only in printed format. Whenever we scan
documents through the scanner, the documents are stored as images such as jpeg, gif etc. in the
computer system. These images cannot be read or edited by the user. To reuse this
information it is very difficult to read the individual contents and to search the contents of
these documents line-by-line and word-by-word. These days there is a huge demand for “storing
the information available in these paper documents in a computer storage disk and then later
editing or reusing this information by searching process”.
3.4 MODULES AND THEIR FUNCTIONALITIES
Our software system, Optical Character Recognition on a grid infrastructure, can be divided
into five modules based on its functionality. The modules are as follows:-
● Document Processing Module
● System Training Module.
● Document Recognition Module.
● Document Editing Module and
● Document Searching Module.
3.4.1 DOCUMENT PROCESSING MODULE
This module is accessed by the administrator, whose role in our application is a librarian. This module
performs certain activities such as scanning documents, storing them as images, and recognizing
characters in images to transfer them into word format. During the recognition process, this
module uses the OCR methodology in support of the grid infrastructure data structure. The module
supports the following services:-
Scanning printed documents.
Storing the documents as snapshots or images.
Processing those image-based documents.
Converting these image-based documents into e-documents (also called structured
documents).
Recognizing the characters in documents.
Generating the grid infrastructure data structure.
3.4.2 DOCUMENT RECOGNITION MODULE
This module can be accessed by both the administrator and the end-user. Once the printed
documents are converted into structured documents, any user can recognize the characters present
in the document. That means the user can recognize the characters of any language he chooses
which makes OCR more flexible. This flexibility is due to the adaptation of grid infrastructure.
This is the module where the main functionality of OCR is tested.
Under this module, there are two types of recognition: handwritten recognition
and scanned document recognition.
In handwritten recognition, the handwriting of the user in any language is taught to the system
only the first time. From then onwards, the system recognizes the characters or words
written by the user. Thus handwritten document recognition recognizes human handwriting.
In scanned document recognition, the system is first trained with the font characters in the
document in the training module itself. Then in the recognition module, the system takes the
scanned document image as an input file, first crops the image and then extracts/recognizes the
characters from the document and makes these documents editable and searchable. Thus
scanned document recognition recognizes the characters from the scanned document image and
makes the document editable and searchable. Hence the document recognition module as a whole
supports the following services:-
Converts the document into a specific format
Recognizes the characters
Heterogeneous character recognition
3.4.3 DOCUMENT EDITING MODULE
This module can be accessed by both the administrator and the end-user during document editing
to implement the character recognition process. Once the scanned documents are stored, they
reside in computer memory. This data resides in the form of an image that is only viewable in an
image viewer. Hence, the document is first converted into a form in which it is editable. The
desired form of the document may be MS-Word, Text, … as specified by the user. The objective of
this module is to let the user perform:-
Addition of specific content to the documents
Deletion of certain content from documents
Any other modification of documents.
3.4.4 DOCUMENT SEARCHING MODULE
This module can be accessed by both the administrator and the end-user during the search for a
required document, to implement the character recognition process on it. The user requests
the system to search for a particular document. The system then finds the documents based on
the OCR methodology and returns the result of the search to the user.
CHAPTER 4
SYSTEM STUDY
We propose two models which we evaluated according to their test accuracy and number of parameters.
Both models were designed with the idea of obtaining the best accuracy-to-parameter ratio.
Reducing the number of parameters helps us overcome two important problems. First, the use of small CNNs
alleviates slow performance in hardware-constrained systems such as robot platforms. Second, the
reduction of parameters provides better generalization under an Occam's razor framework. Our first model
relies on the idea of completely eliminating the fully connected layers. The second architecture combines the
deletion of the fully connected layer with the inclusion of combined depth-wise separable convolutions and
residual modules. Both architectures were trained with the ADAM optimizer [8].
Following previous architecture schemas, our initial architecture used Global Average Pooling to completely
remove any fully connected layers. This was achieved by having in the last convolutional layer the same number
of feature maps as the number of classes, and applying a softmax activation function to each reduced feature map.
Our initial proposed architecture is a standard fully-convolutional neural network composed of 9 convolution
layers, ReLUs [5], Batch Normalization [7] and Global Average Pooling. This model contains approximately
600,000 parameters. It was trained on the IMDB gender dataset, which contains 460,723 RGB images where
each image belongs to the class “woman” or “man”, and it achieved an accuracy of 96% on this dataset. We
also validated this model on the FER-2013 dataset. This dataset contains 35,887 grayscale images where each
image belongs to one of the following classes: {“angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”,
“neutral”}. Our initial model achieved an accuracy of 66% on this dataset. We will refer to this model as
the “sequential fully-CNN”.
Our second model is inspired by the Xception [1] architecture. This architecture
combines the use of residual modules [6] and depth-wise separable convolutions [2]. Residual modules
modify the desired mapping between two subsequent layers, so that the learned features become the
difference between the original feature map and the desired features. Consequently, the desired features H(x) are
modified in order to solve an easier learning problem F(x) such that:
H(x) = F(x) + x (1)
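As an illustrative sketch (not the exact layer sizes of our network), the residual mapping of Equation (1) can be expressed in a few lines of Keras: a small stack of layers computes F(x) and a shortcut adds x back so that the block outputs H(x) = F(x) + x.

from tensorflow.keras import layers

def residual_block(x, filters=16):
    # F(x): two convolutions with batch normalization; sizes are placeholders
    f = layers.Conv2D(filters, (3, 3), padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, (3, 3), padding="same")(f)
    f = layers.BatchNormalization()(f)
    # H(x) = F(x) + x; assumes x already has `filters` channels so shapes match
    return layers.Add()([f, x])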
Since our initial proposed architecture deleted the last fully connected layer, we further reduced the
number of parameters by eliminating them now from the convolutional layers. This was done through the use
of depth-wise separable convolutions. Depth-wise separable convolutions are composed of two different layers:
depth-wise convolutions and point-wise convolutions. The main purpose of these layers is to separate the
spatial cross-correlations from the channel cross-correlations [1]. They do this by first applying a D × D
filter to each of the M input channels and then applying N 1 × 1 × M convolution filters to combine the M input
channels into N output channels. Applying 1 × 1 × M convolutions combines each value in the feature map
without considering their spatial relation within the channel. Depth-wise separable convolutions reduce the
computation with respect to standard convolutions by a factor of 1/N + 1/D² [2]. A visualization of the
difference between a normal convolution layer and a depth-wise separable convolution can be observed in
Figure 4b.
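A small worked example of this reduction, with illustrative values D = 3, M = 64 and N = 128 (these numbers are not taken from our network):

# Cost of a D x D convolution over M input and N output channels,
# versus its depth-wise separable counterpart.
D, M, N = 3, 64, 128
standard  = D * D * M * N        # one D x D x M filter per output channel
depthwise = D * D * M            # one D x D filter per input channel
pointwise = 1 * 1 * M * N        # N filters of size 1 x 1 x M
print((depthwise + pointwise) / standard)   # ~0.119
print(1 / N + 1 / D ** 2)                   # same value: 1/N + 1/D²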
Our final architecture is a fully-convolutional neural network that contains 4 residual depth-wise separable
convolutions, where each convolution is followed by a batch normalization operation and a ReLU activation
function. The last layer applies global average pooling and a softmax activation function to produce a
prediction. This architecture has approximately 60,000 parameters, which corresponds to a reduction of 10×
when compared to our initial naive implementation, and 80× when compared to the original CNN. Figure 4a
displays our complete final architecture, which we refer to as mini-Xception. This architecture obtains an
accuracy of 95% on the gender classification task, which corresponds to a reduction of one percentage point
with respect to our initial implementation. Furthermore, we tested this architecture on the FER-2013 dataset and
obtained the same accuracy of 66% for the emotion classification task. Our final architecture weights can be
stored in an 855-kilobyte file. By reducing our architectures' computational cost we are now able to join both
models and use them consecutively on the same image without any serious time penalty.
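The following Keras sketch illustrates the general pattern described above — residual modules built from depth-wise separable convolutions, ending in global average pooling and a softmax — with one feature map per FER-2013 emotion class. Filter counts, input size and pooling details are placeholders; this is not the released mini-Xception model.

from tensorflow.keras import layers, Model

def sep_residual_block(x, filters):
    # Shortcut branch: 1x1 convolution with stride 2 to match the main branch shape
    shortcut = layers.Conv2D(filters, (1, 1), strides=2, padding="same")(x)
    # Main branch F(x): two depth-wise separable convolutions plus down-sampling
    f = layers.SeparableConv2D(filters, (3, 3), padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.SeparableConv2D(filters, (3, 3), padding="same")(f)
    f = layers.BatchNormalization()(f)
    f = layers.MaxPooling2D((3, 3), strides=2, padding="same")(f)
    return layers.Add()([f, shortcut])

inputs = layers.Input((48, 48, 1))                   # FER-2013 style grayscale faces
x = layers.Conv2D(8, (3, 3), padding="same", activation="relu")(inputs)
for filters in (16, 32, 64, 128):                    # four residual separable modules
    x = sep_residual_block(x, filters)
x = layers.Conv2D(7, (3, 3), padding="same")(x)      # one feature map per emotion class
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Activation("softmax")(x)
model = Model(inputs, outputs)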
Fig. 4a: Our proposed model for real-time classification.
SOFTWARE DESIGN
4.1 DATA FLOW DIAGRAM
The DFD is also called a bubble chart. A data-flow diagram (DFD) is a graphical
representation of the “flow” of data through an information system. DFDs can also be used for
the visualization of data processing. The flow of data in our system can be described in the form
of a data-flow diagram as follows:-
1. First, if the user is an administrator, he can initiate the following actions:-
● Document processing
● Document search
● Document editing
All the above actions fall under two cases. They are described as follows:-
a) If the printed document is a new document that has not yet been read into the system, then the
document processing phase reads the scanned document as an image only and then produces the
document image stored in computer memory as a result.
Now the document processing phase has the document at hand and can read the document at
any point of time. Later the document processing phase proceeds with recognizing the document
using the OCR methodology and the grid infrastructure. Thus it produces the documents with the
recognized characters as the final output, which can later be searched and edited by the end-user or
administrator.
b) If the printed document is already scanned in and is held in system memory, then the document
processing phase proceeds with document recognition using the OCR
methodology and grid infrastructure. And thus it finally produces the document with recognized
characters as output.
2. If the user using the OCR system is the end-user, then he can perform the following
actions:-
● Document searching
● Document editing
Document Searching:- The documents which are recognized can be searched by the
user whenever required by requesting them from the system database.
Document Editing:- The recognized documents can be edited by adding specific
content to the document, deleting specific content from the document and modifying the
document.
4.2 UML DIAGRAMS
UML combines best techniques from data modeling (entity relationship diagrams), business
modeling (work flows), object modeling, and component modeling. It can be used with all
processes, throughout the software development life cycle, and across different implementation
technologies. UML has 14 types of diagrams divided into two categories. Seven diagram types
represent structural information, and the other seven represent general types of behavior,
including four that represent different aspects of interactions. Some of these diagrams we
provided to describe the design and implementation of our OCR system can be categorized
hierarchically as below:-
Use case diagram
Class diagram
Sequence diagram
Collaboration diagram
Activity diagram
Component diagram
Deployment diagram
4.2.1 USE-CASE DIAGRAMS
Our software system can be used to support a library environment to create a Digital Library
where several paper documents are converted into electronic form for access by users. For
this purpose the printed documents must be recognized before they are converted into
electronic form. The resulting electronic documents are accessed by users like faculty and
students for reading and editing. According to this information, the following are the
different actors involved in implementing our OCR system:-
If we consider a virtual digital library, the Administrator can be the Librarian and
the End-users can be Students and/or Faculty.
The following is the list of use-case diagrams that altogether form the complete or
overall use-case diagram. They are listed below:-
1. Use-case diagram for document processing
2. Use-case diagram for neural network training
3. Use-case diagram for document recognition
4. Use-case diagram for document editing
5. Use-case diagram for document searching
In each of the use-case diagrams below we clearly explain that particular use-case's
functionality. In this we provide a description of the
● Use-case name
● Details about the use-case
● Actors using this use-case
Use case Name
Neural Network Training
Description
The Administrator or End-user enters the specific characters required for training. The user
stores them as an image file and trains the system.
Actors
○ Primary Actor : Administrator or End-user
○ Secondary Actor : User
Flow of Events
1. The user enters the specific characters in order to train the system.
2. After entering, they are stored as an image file.
3. Finally, the system is trained with these characters.
Pre-Condition
The font in the scanned document should be identified.
Figure 4.2.1: Use-Case Diagram for Document Editing
(Use cases: open document in editor, select edit action, perform editing, store edited document; actor: Administrator or End-user)
Actors
○ Primary Actor : Administrator or End-user
○ Secondary Actor : User
Flow of Events
1. The user opens the document for searching a word he required.
2. After opening the document he enters the word for search.
3. Finally searches the word in that document.
Pre And Post Conditions
No pre-condition and post-condition
Overall Use-Case Diagram
4.2.2 CLASS DIAGRAMS
The class diagram is the main building block in object-oriented modeling. The classes in a class
diagram represent both the main objects and/or interactions in the application and the objects to
be programmed.
● The class diagram of our OCR system consists of 9 classes. They are
1. MainScreen
2. Editor
3. HelpFrame
4. Document
5. HEntry
6. Entry
7. TrainingSet
8. KohonenNetwork
9. PrintedFrame.
Among all these classes, MainScreen is the main class that represents all the major
functions carried out by our OCR system. The MainScreen class has an association with
five classes, viz. Editor, HelpFrame, Document, TrainingSet and PrintedFrame. The
TrainingSet class in turn has an association with the HEntry and KohonenNetwork
classes. The PrintedFrame class has an association with the Entry and KohonenNetwork classes.
Figure 4.2.2: Class Diagram
4.2.3 SEQUENCE DIAGRAMS
Sequence diagrams are sometimes called Event-trace diagrams, event scenarios, and timing
diagrams. A sequence diagram shows, as parallel vertical lines (lifelines), different processes or
objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them,
in the order in which they occur. This allows the specification of simple runtime scenarios in a
graphical manner.
In a sequence diagram, the class objects that are used to describe the interaction between various
classes vary from one function to another. There are five sequence diagrams short-listed
below for presenting the sequence of actions performed by each of the five modules. The key
class object involved in all of these module functions is the MainScreen class, which controls the
interaction among the various class objects.
Sequence Diagram for Document Processing
1. Objects
Administrator - “a”
MainScreen - “m”
Document - “d”
SystemMemory - “s”
2. Links
1. Administrator object to MainScreen object.
2. MainScreen object to Document object.
3. Document object to SystemMemory object.
4. SystemMemory object to Administrator object.
3. Messages
1. Process documents
2. Scan documents
3. Scans
4. Stores documents
5. Stores
6. Returns the processed documents
7. Returns
8. End
9. Processed Document
Sequence Diagram for Training
1. Specifies the font characters
2. Stores them as an image
3. Trains the system with the new font
4. System recognizes the new font and returns it for the user
Figure 4.2.3.a: Sequence Diagram for Training
Sequence Diagram for Document Recognition
1. Objects
Administrator - “a”
MainScreen - “m”
SystemMemory - “s”
TrainingSet - “t”
2. Links
1. Administrator object to MainScreen object
2. MainScreen object to SystemMemory object
3. SystemMemory object to MainScreen object
4. TrainingSet object to MainScreen object
5. MainScreen object to Administrator object
3. Messages
1. Recognize documents
2. Store processed document
3. Read file image
4. Recognize using ocr
5. Send processed document
6. Recognize the characters
Figure 4.2.4: Sequence Diagram for Recognition
Sequence Diagram for Document Editing
Messages
1. Edit
2. Adding
3. Adds
4. Deleting
5. Deletes
6. Modifying document
7. Modifies
8. Stores the edited documents
9. Administrator accesses the edited documents
Figure 4.2.5: Sequence Diagram for Editing
CHAPTER 5
MODULES
5.1 Main Modules:
● Face Detection
● Feature Extraction
● Recognition
5.2 Face Detection:
The Viola-Jones object detection framework is the first and one of the most mature frameworks to provide
competitive object detection rates in real time. Face detection is treated as a binary classification problem,
implemented with an AdaBoost classifier over Haar-like features.
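A minimal sketch of this step using the pretrained frontal-face Haar cascade that ships with OpenCV (an implementation of the Viola-Jones framework). The image path is a placeholder and the parameters are illustrative, not necessarily those used in our application.

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("input.jpg")                      # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                           # one bounding box per detected face
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", image)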
5.3 Feature Extraction:
One possible classification divides the feature extraction methods into Holistic Methods and Local Feature-
based Methods. In the first approach the whole face image is used as the input to the recognition operation,
as in the well-known PCA-based method used by Kirby and Sirovich [5] followed by Turk and
Pentland [6]. In the second approach local features are extracted; for example, the location and local statistics of
the eyes, nose and mouth are used in the recognition task.
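The holistic, PCA-based approach (eigenfaces) can be sketched as follows; the array shapes, component count and random data are placeholders for illustration only.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 200 aligned grayscale faces of size 48x48, flattened to vectors
train_faces = np.random.rand(200, 48 * 48)

pca = PCA(n_components=50, whiten=True)
train_features = pca.fit_transform(train_faces)   # 50-dimensional descriptor per face

probe_face = np.random.rand(1, 48 * 48)           # stand-in for a new face image
probe_descriptor = pca.transform(probe_face)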
5.4 Recognition :
The facial recognition module is used to automatically identify people from their video images. It recognizes
faces captured by the Axxon facial detection tool by comparing their parameters with digital templates stored in a
dedicated database.
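The general idea of template comparison can be sketched as a nearest-neighbour search over stored descriptors; the names, descriptors and threshold below are hypothetical and do not describe the internals of the Axxon tool.

import numpy as np

templates = {                                   # identity -> stored face descriptor
    "person_a": np.random.rand(50),
    "person_b": np.random.rand(50),
}

def recognize(probe, threshold=5.0):
    # Find the stored template closest to the probe descriptor
    name, dist = min(((n, np.linalg.norm(probe - t)) for n, t in templates.items()),
                     key=lambda item: item[1])
    return name if dist < threshold else "unknown"

print(recognize(np.random.rand(50)))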
CHAPTER 6
SYSTEM TESTING
Results of the real-time emotion classification task on unseen faces can be observed in
Figure 8(a). Our complete real-time pipeline, including face detection, emotion classification and
gender classification, has been fully integrated in our Care-O-bot 3 robot. An
example of our complete pipeline can be seen in Figure 8(b), in which we provide
emotion and gender classification. In Figure 7 we provide the confusion matrix results
of our emotion classification mini-Xception model. We can observe several common
misclassifications, such as predicting “sad” instead of “fear” and predicting “angry”
instead of “disgust”. A comparison of the learned features between several emotions and
both of our proposed models can be observed in Figure 8(c). The white areas in Figure
8(d) correspond to the pixel values that activate a selected neuron in our last
convolution layer. The selected neuron was always chosen in accordance with the
highest activation. We can observe that the CNN learned to be activated by
features such as the frown, the teeth, the eyebrows and the widening of
one's eyes, and that each feature remains constant within the same class. These results
reassure us that the CNN learned to interpret understandable human-like features that
provide generalizable elements. These interpretable results have helped us understand
several common misclassifications, such as persons with glasses being classified as
“angry”. This happens since the label “angry” is highly activated when the model believes a
person is frowning, and frowning features get confused with darker glass frames.
Moreover, we can also observe that the features learned in our mini-Xception model
are more interpretable than the ones learned from our sequential fully-CNN.
Consequently, the use of more parameters in our naive implementation leads to less
robust features.
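For reference, a simplified saliency computation is sketched below: it takes the gradient of the most activated class score with respect to the input image and uses its magnitude as a per-pixel importance map. Guided back-propagation, as used above, additionally masks negative gradients at every ReLU; this vanilla-gradient version only illustrates the basic idea.

import tensorflow as tf

def saliency_map(model, image):                   # image: (1, H, W, C) float tensor
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        scores = model(image)
        top = tf.reduce_max(scores, axis=-1)      # most activated class score
    grads = tape.gradient(top, image)
    return tf.reduce_max(tf.abs(grads), axis=-1)  # per-pixel importance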
Fig. 6.1: Normalized confusion matrix of our mini-Xception network.
Fig. 6.2: Results of the real-time emotion classification provided in our public repository
CHAPTER 7
SOURCE CODE
7.1 TEXT RECOGNITION
public static void detectText(String filePath, PrintStream out) throws Exception, IOException
{ List<AnnotateImageRequest> requests = new ArrayList<>();
ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));
Image img = Image.newBuilder().setContent(imgBytes).build();
Feature feat = Feature.newBuilder().setType(Type.TEXT_DETECTION).build();
AnnotateImageRequest request =
AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
requests.add(request);
try (ImageAnnotatorClient client = ImageAnnotatorClient.create())
{ BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
List<AnnotateImageResponse> responses = response.getResponsesList();
for (AnnotateImageResponse res : responses)
{ if (res.hasError()) {
out.printf("Error: %sn", res.getError().getMessage());
return;
}
// For full list of available annotations, see http://g.co/cloud/vision/docs
for (EntityAnnotation annotation : res.getTextAnnotationsList()) {
out.printf("Text: %sn", annotation.getDescription());
out.printf("Position : %sn", annotation.getBoundingPoly());
}
}
}
}
7.2 FACE RECOGNITION
public static void detectFaces(String filePath, PrintStream out) throws Exception, IOException
{ List<AnnotateImageRequest> requests = new ArrayList<>();
ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));
Image img = Image.newBuilder().setContent(imgBytes).build();
Feature feat = Feature.newBuilder().setType(Type.FACE_DETECTION).build();
AnnotateImageRequest request =
AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
requests.add(request);
try (ImageAnnotatorClient client = ImageAnnotatorClient.create())
{ BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
List<AnnotateImageResponse> responses = response.getResponsesList();
for (AnnotateImageResponse res : responses)
{ if (res.hasError()) {
out.printf("Error: %sn", res.getError().getMessage());
return;
}
// For full list of available annotations, see http://g.co/cloud/vision/docs
for (FaceAnnotation annotation : res.getFaceAnnotationsList()) {
out.printf(
"anger: %snjoy: %snsurprise: %snposition: %s",
annotation.getAngerLikelihood(),
annotation.getJoyLikelihood(),
annotation.getSurpriseLikelihood(),
annotation.getBoundingPoly());
}
}
}
}
7.3 LANDMARK DETECTION
public static void detectLandmarks(String filePath, PrintStream out) throws Exception,
IOException {
List<AnnotateImageRequest> requests = new ArrayList<>();
ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));
Image img = Image.newBuilder().setContent(imgBytes).build();
Feature feat = Feature.newBuilder().setType(Type.LANDMARK_DETECTION).build();
AnnotateImageRequest request =
AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
requests.add(request);
try (ImageAnnotatorClient client = ImageAnnotatorClient.create())
{ BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
List<AnnotateImageResponse> responses = response.getResponsesList();
for (AnnotateImageResponse res : responses)
{ if (res.hasError()) {
out.printf("Error: %sn", res.getError().getMessage());
return;
}
// For full list of available annotations, see http://g.co/cloud/vision/docs
for (EntityAnnotation annotation : res.getLandmarkAnnotationsList()) {
LocationInfo info = annotation.getLocationsList().listIterator().next();
out.printf("Landmark: %sn %sn", annotation.getDescription(), info.getLatLng());
}
}
}
}
7.4 LABEL DETECTION
public static void detectLabelsGcs(String gcsPath, PrintStream out) throws Exception,
IOException {
List<AnnotateImageRequest> requests = new ArrayList<>();
ImageSource imgSource = ImageSource.newBuilder().setGcsImageUri(gcsPath).build();
Image img = Image.newBuilder().setSource(imgSource).build();
Feature feat = Feature.newBuilder().setType(Type.LABEL_DETECTION).build();
AnnotateImageRequest request =
AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
requests.add(request);
try (ImageAnnotatorClient client = ImageAnnotatorClient.create())
{ BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
List<AnnotateImageResponse> responses = response.getResponsesList();
for (AnnotateImageResponse res : responses)
{ if (res.hasError()) {
out.printf("Error: %sn", res.getError().getMessage());
return;
}
// For full list of available annotations, see http://g.co/cloud/vision/docs
for (EntityAnnotation annotation : res.getLabelAnnotationsList()) {
annotation.getAllFields().forEach((k, v) ->
out.printf("%s : %sn", k, v.toString()));
}
}
}
}
CHAPTER 9
CONCLUSION
We have proposed and tested a general building design for creating real-time CNNs. Our proposed
architectures have been systematically built in order to reduce the number of parameters. We began by
completely eliminating the fully connected layers and by reducing the number of parameters in the remaining
convolutional layers via depth-wise separable convolutions. We have shown that our proposed models can be
stacked for multi-class classification while maintaining real-time inference. Specifically, we have developed
a vision system that performs face detection, gender classification and emotion classification in a single
integrated module. We have achieved human-level performance in our classification tasks using a single
CNN that leverages modern architecture constructs. Our architecture reduces the number of parameters 80×
while obtaining favorable results. Our complete pipeline has been successfully integrated in a Care-O-bot 3
robot. Finally, we presented a visualization of the learned features in the CNN using the guided back-
propagation visualization. This visualization technique is able to show the high-level features learned by
our models and allows us to discuss their interpretability.
CHAPTER 10
REFERENCES
1. François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
2. Andrew G. Howard et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
3. Dario Amodei et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015.
4. Ian Goodfellow et al. Challenges in Representation Learning: A report on three machine learning contests, 2013.
5. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
6. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
7. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
8. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.