ISSN 1858-1633 @2006 ICTS
A CHINESE CHARACTER RECOGNITION METHOD BASED ON POPULATION MATRIX AND RELATIONAL DATABASE
Teady Matius Surya Mulyana 1) and Agus Harjoko 2)
1) Central Library, Petra Christian University, Surabaya, Indonesia
2) Electronic and Instrumentation Lab., FMIPA, Gadjah Mada University, Yogyakarta, Indonesia 55281
email: aharjoko@ugm.ac.id
ABSTRACT
A Chinese character can appear in many different forms, so the information carried by its features varies widely. A relational database is therefore needed to store the many variations of those features. Storing the feature sets in a relational database also makes it possible to apply distance measurement methods over the whole set of features belonging to a Chinese character when recognizing an input character image.
The feature used in this work is the pixel population matrix. The feature sets are stored and queried using the relational database.
This paper discusses how to recognize Chinese character images and Chinese radical images using a relational database and the pixel population matrix.
Keywords: optical character recognition, relational database, population matrix, Chinese character recognition.
1 INTRODUCTION
Hanzi, the Han characters, also known as Kanji, comprise thousands of characters [1]. As a result of their evolution, every Kanji has three models, namely the ancient, traditional and popular characters. Each model is rendered in various fonts.
A Kanji can be either a single character or a combination of single characters that forms a new character. The single characters used to form a Kanji are called "radicals".
A single character may keep its original form or take a new form when it serves as the radical of another Kanji. For example, the character 心 is displayed as 忄 when it is the radical in the character 情, whereas in the character 意 it retains its original form. The placement of radicals varies: on the left, in the middle or on the right, and at the top or at the bottom. Another example is the character 唱, which is formed from the radical 口 on the left and two 日 radicals on the top right and bottom right.
Low [2] explains a method for recognizing an alphabetic character from an image by dividing the pixels of the character's image into matrix cells. The percentage of pixels falling in each matrix cell forms the pattern vector used to identify a character. Lu [3] explains the use of a database to store such features.
Building on the ideas of Lu and Low, the authors combine Low's pixel population matrix features with a relational database that stores and manipulates the features, in order to accommodate the font variations of every Kanji needed for Kanji recognition. The relational database is also meant to ease the search for Kanji stored in the database from a radical image input.
2 THE IMAGE PROCESSING SYSTEM
The recognition of Kanji characters is based on features of the Kanji. The features used in this research are the ratio between the width and the height, and the pixel population matrix [2]. The pixel population matrix can be 2x2, 3x3, 4x4 or 6x6 cells, as chosen by the user. The pixel population features are stored as a 12x12 matrix (12 is divisible by 2, 3, 4 and 6), which is converted to the matrix size chosen by the user before it is used to identify the input image. The features are stored in a relational database.
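The feature extraction described above can be sketched in Python as follows. This is a minimal sketch under assumptions the paper does not state: the image is a binarized 2D list with 1 for stroke pixels, each cell value is the percentage of stroke pixels inside that cell, and the stored 12x12 matrix is converted to a smaller one by averaging equal blocks. The function names `population_matrix` and `reduce_matrix` are hypothetical.

```python
def population_matrix(image, cells=12):
    """Percentage of stroke pixels (value 1) in each cell of a cells x cells grid."""
    height, width = len(image), len(image[0])
    matrix = []
    for r in range(cells):
        top, bottom = r * height // cells, (r + 1) * height // cells
        row = []
        for c in range(cells):
            left, right = c * width // cells, (c + 1) * width // cells
            area = (bottom - top) * (right - left)
            filled = sum(image[y][x]
                         for y in range(top, bottom)
                         for x in range(left, right))
            # An empty cell (image smaller than the grid) counts as 0 percent.
            row.append(100.0 * filled / area if area else 0.0)
        matrix.append(row)
    return matrix


def reduce_matrix(matrix, size):
    """Convert the stored matrix to size x size by averaging equal blocks (assumed rule)."""
    n = len(matrix)
    if n % size:
        raise ValueError("stored matrix size must be divisible by the target size")
    block = n // size
    return [[sum(matrix[r * block + i][c * block + j]
                 for i in range(block) for j in range(block)) / (block * block)
             for c in range(size)]
            for r in range(size)]


if __name__ == "__main__":
    # Toy 24x24 image whose left half is filled with stroke pixels.
    image = [[1 if x < 12 else 0 for x in range(24)] for _ in range(24)]
    stored = population_matrix(image)       # 12x12 matrix kept in the database
    print(reduce_matrix(stored, 3)[0])      # [100.0, 50.0, 0.0]
```

Block averaging is exact only when all cells cover the same number of pixels; for image sizes that are not multiples of 12 it is an approximation.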
A Chinese character is recognized by computing the distance between the input character and the characters in the database. A number of similarity measures are widely used in the computer vision literature [4,5,6,7,8], among them the Minkowski family of distance measures, which includes the L1 metric and the Euclidean distance. In this research the L1 metric is used for its simplicity. The L1 distance is expressed in equation (1).
$$ d(I,H) = \sum_{l=1}^{n} |i_l - h_l| \qquad (1) $$
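As a worked instance of equation (1), the sketch below computes the L1 distance between the input image and the Simsun sample of Figure 1. It assumes that the feature vector is the nine 3x3 cell values followed by the width-to-height ratio, which is what the Simsun value in Figure 1 suggests; the function name `l1_distance` is hypothetical.

```python
def l1_distance(i_features, h_features):
    """Equation (1): sum of absolute differences between paired features."""
    if len(i_features) != len(h_features):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(i - h) for i, h in zip(i_features, h_features))


# Values read from Figure 1 (Kanji 章): nine cell values, then width / height.
input_features = [36, 51, 34, 42, 55, 42, 26, 48, 26, 78 / 81]
simsun_features = [16, 38, 22, 21, 24, 22, 13, 29, 15, 87 / 87]

if __name__ == "__main__":
    print(round(l1_distance(input_features, simsun_features), 6))  # 160.037037
```

The nine cell differences sum to 160 and the ratio difference contributes about 0.037, which matches d1(I,H) in Figure 1.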
The use of the relational database makes it possible to measure distances against the sets of features from the several images that a Kanji character might possess. The distance from the image I to the s-th image H of a Kanji sample is denoted d_s(I,H). The value of d_s is obtained from equation (2), derived from equation (1).
$$ d_s(I,H) = \sum_{l=1}^{n} |i_l - h_{sl}|, \quad s = 1 \ldots m \qquad (2) $$
Here h_{sl} denotes feature l of the s-th stored image and m is the number of images stored for the Kanji.
Based on equation (1), six methods for distance
measurement are obtained. Those six methods are:
• The smallest distance: the smallest of the distances between the images of a Kanji and the input image. This distance is obtained from equation (3).
$$ d^k(I,H) = \min_{s}\, d_s(I,H), \quad s = 1 \ldots m \qquad (3) $$
• The average distance: the average of the distances between the images of a Kanji character and the input image. This average distance is obtained from equation (4).
$$ d^r(I,H) = \frac{\sum_{s=1}^{m} d_s(I,H)}{m} \qquad (4) $$
• The biggest distance: the biggest of the distances between the images of a Kanji and the input image. This distance is obtained from equation (5).
$$ d^b(I,H) = \max_{s}\, d_s(I,H), \quad s = 1 \ldots m \qquad (5) $$
• The smallest range of each feature: for each feature l, the smallest difference between the feature i_l of the input image and the same feature h_{sl} in each of the m images of a Kanji character; these smallest differences are summed. This distance is obtained from equation (6).
$$ d^u(I,H) = \sum_{l=1}^{n} \min_{s} |i_l - h_{sl}|, \quad s = 1 \ldots m \qquad (6) $$
• The average range of each feature: for each feature l, the average difference between the feature i_l of the input image and the same feature h_{sl} in each of the m images of a Kanji; these averages are summed. This distance is obtained from equation (7).
$$ d^v(I,H) = \sum_{l=1}^{n} \frac{\sum_{s=1}^{m} |i_l - h_{sl}|}{m} \qquad (7) $$
• The biggest range of each feature: for each feature l, the biggest difference between the feature i_l of the input image and the same feature h_{sl} in each of the m images of a Kanji; these biggest differences are summed. This distance is obtained from equation (8).
$$ d^w(I,H) = \sum_{l=1}^{n} \max_{s} |i_l - h_{sl}|, \quad s = 1 \ldots m \qquad (8) $$
The smallest, average and biggest distances are obtained by first computing the distance from the input image to each of the images a Kanji character has. One of the resulting distances is then used, according to the method chosen beforehand. See the illustration in Figure 1.
The smallest, average and biggest distance ranges of each feature are obtained by computing the differences between each feature of the Kanji's images and the same feature of the input image. The smallest, average or biggest of these differences is taken as the distance between that feature of the Kanji and of the input image. These per-feature results are then summed to give the distance between the Kanji and the input image. A further illustration is given in Figure 2.
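The six methods can be sketched together in Python using the cell values of Figure 1. This is only an illustration of equations (2) to (8): it uses the nine 3x3 cell values and leaves out the width-to-height term, so its totals are smaller than those printed in Figures 1 and 2. The function names are hypothetical.

```python
def per_image_distances(query, samples):
    """Equation (2): one L1 distance per stored image of a Kanji."""
    return [sum(abs(q - h) for q, h in zip(query, sample)) for sample in samples]


def per_feature_differences(query, samples):
    """For each feature l, the list of |i_l - h_sl| over all stored images s."""
    return [[abs(q - sample[l]) for sample in samples] for l, q in enumerate(query)]


def six_distances(query, samples):
    """Equations (3) to (8), keyed by the superscripts used in the text."""
    d = per_image_distances(query, samples)
    diffs = per_feature_differences(query, samples)
    m = len(samples)
    return {
        "k": min(d),                               # smallest distance
        "r": sum(d) / m,                           # average distance
        "b": max(d),                               # biggest distance
        "u": sum(min(col) for col in diffs),       # smallest range of each feature
        "v": sum(sum(col) / m for col in diffs),   # average range of each feature
        "w": sum(max(col) for col in diffs),       # biggest range of each feature
    }


# 3x3 cell values read from Figure 1 (Kanji 章), flattened row by row.
QUERY = [36, 51, 34, 42, 55, 42, 26, 48, 26]
SAMPLES = [
    [16, 38, 22, 21, 24, 22, 13, 29, 15],  # Simsun
    [21, 36, 25, 19, 27, 22, 15, 29, 18],  # Batang
    [39, 58, 38, 39, 50, 41, 32, 59, 35],  # Simhei
]

if __name__ == "__main__":
    for name, value in six_distances(QUERY, SAMPLES).items():
        print(name, round(value, 2))
```

On these cell values the sketch gives k = 49, r = 119, b = 160, u = 48, v = 119 and w = 164. When equations (4) and (7) are evaluated exactly as written over the same features, the average distance and the average range of each feature coincide, because the two sums can be exchanged.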
After the distance between each Kanji and the input image has been obtained using one of the six methods, the Kanji with the smallest distance to the input image is taken as the recognized character.
The entity relationship is shown in Figure 3. A Kanji character has several Kanji images with their features. Every Kanji image uses a font. Moreover, every Kanji image also has several radical images.
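The entities described above, together with recognition by the smallest distance method, could be expressed with Python's built-in `sqlite3` module roughly as below. The table and column names are assumptions made for illustration and are not taken from the paper's schema; the three 章 images use the cell values of Figure 1, while the 音 row contains invented values that only serve to give the query a second candidate.

```python
import sqlite3

# Hypothetical schema mirroring the entities named in the text.
SCHEMA = """
CREATE TABLE kanji (id INTEGER PRIMARY KEY, name TEXT, glyph TEXT);
CREATE TABLE font (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE kanji_image (
    id INTEGER PRIMARY KEY,
    kanji_id INTEGER REFERENCES kanji(id),
    font_id INTEGER REFERENCES font(id)
);
CREATE TABLE image_feature (
    image_id INTEGER REFERENCES kanji_image(id),
    cell INTEGER,
    value REAL,
    PRIMARY KEY (image_id, cell)
);
CREATE TABLE radical_image (
    id INTEGER PRIMARY KEY,
    kanji_image_id INTEGER REFERENCES kanji_image(id)
);
CREATE TEMP TABLE query_feature (cell INTEGER PRIMARY KEY, value REAL);
"""

# Smallest distance method: L1 distance per stored image, then the minimum per Kanji.
SMALLEST_DISTANCE_SQL = """
WITH per_image AS (
    SELECT ki.kanji_id AS kanji_id, SUM(ABS(f.value - q.value)) AS d
    FROM kanji_image AS ki
    JOIN image_feature AS f ON f.image_id = ki.id
    JOIN query_feature AS q ON q.cell = f.cell
    GROUP BY ki.id, ki.kanji_id
)
SELECT k.name, MIN(p.d) AS distance
FROM per_image AS p
JOIN kanji AS k ON k.id = p.kanji_id
GROUP BY k.id, k.name
ORDER BY distance
LIMIT 1
"""

# (image id, kanji id, font id, 3x3 cell values)
IMAGES = [
    (1, 1, 1, [16, 38, 22, 21, 24, 22, 13, 29, 15]),  # 章, Simsun (Figure 1)
    (2, 1, 2, [21, 36, 25, 19, 27, 22, 15, 29, 18]),  # 章, Batang (Figure 1)
    (3, 1, 3, [39, 58, 38, 39, 50, 41, 32, 59, 35]),  # 章, Simhei (Figure 1)
    (4, 2, 1, [10, 20, 10, 10, 20, 10, 10, 20, 10]),  # 音, invented values
]


def build_database():
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO kanji VALUES (?, ?, ?)",
                    [(1, "zhang", "章"), (2, "yin", "音")])
    con.executemany("INSERT INTO font VALUES (?, ?)",
                    [(1, "Simsun"), (2, "Batang"), (3, "Simhei")])
    for image_id, kanji_id, font_id, cells in IMAGES:
        con.execute("INSERT INTO kanji_image VALUES (?, ?, ?)",
                    (image_id, kanji_id, font_id))
        con.executemany("INSERT INTO image_feature VALUES (?, ?, ?)",
                        [(image_id, cell, value) for cell, value in enumerate(cells)])
    return con


def recognize(con, query_cells):
    """Return (name, distance) of the Kanji closest to the query features."""
    con.execute("DELETE FROM query_feature")
    con.executemany("INSERT INTO query_feature VALUES (?, ?)",
                    list(enumerate(query_cells)))
    return con.execute(SMALLEST_DISTANCE_SQL).fetchone()


if __name__ == "__main__":
    connection = build_database()
    print(recognize(connection, [36, 51, 34, 42, 55, 42, 26, 48, 26]))
```

Keeping one row per feature lets the aggregate functions of the database compute the per-image and per-feature minimum, average and maximum directly, which is the property the six methods rely on.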
The search for a Kanji from a radical image input uses the distance to the radical images possessed by a Kanji character. The distance is calculated with the smallest distance range method, because a Kanji character can possess several different radical images. The Kanji found are those that have radical images whose distance to the input image is within a certain threshold.
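The radical search can be sketched as follows. The sketch assumes that "within a certain threshold" means a distance less than or equal to the threshold, and that the smallest distance range of equation (6) is taken over all radical images of a Kanji. All feature values below are invented 2x2 examples, and the function names are hypothetical.

```python
def smallest_range_distance(query, samples):
    """Equation (6): per feature, keep the smallest |i_l - h_sl| over the samples."""
    return sum(min(abs(q - sample[l]) for sample in samples)
               for l, q in enumerate(query))


def search_by_radical(query, radical_index, threshold):
    """Return (kanji, distance) pairs within the threshold, closest first."""
    found = []
    for kanji, radical_images in radical_index.items():
        distance = smallest_range_distance(query, radical_images)
        if distance <= threshold:
            found.append((kanji, distance))
    return sorted(found, key=lambda item: item[1])


# Invented 2x2 population values for the radical images of three Kanji.
RADICAL_INDEX = {
    "唱": [[48, 52, 50, 49], [20, 80, 30, 70]],
    "叫": [[52, 47, 51, 50], [10, 10, 90, 90]],
    "林": [[20, 85, 25, 80], [22, 83, 27, 78]],
}

if __name__ == "__main__":
    query = [50, 50, 50, 50]  # invented features of an input radical image
    print(search_by_radical(query, RADICAL_INDEX, threshold=20))
    # [('唱', 5), ('叫', 6)]
```

Unlike character recognition, which returns the single closest Kanji, this search returns every Kanji whose radical images fall within the threshold.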
If every radical image were associated with a radical entity through a one-to-many relation (one radical to many radical images), the radical reference could be used as a feature of the Kanji's image; however, this research does not use any radical features. Therefore, no radical entity is added to the entity relationship diagram.
2nd Information and Communication Technology Seminar, August 2006
Kanji 章, 3 x 3 pixel population matrix.

Input image I (width 78, height 81, width : height = 0.962962963):
36 51 34
42 55 42
26 48 26

Stored images H, cell differences |I-H| and distances d_s(I,H):

Font    width  height  width : height  H                               |I-H|                           width : height term  d_s(I,H)
Simsun  87     87      1               16 38 22 / 21 24 22 / 13 29 15  20 13 12 / 21 31 20 / 13 19 11  0.037037037          d1(I,H) = 160.037037
Batang  81     88      0.920454545     21 36 25 / 19 27 22 / 15 29 18  15 15 9 / 23 28 20 / 11 19 8    20.07954545          d2(I,H) = 168.0795455
Simhei  83     85      0.976470588     39 58 38 / 39 50 41 / 32 59 35  3 7 4 / 3 5 1 / 6 11 9          22.02352941          d3(I,H) = 71.02352941

Results: d^k = 71.023529 (smallest), d^r = 133.0467 (average), d^b = 168.07955 (biggest)
Figure 1. Illustration of the smallest, average and biggest distance methods
Kanji 章, with the same input image I and stored images H (Simsun, Batang, Simhei) as in Figure 1.

Method                                     per-feature values                  width : height term  result
d^u(I,H), smallest range of each feature   3 7 4 / 3 5 1 / 6 11 8              0.037037037          d^u = 48.04
d^v(I,H), average range of each feature    13 11.7 8 / 16 21.3 14 / 10 16.3 9  14.04670397          d^v = 133
d^w(I,H), biggest range of each feature    20 15 12 / 23 31 20 / 13 19 11      22.02352941          d^w = 186
Figure 2. Illustration of the distance range of each feature methods
Figure 3. Entity Relationship Diagram
The Kanji image entity of a given Kanji character, with its features, is used to recognise a Kanji character by applying one of the six distance measurement methods. The model attributes in the radical image and font entities are used to restrict which radical images are included in the distance measurement.
The radical image entity is also used to search for a Kanji with a certain radical image, which is measured by its distance to the set of radical image features possessed by the Kanji images of a specific Kanji character.
3 RESULT
There are 51 Kanji characters stored in the database for the purpose of this research, as shown in Table 1. Each Kanji has three images. Each Kanji image has various radical images.
The tests of the accuracy of the features used and of the distance measurement methods are shown in Table 2.
Table 1. Stored Kanji Characters
No. Name   Batang SimSun SimHei | No. Name   Batang SimSun SimHei | No. Name   Batang SimSun SimHei
 1  yin    音     音     音     | 18  hao    好     好     好     | 35  xiao   小     小     小
 2  zhang  章     章     章     | 19  de     的     的     的     | 36  shao   少     少     少
 3  yi     意     意     意     | 20  dian   点     点     点     | 37  tu     土     土     土
 4  li     立     立     立     | 21  ren    人     人     人     | 38  ba     吧     吧     吧
 5  rui    瑞     瑞     瑞     | 22  ru     入     入     入     | 39  zai    在     在     在
 6  lin    林     林     林     | 23  ru     如     如     如     | 40  wen    文     文     文
 7  jing   京     京     京     | 24  shui   水     水     水     | 41  ting   听     听     听
 8  fei    飞     飞     飞     | 25  dong   东     东     东     | 42  jiao   叫     叫     叫
 9  men    们     们     们     | 26  xi     西     西     西     | 43  xie    谢     谢     谢
10  men    门     门     门     | 27  nan    南     南     南     | 44  guang  光     光     光
11  er     儿     儿     儿     | 28  bei    北     北     北     | 45  ta     他     他     他
12  er     而     而     而     | 29  kou    口     口     口     | 46  ta     她     她     她
13  he     河     河     河     | 30  chang  唱     唱     唱     | 47  ta     它     它     它
14  he     何     何     何     | 31  yin    因     因     因     | 48  jia    家     家     家
15  ge     哥     哥     哥     | 32  hui    回     回     回     | 49  ai     爱     爱     爱
16  ke     可     可     可     | 33  le     了     了     了     | 50  gu     古     古     古
17  shi    是     是     是     | 34  bu     不     不     不     | 51  ni     你     你     你
With the 6 x 6 matrix, the average distance method, the smallest distance range of each feature and the average distance range of each feature all reach a success rate of 100 %.
The 3 x 3 and 4 x 4 matrices are more successful with the smallest distance method (min in Table 2). This is due to the characteristics of the Kanji characters themselves, whose pixels are dispersed across the cells when the image is segmented into a 3 x 3 or 4 x 4 matrix.
The 2 x 2 matrix reaches a success rate of only 5 to 20 %. This results from dividing the pixels into just four cells, which is too coarse a division to serve as the feature of an image. Hence, the pixel dispersion cannot be detected in enough detail.
Table 2. The Result of Kanji Character Recognition Test
                 | min             | avg             | max             | min sel         | avg sel         | max sel
No  Name  Kanji  | 2x2 3x3 4x4 6x6 | 2x2 3x3 4x4 6x6 | 2x2 3x3 4x4 6x6 | 2x2 3x3 4x4 6x6 | 2x2 3x3 4x4 6x6 | 2x2 3x3 4x4 6x6
 1  bei   北     |   0   1   1   1 |   0   0   1   1 |   0   0   0   0 |   0   0   1   1 |   0   0   1   1 |   0   0   0   1
 2  de    的     |   1   1   1   1 |   0   0   1   1 |   0   0   0   1 |   0   0   1   1 |   0   0   1   1 |   0   0   0   1
 3  dian  点     |   1   1   1   1 |   0   1   1   1 |   0   0   0   1 |   0   0   0   1 |   0   0   0   1 |   0   0   0   1
 4  dong  东     |   0   0   0   1 |   0   0   0   1 |   0   0   0   0 |   0   0   0   1 |   0   0   0   1 |   0   0   0   1
 5  er    儿     |   0   1   1   1 |   0   1   1   1 |   0   1   0   1 |   0   1   1   1 |   0   1   1   1 |   0   1   1   1
 6  er    而     |   0   0   0   0 |   0   0   0   1 |   0   0   0   0 |   0   0   0   1 |   0   0   0   1 |   0   0   0   0
 7  fei   飞     |   1   1   1   1 |   1   1   1   1 |   0   1   1   1 |   1   0   1   1 |   1   1   1   1 |   1   1   1   1
 8  ge    哥     |   0   0   0   1 |   0   0   0   1 |   0   0   0   0 |   0   0   0   1 |   0   0   0   1 |   0   0   0   0
 9  hao   好     |   0   1   1   1 |   0   0   1   1 |   0   0   0   1 |   0   0   1   1 |   0   0   1   1 |   0   0   0   1
10  he    何     |   0   1   1   1 |   0   1   1   1 |   0   0   0   1 |   0   1   1   1 |   0   1   1   1 |   0   0   0   1
11  he    河     |   0   1   1   1 |   0   1   1   1 |   0   0   1   1 |   0   1   1   1 |   0   1   1   1 |   0   0   0   1
12  jing  京     |   0   1   1   1 |   0   0   1   1 |   0   0   0   0 |   0   0   1   1 |   0   0   1   1 |   0   0   1   1
13  ke    可     |   0   1   1   1 |   0   1   1   1 |   0   0   0   1 |   0   0   1   1 |   0   0   0   1 |   0   0   1   1
14  li    立     |   0   0   1   1 |   0   0   0   1 |   0   0   0   0 |   0   0   0   1 |   0   0   0   1 |   0   0   0   1
15  lin   林     |   0   1   1   1 |   0   0   1   1 |   0   0   0   1 |   0   0   1   1 |   0   0   1   1 |   0   0   0   1
16  men   们     |   0   1   1   1 |   0   1   1   1 |   0   1   1   1 |   0   1   1   1 |   0   1   1   1 |   0   1   1   1
17  men   门     |   0   0   1   1 |   0   0   1   1 |   0   1   1   1 |   0   0   1   1 |   0   0   1   1 |   0   1   1   1
18  nan   南     |   0   0   1   1 |   0   0   0   1 |   0   0   0   0 |   0   0   0   1 |   0   0   0   1 |   0   0   0   0
19  ren   人     |   1   1   1   1 |   1   1   1   1 |   1   1   1   1 |   1   1   1   1 |   1   1   1   1 |   1   1   1   1
20  ru    入     |   0   0   0   1 |   0   1   0   1 |   0   0   0   1 |   0   0   1   1 |   0   1   0   1 |   0   1   0   0
Success          |   4  13  16  19 |   2   9  14  20 |   1   5   5  13 |   2   5  14  20 |   2   7  12  20 |   2   6   7  16
Success rate (%) |  20  65  80  95 |  10  45  70 100 |   5  25  25  65 |  10  25  70 100 |  10  35  60 100 |  10  30  35  80
min, avg, max: smallest, average and biggest distance; min sel, avg sel, max sel: smallest, average and biggest range of each feature; 1 = recognized, 0 = not recognized.
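The success rates in Table 2 follow directly from the success counts over the 20 test characters, as the short check below shows. It is only a sketch for reading the table; the names `SUCCESS`, `success_rates` and `best_methods` are hypothetical.

```python
SIZES = ["2x2", "3x3", "4x4", "6x6"]

# Success counts from Table 2, one list per method, ordered as in SIZES.
SUCCESS = {
    "min": [4, 13, 16, 19],
    "avg": [2, 9, 14, 20],
    "max": [1, 5, 5, 13],
    "min sel": [2, 5, 14, 20],
    "avg sel": [2, 7, 12, 20],
    "max sel": [2, 6, 7, 16],
}
TOTAL = 20  # number of Kanji characters tested


def success_rates(counts, total=TOTAL):
    """Success rate in percent for each matrix size."""
    return [100 * count / total for count in counts]


def best_methods(size):
    """Methods with the highest success count for the given matrix size."""
    column = SIZES.index(size)
    top = max(counts[column] for counts in SUCCESS.values())
    return [method for method, counts in SUCCESS.items() if counts[column] == top]


if __name__ == "__main__":
    print(success_rates(SUCCESS["min"]))  # [20.0, 65.0, 80.0, 95.0]
    print(best_methods("3x3"))            # ['min']
    print(best_methods("6x6"))            # ['avg', 'min sel', 'avg sel']
```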
The success rate obtained using the radical images is 73 %. In this test the population matrix is 3x3. Due to space limitations this result is not presented in this paper; readers are encouraged to consult [9]. The success rate could very likely be increased if the 6 x 6 feature matrix were used.
4 CONCLUSION
Based on the research on the use of the pixel population matrix and a relational database to recognize a Kanji character and its radicals, the writers draw several conclusions:
1. Among the 2x2, 3x3, 4x4 and 6x6 pixel population matrices, the 6x6 matrix recognizes Kanji images most accurately.
2. The four cells of the 2x2 pixel population matrix are not sufficient to characterize the pixel dispersion of a character.
3. Storing the sets of image features in a relational database helps the implementation of the distance range measurement methods, which need the whole set of features for character recognition.
The writers propose several suggestions for future research:
1. The six distance measurement methods are also applicable to character recognition using other features.
2. There has not yet been any detailed research on the conditions that determine when one method succeeds over the others. These methods can be studied further in future research.
REFERENCES
[1] Kasmito and Tanzil, J., Petunjuk Termudah Belajar Mandarin, Binarupa Aksara, Jakarta, Indonesia, 1997.
[2] Low, A., Introductory Computer Vision and Image Processing, McGraw-Hill, Berkshire, UK, 1991.
[3] Lu, G., Multimedia Database Management Systems, Artech House, London, 1999.
[4] Gonzalez, R.C. and Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Company, 1992.
[5] Hearn, D. and Baker, P.M., Computer Graphics, Prentice Hall, USA, 1986.
[6] Kasturi, R. and Jain, R.C., Computer Vision: Collection of Computer Vision Journal 1951 - 1991, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[7] Steinmetz, R. and Nahrstedt, K., Multimedia Computing, Communication & Application, Innovative Technology Series, Prentice Hall, USA, 1995.
[8] Liu, C., Jaeger, S. and Nakagawa, M., "Online Recognition of Chinese Characters: The State of The Art", IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 2004, 198 - 213.
[9] Mulyana, T.M.S., Penggunaan Matriks Populasi Pixel Dan Relational Database Untuk Mengenali Huruf Kanji Dan Radikalnya, Masters Thesis, Computer Science Study Program, Gadjah Mada University, 2006.