Bulat Fatkulin - The Afghanistan chapter of the Chinese online encyclopedia Baidu as a subject for natural language processing tools applied for terminology extraction
1. The “Afghanistan” chapter of the Chinese online encyclopedia “Baidu” as a subject for natural language processing tools applied for terminology extraction
Bulat Fatkulin, South Ural State University, Chelyabinsk, Russia
April 5, 2014
The variety of Oriental cultures surrounding Russia is reflected in a wide range of branches of Oriental studies (Iranian studies, Arabic studies, Turkology, Indology, Afghan studies, etc.). Sinology occupies a leading position among them. All major centers of world civilization, including Russia and China, have their own versions of these branches and use their own terminology. The following reasons make Afghan studies in China topical:
Applied linguistics for Chinese includes a wide range of specialized programs, such as:
1. segmenters
2. morphological analyzers
3. parsers
4. character OCR systems
There are numerous methods of terminology extraction from large amounts of text, called corpora.
In our work we used tools such as:
Stanford Chinese segmenter: http://nlp.stanford.edu:8080/parser/
Shanghai Chinese language segmenter: http://hlt030.cse.ust.hk/research/c-assert/
Automatic annotation of Chinese texts: http://www.chinese-tools.com
The program is run from the command line with the following command:
segment.sh [-k] [ctb|pku] <filename> <encoding> <size>
ctb: Chinese Treebank
pku: Beijing Univ.
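The invocation above can be scripted when many files have to be segmented. The following is a minimal sketch, assuming that `segment.sh` is available at the given path and accepts its arguments in the order shown in the usage line. The function names, the default encoding and the default `size` value are illustrative assumptions, and the `-k` flag is passed through without interpreting it.

```python
import subprocess

def build_segment_command(script, model, filename,
                          encoding="UTF-8", size=0, use_k_flag=False):
    """Build the argument list for: segment.sh [-k] [ctb|pku] <filename> <encoding> <size>.

    The defaults for encoding and size are assumptions for this sketch.
    """
    if model not in ("ctb", "pku"):
        raise ValueError("model must be 'ctb' (Chinese Treebank) or 'pku'")
    command = [str(script)]
    if use_k_flag:
        command.append("-k")  # optional flag from the usage line, passed as-is
    command += [model, str(filename), encoding, str(size)]
    return command

def run_segmenter(script, model, filename, encoding="UTF-8", size=0):
    """Run the segmenter and return its standard output as text."""
    command = build_segment_command(script, model, filename, encoding, size)
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode(encoding)
```

For example, `build_segment_command("./segment.sh", "ctb", "afghanistan.txt")` yields the argument list for segmenting a hypothetical file `afghanistan.txt` with the Chinese Treebank model.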
比尔兼德高地,北部有厄
尔布兹山脉,德马万德峰
海拔5670米,为伊朗最高
峰。西部和西南部是宽
阔的扎格罗斯山山系,约
占国土面积一半。中部为
干燥的盆地,形成许多沙
漠,有卡维尔荒漠与卢特
荒漠,平均海拔1,000余
米。仅西南部波斯湾沿岸
与北部里海沿岸有小面积
的冲击平原。西南部扎格
罗斯山麓至波斯湾头的平
原称胡齐斯坦。
After segmentation, the same Chinese text is much clearer:
尔 兼德 高地 , 北部 有
厄尔布兹 山脉 , 德马万
德峰 海拔 5670 米 , 为
伊朗 最高 峰 。 西部 和
西南部 是 宽阔 的 扎 格罗
斯 山山 系 , 约占 国土 面
积 一半 。 中部 为 干燥 的
盆地 , 形成 许多 沙漠 ,
有 卡维尔 荒漠 与 卢特 荒
漠 , 平均 海拔 1,000余
米 。 仅 西南部 波斯湾 沿
岸 与 北部 里海 沿岸 有 小
面积 的 冲击 平原 。 西南
部 扎 格罗斯 山麓 至 波斯
湾 头 的 平原 称 胡齐斯坦
。
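Once the text is split into space-separated words, candidate terms can be collected by simple counting. The sketch below is one possible illustration, not the method used in this work: it breaks the segmented text at punctuation marks and counts single words and adjacent word pairs. The function names, the punctuation set and the `max_words` parameter are assumptions made for the example, and the sample string is a short fragment of the segmented text with its wrapped line joined.

```python
from collections import Counter

# Punctuation tokens that end a run of words (an assumed, non-exhaustive set).
PUNCTUATION = set(",。、;:!?,.;:!?")

def split_runs(segmented_text):
    """Split space-separated tokens into runs of words between punctuation."""
    runs, current = [], []
    for token in segmented_text.split():
        if token in PUNCTUATION:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(token)
    if current:
        runs.append(current)
    return runs

def count_candidates(segmented_text, max_words=2):
    """Count every sequence of 1..max_words adjacent words inside each run."""
    counts = Counter()
    for run in split_runs(segmented_text):
        for n in range(1, max_words + 1):
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
    return counts

sample = "有 卡维尔 荒漠 与 卢特 荒漠 , 平均 海拔"
counts = count_candidates(sample)
```

Frequent sequences such as 荒漠 are candidates for the term list, while sequences that would cross a punctuation mark are never counted. A real extraction would also need a stop-word filter, which is omitted here.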
We chose the “Afghanistan” section of the Chinese online encyclopedia Baidu as the object of investigation.
Baidu is a Chinese-language online encyclopedia developed and supported by the Chinese search engine Baidu. Like Baidu itself, the encyclopedia is censored in accordance with Chinese government regulations.
Our work was divided into several stages:
1. selection of raw texts about Afghanistan in Chinese
2. use of a text-processing program for automatic annotation of the text and isolation of terminological phrases
3. updating the terminology
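The last stage can be pictured as merging newly found candidates into an existing term list. The sketch below is a hypothetical illustration only: the glossary format (a dictionary from Chinese term to translation), the `min_count` threshold and the function name are assumptions, not a description of the procedure actually followed.

```python
def update_terminology(glossary, candidate_counts, min_count=2):
    """Return an updated copy of the glossary and the list of newly added terms.

    glossary: dict mapping a Chinese term to its translation (assumed format).
    candidate_counts: dict mapping a candidate term to its corpus frequency.
    """
    new_terms = sorted(
        term for term, count in candidate_counts.items()
        if count >= min_count and term not in glossary
    )
    updated = dict(glossary)
    for term in new_terms:
        updated[term] = None  # translation left for a specialist to fill in
    return updated, new_terms
```

Candidates below the threshold or already present in the glossary are skipped, so the existing translations are never overwritten.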