The “Afghanistan” chapter of the Chinese online encyclopedia “Baidu” as a subject for natural language processing tools applied for terminology extraction
Bulat Fatkulin, South Ural state university, Chelyabinsk, Russia
April 5, 2014
The variety of Oriental cultures surrounding Russia is reflected in a wide range of branches of Oriental studies (Iranian studies, Arabic studies, Turkology, Indology, Afghan studies, etc.). Sinology occupies a leading position among them. All major centers of world civilization, including Russia and China, have their own versions of these branches and use their own terminology. The following reasons make Afghan studies in China topical:
Applied linguistics for Chinese includes a wide range of specialized programs, such as:
1. segmenters
2. morphological analyzers
3. parsers
4. character OCR systems
There are numerous methods for extracting terminology from large collections of text, called corpora.
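One simple family of such methods counts how often word sequences recur in a segmented corpus. The sketch below is only an illustration of that idea, not the method used in this work: the function name, the punctuation list, the thresholds and the two sample lines are assumptions made for the example.

```python
from collections import Counter

# Tokens that should never be part of a term candidate (an assumed list).
PUNCTUATION = {",", "。", "、", ";", ":", ",", "."}

def candidate_terms(segmented_lines, max_len=3, min_count=2):
    """Count word n-grams (1..max_len tokens) that do not cross punctuation.

    Each input line is expected to be space-separated segmenter output.
    Returns a dict mapping the joined n-gram to its frequency.
    """
    counts = Counter()
    for line in segmented_lines:
        tokens = line.split()
        for start in range(len(tokens)):
            for n in range(1, max_len + 1):
                gram = tokens[start:start + n]
                # Stop when we run off the line or hit punctuation.
                if len(gram) < n or any(t in PUNCTUATION for t in gram):
                    break
                counts["".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_count}

# Two short segmented lines, used purely as sample input.
corpus = [
    "卡维尔 荒漠 与 卢特 荒漠",
    "卢特 荒漠 , 平均 海拔",
]
print(candidate_terms(corpus))
```

Joining the tokens without a separator keeps the candidates readable as Chinese terms, at the cost of merging different token splits that produce the same string. Raw frequency is a crude filter, so the resulting candidates still need manual review.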
In our work we used tools such as:
1. Stanford Chinese segmenter: http://nlp.stanford.edu:8080/parser/
2. Shanghai Chinese language segmenter: http://hlt030.cse.ust.hk/research/c-assert/
3. Automatic annotation of Chinese texts: http://www.chinese-tools.com
The segmenter is run from the command line with the following command:
segment.sh [-k] [ctb|pku] <filename> <encoding> <size>
ctb: Chinese Treebank
pku: Beijing Univ. (Peking University)
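When many files have to be processed, the same command can be assembled and launched from a script. The following is a minimal sketch under stated assumptions: the script path `./segment.sh`, the file name `afghanistan.txt` and the `size` value `0` are placeholders, the meaning of `-k` is not explained here so it is passed through unchanged, and the segmented text is assumed to arrive on standard output.

```python
import subprocess

def build_segment_command(filename, encoding, size, model="ctb",
                          use_k_flag=False, script="./segment.sh"):
    """Build the argument list: segment.sh [-k] [ctb|pku] <filename> <encoding> <size>."""
    if model not in ("ctb", "pku"):
        raise ValueError("model must be 'ctb' or 'pku'")
    cmd = [script]
    if use_k_flag:
        cmd.append("-k")  # optional flag from the usage line, passed as-is
    cmd += [model, filename, encoding, str(size)]
    return cmd

def run_segmenter(cmd):
    """Run the command and return whatever it writes to standard output.

    Assumes the segmenter prints its result to stdout as UTF-8 text.
    """
    result = subprocess.run(cmd, capture_output=True,
                            encoding="utf-8", check=True)
    return result.stdout

# Hypothetical file name and size value, shown only to illustrate the call.
print(build_segment_command("afghanistan.txt", "UTF-8", 0))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces or non-ASCII characters.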
比尔兼德高地,北部有厄尔布兹山脉,德马万德峰海拔5670米,为伊朗最高峰。西部和西南部是宽阔的扎格罗斯山山系,约占国土面积一半。中部为干燥的盆地,形成许多沙漠,有卡维尔荒漠与卢特荒漠,平均海拔1,000余米。仅西南部波斯湾沿岸与北部里海沿岸有小面积的冲击平原。西南部扎格罗斯山麓至波斯湾头的平原称胡齐斯坦。
After segmentation, the same Chinese text is much clearer:
尔 兼德 高地 , 北部 有
厄尔布兹 山脉 , 德马万
德峰 海拔 5670 米 , 为
伊朗 最高 峰 。 西部 和
西南部 是 宽阔 的 扎 格罗
斯 山山 系 , 约占 国土 面
积 一半 。 中部 为 干燥 的
盆地 , 形成 许多 沙漠 ,
有 卡维尔 荒漠 与 卢特 荒
漠 , 平均 海拔 1,000余
米 。 仅 西南部 波斯湾 沿
岸 与 北部 里海 沿岸 有 小
面积 的 冲击 平原 。 西南
部 扎 格罗斯 山麓 至 波斯
湾 头 的 平原 称 胡齐斯坦
。
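Space-separated output of this kind is easy to load into a program. The sketch below shows one way to turn it into a token list and to check that segmentation only inserted spaces, without changing any characters. The helper names are hypothetical and the sample strings are a short excerpt adapted from the text above.

```python
# Punctuation tokens to drop when only words are of interest (assumed list).
PUNCTUATION = {",", "。", "、", ",", "."}

def tokens_from_segmented(segmented):
    """Split segmenter output on whitespace into a list of tokens."""
    return segmented.split()

def strip_whitespace(text):
    """Remove every whitespace character from the text."""
    return "".join(text.split())

def is_consistent(raw, segmented):
    """True if raw and segmented text differ only in whitespace."""
    return strip_whitespace(raw) == strip_whitespace(segmented)

raw = "中部为干燥的盆地,形成许多沙漠"
segmented = "中部 为 干燥 的 盆地 , 形成 许多 沙漠"

tokens = tokens_from_segmented(segmented)
words = [t for t in tokens if t not in PUNCTUATION]
print(len(tokens), len(words), is_consistent(raw, segmented))
```

A consistency check of this sort is a cheap way to catch encoding problems, which otherwise tend to show up only later as corrupted terms.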
We chose the “Afghanistan” section of the Chinese online encyclopedia Baidu as the object of investigation.
Baidu is an online encyclopedia in Chinese, developed and supported by the Chinese search engine Baidu. Like Baidu itself, the encyclopedia is censored in accordance with Chinese government regulations.
Our work was divided into several stages:
1. selection of raw texts about Afghanistan in Chinese
2. use of a text-processing program for automatic annotation of the text and isolation of terminological phrases
3. updating of the terminology
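The first and third stages can be outlined in a few lines of code. This is a rough sketch rather than the procedure actually followed: the function names, the keyword filter on 阿富汗 ("Afghanistan"), the made-up sample sentences and the sample glossary entries are all assumptions introduced for illustration.

```python
def select_texts(texts, keyword="阿富汗"):
    """Stage 1 (sketch): keep only raw texts that mention the keyword."""
    return [t for t in texts if keyword in t]

def update_glossary(glossary, candidates):
    """Stage 3 (sketch): add candidate terms missing from the glossary.

    Returns the enlarged glossary and the sorted list of newly added terms,
    so that the additions can be reviewed by hand.
    """
    added = sorted(set(candidates) - set(glossary))
    return set(glossary) | set(added), added

# Made-up sample sentences, not taken from the encyclopedia.
texts = ["阿富汗位于亚洲中西部", "伊朗最高峰"]
selected = select_texts(texts)

# Hypothetical glossary and candidate list.
glossary = {"喀布尔"}
candidates = ["喀布尔", "兴都库什山脉"]
glossary, added = update_glossary(glossary, candidates)
print(selected, added)
```

Returning the list of additions separately keeps the update step auditable, which matters because automatically extracted candidates usually include some noise.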
