ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.
What is Unitex?
• An open-source corpus processor, based on automata-oriented
technology
• mainly developed by Sébastien Paumier at the Institut Gaspard-Monge
(IGM), University of Paris-Est Marne-la-Vallée (France)
• It works on Windows, Linux, Mac OS and other systems
• It has lexical resources for French, English, Greek, Portuguese, Russian,
Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more
• http://www-igm.univ-mlv.fr/~unitex/
What is a corpus?
A corpus is a collection of pieces of language text in electronic form, selected
according to external criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic research.
Sinclair 2005
What is a Finite State Transducer (FST)?
An FST is a type of finite automaton which maps between two sets of symbols.
We can visualize an FST as a two-tape automaton that recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc in
the finite-state machine with two symbol strings, one from each tape.
Jurafsky 2000
Simple sentence splitting FST
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде. {S} Достигали высоты ...
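The effect of this transducer can be imitated with a short Python sketch. It is not how Sentence.grf works internally: the boundary rule here (a period, whitespace, then an upper-case letter) is an assumption chosen only to reproduce the example above, and the name mark_sentences is hypothetical.

```python
import re

# Assumed boundary rule: a period, whitespace, then an upper-case
# Latin or Cyrillic letter. The real Sentence.grf is far more detailed.
BOUNDARY = re.compile(r"\.\s+(?=[A-ZА-ЯЁ])")


def mark_sentences(text: str) -> str:
    """Keep the matched input and add the {S} delimiter (MERGE-like behaviour)."""
    return BOUNDARY.sub(". {S} ", text)


if __name__ == "__main__":
    print(mark_sentences("... в четвертичном периоде. Достигали высоты ..."))
```

The ellipses in the sample are left alone because no upper-case letter follows them.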
Get your corpus from a text file in Unitex
1. Run Unitex
• If you are working on Windows, the program will ask you to choose a
personal working directory, which you can change later in
Info > Preferences... > Directories.
2. Select Russian as your working language
• The first time you select a language, the program copies that
language's root directory, except the dictionaries, to your
personal directory.
3. Open corpus-ru-dbpedia-short-dea-1000.csv from the
Corpus subfolder: Text > Open...
4. Preprocess the text
• Apply Sentence.grf in MERGE mode
• Apply Replace.grf in REPLACE mode
• Tokenize the text
• Apply all default dictionaries
• Analyze unknown words as free compound words
Preprocessing
• Sentence.grf splits the text into sentences, adding an {S} tag before
the next sentence (language-dependent)
• Replace.grf removes ¬ (soft hyphen) and converts no-break spaces
to spaces
• The standard separators (the space, the tab and the newline characters)
are normalized
Tokenization
• is language (alphabet) dependent
• Newlines in a text are replaced by spaces
• A token can be:
• the sentence delimiter {S}
• the stop marker {STOP} to delimit texts
• a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}
• a contiguous sequence of letters (from alphabet.txt )
• one (and only one) non-letter character, e.g. a digit
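The token types listed above can be approximated with one regular expression. This is a rough sketch, not Unitex's tokenizer: the letter set is assumed to be Latin plus Cyrillic (Unitex reads it from alphabet.txt), whitespace runs are kept as a single space token, and the function name tokenize is hypothetical.

```python
import re

TOKEN = re.compile(
    r"\{S\}|\{STOP\}"              # sentence delimiter and stop marker
    r"|\{[^{}]+,[^{}]*\.[^{}]+\}"  # lexical tag: {form,lemma.CODES}
    r"|[A-Za-zА-Яа-яЁё]+"          # contiguous letters (assumed alphabet)
    r"|\s+"                        # whitespace run, normalized below
    r"|.",                         # exactly one non-letter character
    re.DOTALL,
)


def tokenize(text: str) -> list:
    """Split text into tokens; any whitespace run becomes one space."""
    return [" " if m.group().isspace() else m.group()
            for m in TOKEN.finditer(text)]


if __name__ == "__main__":
    print(tokenize("высоты 5,5 метров.{S}"))
```

Note that 5,5 yields three tokens, since each non-letter character is a token of its own.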
Applying dictionaries
• builds a subset of the dictionaries that contains only the forms
present in the text
• The corpus becomes "tagged", i.e. every token is assigned all of its
possible grammatical analyses
• e.g. семью is assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
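The lexical tags above follow the layout inflected form, lemma, codes, and inflectional codes. A minimal parsing sketch follows, assuming no escaped commas or dots inside the form or lemma; the names Entry and parse_entry are hypothetical.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    form: str               # inflected form as found in the text
    lemma: str              # canonical form
    codes: List[str]        # grammatical/semantic codes, e.g. N, anim(j)
    inflections: List[str]  # inflectional codes, e.g. aeF


def parse_entry(line: str) -> Entry:
    """Parse 'form,lemma.CODE+CODE:infl:infl' into its parts."""
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    # An empty lemma means the lemma is identical to the inflected form.
    return Entry(form, lemma or form, codes.split("+"), inflections)


if __name__ == "__main__":
    for tag in ("семью,семья.N+anim(j)+gen(F):aeF",
                "семью,.ADV",
                "семью,семь.NUM+plur:t"):
        print(parse_entry(tag))
```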
Hyponyms and hypernyms
Unlike synonymy and antonymy, which are lexical relations between word
forms, hyponymy/hypernymy is a semantic relation between word meanings:
e.g., {maple} is a hyponym of {tree} , and {tree} is a hyponym of {plant} .
Much attention has been devoted to hyponymy/hypernymy (variously called
subordination/superordination, subset/superset, or the ISA relation)...
A concept represented by the synset {x, x′, ...} is said to be a hyponym of the
concept represented by the synset {y, y′, ...} if native speakers of English accept
sentences constructed from such frames as An x is a (kind of) y. The relation
can be represented by including in {x, x′, ...} a pointer to its superordinate, and
including in {y, y′, ...} pointers to its hyponyms.
Miller 1993
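The pointer representation described in the quotation can be sketched as a small data structure. This is only an illustration of the idea, not WordNet's actual storage format; the names Synset, add_hyponym and hypernym_chain are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)
class Synset:
    words: Set[str]
    hypernym: Optional["Synset"] = None                     # pointer to the superordinate
    hyponyms: List["Synset"] = field(default_factory=list)  # pointers to subordinates


def add_hyponym(parent: Synset, child: Synset) -> None:
    """Record the relation in both directions."""
    child.hypernym = parent
    parent.hyponyms.append(child)


def hypernym_chain(synset: Synset) -> List[Synset]:
    """Follow superordinate pointers upwards from the given synset."""
    chain = []
    while synset.hypernym is not None:
        synset = synset.hypernym
        chain.append(synset)
    return chain


if __name__ == "__main__":
    plant, tree, maple = Synset({"plant"}), Synset({"tree"}), Synset({"maple"})
    add_hyponym(plant, tree)
    add_hyponym(tree, maple)
    print([s.words for s in hypernym_chain(maple)])
```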
Hyponym and hypernym mining from
Russian texts
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов .
Indicators
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
1. Text > Locate pattern...
2. Type род into Regular expression
3. Select Index all utterances in text in Search limitation
4. Click Search
Concordance
• hyponyms and hypernyms are nouns
• вымерший (participle) and широколиственных (adjective) can be
omitted
Patterns in Unitex
1. Text > Locate pattern...
2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
3. Click Search
2 matches
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
Lexical masks
• <род> : matches all the entries that have род as canonical form
• <стать.V> : matches all entries having стать as canonical form and
the grammatical code V
• <V> : matches all entries having the grammatical code V
• {стану,стать.V} or <стану,стать.V> : matches all the entries
having стану as inflected form, стать as canonical form and the
grammatical code V
Lexical masks. Special symbols
• <E> : the empty word or epsilon. Matches the empty string
• <TOKEN> : matches any token, except the space; used by default for
morphological filters
• <MOT> : matches any token that consists of letters
• <MIN> : matches any lower-case token
• <MAJ> : matches any upper-case token
• <PRE> : matches any token that starts with a capital letter
• <DIC> : matches any word that is present in the dictionaries of the text
• <SDIC> : matches any simple word in the text dictionaries
• <CDIC> : matches any composed word in the dictionaries of the text
• <TDIC> : matches any tagged token like {XXX,XXX.XXX}
• <NB> : matches any contiguous sequence of digits (1234 is matched but
not 1 234)
• <#> : prohibits the presence of space
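The case- and digit-based symbols can be imitated with simple predicates over a single token. This sketch covers only <MOT>, <MIN>, <MAJ>, <PRE> and <NB>; the dictionary-based symbols need a lexicon and are left out. <MAJ> is treated as the upper-case mask, and Python's str methods stand in for Unitex's alphabet file. The name matching_masks is hypothetical.

```python
# Each mask is approximated by a predicate on one token.
MASKS = {
    "<MOT>": lambda t: t.isalpha(),                    # letters only
    "<MIN>": lambda t: t.isalpha() and t.islower(),    # all lower-case
    "<MAJ>": lambda t: t.isalpha() and t.isupper(),    # all upper-case
    "<PRE>": lambda t: t.isalpha() and t[0].isupper(), # starts with a capital
    "<NB>": lambda t: t.isdigit(),                     # contiguous digits
}


def matching_masks(token: str) -> list:
    """Return the special symbols whose predicate accepts the token."""
    return [mask for mask, accepts in MASKS.items() if accepts(token)]


if __name__ == "__main__":
    for token in ("род", "ЮУрГУ", "1234"):
        print(token, matching_masks(token))
```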
Graphs in Unitex
• can match text (Finite State Automata)
• can produce new output text (Finite State Transducers)
• in MERGE mode combine the matched input text and the output text
(useful for tagging)
• in REPLACE mode convert the matched input text into the output
text
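The difference between the two modes can be shown with a toy function that applies one input/output pair. It is a sketch only: a Python regular expression stands in for the graph, the output is placed before the matched input in MERGE mode, and the function name apply_transducer and the {GENUS} output are hypothetical.

```python
import re


def apply_transducer(text: str, pattern: str, output: str, mode: str) -> str:
    """Apply one input/output pair in MERGE or REPLACE mode."""
    if mode == "MERGE":
        # Keep the matched input and add the output in front of it.
        return re.sub(pattern, lambda m: output + m.group(), text)
    if mode == "REPLACE":
        # Discard the matched input and emit only the output.
        return re.sub(pattern, lambda m: output, text)
    raise ValueError("mode must be MERGE or REPLACE")


if __name__ == "__main__":
    text = "вымерший род млекопитающих"
    print(apply_transducer(text, "род", "{GENUS}", "MERGE"))
    print(apply_transducer(text, "род", "{GENUS}", "REPLACE"))
    # Removing soft hyphens, as Replace.grf does, is a REPLACE with empty output.
    print(apply_transducer("пери\u00adоде", "\u00ad", "", "REPLACE"))
```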
A graph for matching text
1. FSGraph > New
2. Click the initial state (the arrow), then Ctrl-click an empty area to
create a new box connected to the initial state; type <N> and
press Enter
3. Create a — box, connected to the <N> box
4. Create a род box, connected to the — box
5. Create a <N> box, connected to the род box
6. Click on the second <N> box, click on the final state (a circle with a
square inside) to connect these 2 boxes
7. Create a <V:S> box between the — and род boxes
8. Create a <A>+<!DIC> box between the род and <N> boxes
9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save
To apply the graph: Text > Locate pattern... , select Graph under Locate pattern
in the form of, set match-hyponyms.grf , and click Search
Transducers in Unitex
1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add
{[ before the matched noun when the graph is applied in MERGE
mode
2. Click on the <N>/{[ and click on the — box to disconnect these boxes
3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.
It will add ]=HYPONYM} after the matched noun
4. Modify the second <N> box in the same way to add a HYPERNYM tag to it
5. Save the graph as tag-hyponyms.grf
Tagging hyponyms and hypernyms
1. Text > Locate pattern...
2. Set tag-hyponyms.grf
3. Select Merge with input text in Grammar outputs
4. Click Search
5. Build concordance
• The matched and tagged texts are stored in the concord.ind
file in the corpus folder
corpus-ru-dbpedia-short-dea-1000_snt
{[Мамонты]=HYPONYM} — вымерший род
{[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных
{[деревьев]=HYPERNYM} семейства Буковые
• We can then use some script to extract tagged hyponyms and
hypernyms...
• or mine them right in Unitex in the REPLACE mode
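A script for the first option could look like the sketch below. It assumes each tagged match is available as one line of text in the format shown above, and it returns the inflected forms exactly as tagged, without lemmatizing them. The name extract_pairs is hypothetical.

```python
import re

# Matches {[word]=HYPONYM} and {[word]=HYPERNYM} as produced by tag-hyponyms.grf.
TAGGED = re.compile(r"\{\[([^\]]+)\]=(HYPONYM|HYPERNYM)\}")


def extract_pairs(lines):
    """Return (hyponym, hypernym) pairs from tagged concordance lines."""
    pairs = []
    for line in lines:
        tags = {label: word for word, label in TAGGED.findall(line)}
        if "HYPONYM" in tags and "HYPERNYM" in tags:
            pairs.append((tags["HYPONYM"], tags["HYPERNYM"]))
    return pairs


if __name__ == "__main__":
    tagged = [
        "{[Мамонты]=HYPONYM} — вымерший род "
        "{[млекопитающих]=HYPERNYM} из семейства слоновых",
        "{[Бук]=HYPONYM} — род широколиственных "
        "{[деревьев]=HYPERNYM} семейства Буковые",
    ]
    for hyponym, hypernym in extract_pairs(tagged):
        print(hyponym, "->", hypernym)
```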
Mining hyponyms and hypernyms
1. Open match-hyponyms.grf : FSGraph > Open...
2. Click on the first <N> box, right-click on it and select
Surround with > Morphological mode
3. Click on the first <N> box and change it to <N>/$hyponym$ to store
the matched noun with all morphological information in the
$hyponym$ variable
4. Modify the second <N> box to store the matched noun in variable
$hypernym$ in the morphological mode
5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the
final state
6. Save this graph as mine-hyponyms.grf
7. In Info > Preferences... > Morphological dictionaries add
Dela/CISLEXru_igrok.bin
Mining hyponyms and hypernyms
1. Set this graph in Text > Locate pattern...
2. Select Replace recognized sequences in Grammar outputs
3. Click Search
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
Mining hyponyms and hypernyms
1. Why so many Бук outputs? Let's look the word up in the dictionary: DELA >
Lookup... , select CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:
change the first <N> box to <N~PN:n>
2 outputs
млекопитающее: мамонт
дерево: бук
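Why <N~PN:n> leaves a single output can be checked against the four dictionary entries for Бук. The sketch below reads the mask as: has code N, lacks code PN, and has at least one inflectional code containing n. It is a simplified reading of the mask, not Unitex's matcher; the second entry is truncated in the source and is copied as-is. The function names are hypothetical.

```python
ENTRIES = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    # Truncated in the source; kept exactly as it appears there.
    "Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]


def parse(line):
    """Split 'form,lemma.CODES:infl:...' into lemma, code set, inflections."""
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    return lemma or form, set(codes.split("+")), inflections


def matches_mask(line):
    """Simplified <N~PN:n>: code N, no code PN, an inflection containing n."""
    _, codes, inflections = parse(line)
    return ("N" in codes and "PN" not in codes
            and any("n" in code for code in inflections))


if __name__ == "__main__":
    print([parse(e)[0] for e in ENTRIES if matches_mask(e)])
```

The two proper-name entries are rejected by ~PN, and the бука entry is rejected because neither of its inflectional codes contains n.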
References
1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics,
and Speech Recognition. Prentice Hall.
2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).
Introduction to WordNet: An on-line lexical database. International
Journal of Lexicography, 3(4), 235-244.
3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est
Marne-la-Vallée. January 15, 2015,
http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing
Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow
Books: 1-16. Available online from
http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
Text Processing in Unitex
• PatternSim (github.com/cental/PatternSim) — a tool for calculating
semantic similarity between words from a text corpus based on
lexico-syntactic patterns
• Normatex (github.com/avlukanin/normatex) — Russian text normalization
for speech synthesis, machine translation and other natural language
processing tasks
• Unitext Tutorial (github.com/avlukanin/unitextutorial) — the slides and
source files used in this tutorial
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
• about.me/alukanin
• @avlukanin
• artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/unitextutorial
37

More Related Content

Similar to Text Processing with Finite State Transducers in Unitex

Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfDeptii Chaudhari
 
Wip2012 01cs
Wip2012 01csWip2012 01cs
Wip2012 01cslouzi1991
 
Natural Language parsing.pptx
Natural Language parsing.pptxNatural Language parsing.pptx
Natural Language parsing.pptxsiddhantroy13
 
Ch3 4 regular expression and grammar
Ch3 4 regular expression and grammarCh3 4 regular expression and grammar
Ch3 4 regular expression and grammarmeresie tesfay
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding Systeminscit2006
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Computational model language and grammar bnf
Computational model language and grammar bnfComputational model language and grammar bnf
Computational model language and grammar bnfTaha Shakeel
 
CH 2.pptx
CH 2.pptxCH 2.pptx
CH 2.pptxObsa2
 
Using Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUUsing Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUSkills Matter
 
MorphologyAndFST.pdf
MorphologyAndFST.pdfMorphologyAndFST.pdf
MorphologyAndFST.pdfssuser97943d
 
Declare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionDeclare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionEelco Visser
 
Basic techniques in nlp
Basic techniques in nlpBasic techniques in nlp
Basic techniques in nlpSumit Sony
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguisticsIrum Malik
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfHabtamu100
 
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmUnit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmDhruvKushwaha12
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 

Similar to Text Processing with Finite State Transducers in Unitex (20)

Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdf
 
Wip2012 01cs
Wip2012 01csWip2012 01cs
Wip2012 01cs
 
Natural Language parsing.pptx
Natural Language parsing.pptxNatural Language parsing.pptx
Natural Language parsing.pptx
 
Ch3 4 regular expression and grammar
Ch3 4 regular expression and grammarCh3 4 regular expression and grammar
Ch3 4 regular expression and grammar
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Perl Reference.ppt
Perl Reference.pptPerl Reference.ppt
Perl Reference.ppt
 
Computational model language and grammar bnf
Computational model language and grammar bnfComputational model language and grammar bnf
Computational model language and grammar bnf
 
CH 2.pptx
CH 2.pptxCH 2.pptx
CH 2.pptx
 
Using Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUUsing Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPU
 
MorphologyAndFST.pdf
MorphologyAndFST.pdfMorphologyAndFST.pdf
MorphologyAndFST.pdf
 
Declare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionDeclare Your Language: Syntax Definition
Declare Your Language: Syntax Definition
 
grammer genration
grammer genration grammer genration
grammer genration
 
Basic techniques in nlp
Basic techniques in nlpBasic techniques in nlp
Basic techniques in nlp
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Chapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdfChapter 2 Text Operation.pdf
Chapter 2 Text Operation.pdf
 
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmmUnit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
Unit-1 PPL PPTvvhvmmmmmmmmmmmmmmmmmmmmmm
 
Syntax
SyntaxSyntax
Syntax
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 

More from Artem Lukanin

Предварительная обработка и разметка корпуса текстов
Предварительная обработка и разметка корпуса текстовПредварительная обработка и разметка корпуса текстов
Предварительная обработка и разметка корпуса текстовArtem Lukanin
 
Проектирование корпусов
Проектирование корпусовПроектирование корпусов
Проектирование корпусовArtem Lukanin
 
Классификация корпусов
Классификация корпусовКлассификация корпусов
Классификация корпусовArtem Lukanin
 
Основные понятия корпусной лингвистики
Основные понятия корпусной лингвистикиОсновные понятия корпусной лингвистики
Основные понятия корпусной лингвистикиArtem Lukanin
 
Особые корпусы текстов
Особые корпусы текстовОсобые корпусы текстов
Особые корпусы текстовArtem Lukanin
 
Корпусная лингвистика
Корпусная лингвистикаКорпусная лингвистика
Корпусная лингвистикаArtem Lukanin
 
Компьютерная лексикография
Компьютерная лексикографияКомпьютерная лексикография
Компьютерная лексикографияArtem Lukanin
 
Научно-техническая лексикография
Научно-техническая лексикографияНаучно-техническая лексикография
Научно-техническая лексикографияArtem Lukanin
 
Структура значения лексемы
Структура значения лексемыСтруктура значения лексемы
Структура значения лексемыArtem Lukanin
 
Семантический метаязык
Семантический метаязыкСемантический метаязык
Семантический метаязыкArtem Lukanin
 
Классический метод анализа языка на лексико-семантическом уровне
Классический метод анализа языка на лексико-семантическом уровнеКлассический метод анализа языка на лексико-семантическом уровне
Классический метод анализа языка на лексико-семантическом уровнеArtem Lukanin
 
Типология словарей
Типология словарейТипология словарей
Типология словарейArtem Lukanin
 
Понятие лексикографии
Понятие лексикографииПонятие лексикографии
Понятие лексикографииArtem Lukanin
 
Семантическое поле
Семантическое полеСемантическое поле
Семантическое полеArtem Lukanin
 
Введение в информационный поиск
Введение в информационный поискВведение в информационный поиск
Введение в информационный поискArtem Lukanin
 
Системы автоматического распознавания речи
Системы автоматического распознавания речиСистемы автоматического распознавания речи
Системы автоматического распознавания речиArtem Lukanin
 
Системы автоматического синтеза речи
Системы автоматического синтеза речиСистемы автоматического синтеза речи
Системы автоматического синтеза речиArtem Lukanin
 
Криптография
КриптографияКриптография
КриптографияArtem Lukanin
 
Системы аннотирования и реферирования
Системы аннотирования и реферированияСистемы аннотирования и реферирования
Системы аннотирования и реферированияArtem Lukanin
 
Подъязыки в системах машинного перевода
Подъязыки в системах машинного переводаПодъязыки в системах машинного перевода
Подъязыки в системах машинного переводаArtem Lukanin
 

More from Artem Lukanin (20)

Предварительная обработка и разметка корпуса текстов
Предварительная обработка и разметка корпуса текстовПредварительная обработка и разметка корпуса текстов
Предварительная обработка и разметка корпуса текстов
 
Проектирование корпусов
Проектирование корпусовПроектирование корпусов
Проектирование корпусов
 
Классификация корпусов
Классификация корпусовКлассификация корпусов
Классификация корпусов
 
Основные понятия корпусной лингвистики
Основные понятия корпусной лингвистикиОсновные понятия корпусной лингвистики
Основные понятия корпусной лингвистики
 
Особые корпусы текстов
Особые корпусы текстовОсобые корпусы текстов
Особые корпусы текстов
 
Корпусная лингвистика
Корпусная лингвистикаКорпусная лингвистика
Корпусная лингвистика
 
Компьютерная лексикография
Компьютерная лексикографияКомпьютерная лексикография
Компьютерная лексикография
 
Научно-техническая лексикография
Научно-техническая лексикографияНаучно-техническая лексикография
Научно-техническая лексикография
 
Структура значения лексемы
Структура значения лексемыСтруктура значения лексемы
Структура значения лексемы
 
Семантический метаязык
Семантический метаязыкСемантический метаязык
Семантический метаязык
 
Классический метод анализа языка на лексико-семантическом уровне
Классический метод анализа языка на лексико-семантическом уровнеКлассический метод анализа языка на лексико-семантическом уровне
Классический метод анализа языка на лексико-семантическом уровне
 
Типология словарей
Типология словарейТипология словарей
Типология словарей
 
Понятие лексикографии
Понятие лексикографииПонятие лексикографии
Понятие лексикографии
 
Семантическое поле
Семантическое полеСемантическое поле
Семантическое поле
 
Введение в информационный поиск
Введение в информационный поискВведение в информационный поиск
Введение в информационный поиск
 
Системы автоматического распознавания речи
Системы автоматического распознавания речиСистемы автоматического распознавания речи
Системы автоматического распознавания речи
 
Системы автоматического синтеза речи
Системы автоматического синтеза речиСистемы автоматического синтеза речи
Системы автоматического синтеза речи
 
Криптография
КриптографияКриптография
Криптография
 
Системы аннотирования и реферирования
Системы аннотирования и реферированияСистемы аннотирования и реферирования
Системы аннотирования и реферирования
 
Подъязыки в системах машинного перевода
Подъязыки в системах машинного переводаПодъязыки в системах машинного перевода
Подъязыки в системах машинного перевода
 

Recently uploaded

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 

Recently uploaded (20)

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 

Text Processing with Finite State Transducers in Unitex

  • 1. ANALYSIS OF IMAGES, SOCIAL NETWORKS,AND TEXTS April, 9-11th, 2015, Yekaterinburg Text Processing with Finite State Transducers in Unitex Artem Lukanin This work is partially supported by the RFH grant #13-04-12020 “New open electronic thesaurus for Russian”.
  • 2. What is Unitex? • An open-source corpus processor, based on automata-oriented technology • mainly developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France) • It works on Windows, Linux, Mac OS and other systems • It has lexical resources for French, English, Greek, Portuguese, Russian, Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more • http://www-igm.univ-mlv.fr/~unitex/ 2
  • 3. What is corpus? A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. Sinclair 2005 “ 3
  • 4. What is a Finite State Transducer (FST)? “[An] FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton that recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape.” (Jurafsky 2000)
  • 5. Simple sentence-splitting FST
      Input:  ... в четвертичном периоде. Достигали высоты ...
      Output: ... в четвертичном периоде. {S} Достигали высоты ...
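The behaviour on this slide can be sketched as a tiny character-level transducer. This is only an illustration under simple assumptions (a sentence ends at `.`, `!` or `?` followed by whitespace and a capital letter); it is not the logic of Unitex's Sentence.grf, and the state names and the function name `split_sentences` are made up for the example.

```python
def split_sentences(text: str) -> str:
    """Copy input to output, emitting "{S} " before a capital letter that
    follows sentence-final punctuation plus whitespace."""
    state = "TEXT"  # states: TEXT -> PUNCT -> GAP -> TEXT
    out = []
    for ch in text:
        # Output side of the transducer: insert the delimiter on GAP -> capital.
        if state == "GAP" and ch.isupper():
            out.append("{S} ")
        # Input side: update the state from the current character.
        if ch in ".!?":
            state = "PUNCT"
        elif ch.isspace() and state in ("PUNCT", "GAP"):
            state = "GAP"
        else:
            state = "TEXT"
        out.append(ch)
    return "".join(out)


print(split_sentences("... в четвертичном периоде. Достигали высоты ..."))
```

A real sentence grammar also has to handle abbreviations, initials and similar cases, which this sketch ignores.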
  • 6. Get your corpus from a text file in Unitex
      1. Run Unitex
          • If you are working on Windows, the program will ask you to choose a personal working directory, which you can change later in Info > Preferences... > Directories.
      2. Select Russian as your working language
          • The first time you use a language, the program copies that language's root directory, except the dictionaries, to your personal directory.
  • 7. Get your corpus from a text file in Unitex
      3. Open corpus-ru-dbpedia-short-dea-1000.csv from the Corpus subfolder: Text > Open...
      4. Preprocess the text
          • Apply Sentence.grf in MERGE mode
          • Apply Replace.grf in REPLACE mode
          • Tokenize the text
          • Apply all default dictionaries
          • Analyze unknown words as free compound words
  • 8. Preprocessing
      • Sentence.grf splits the text into sentences by adding the {S} tag before each following sentence (language dependent)
      • Replace.grf removes ¬ (soft hyphen) and converts no-break spaces to spaces
      • The standard separators (the space, tab and newline characters) are normalized
  • 9. Tokenization
      • Is language (alphabet) dependent
      • Newlines in a text are replaced by spaces
      • A token can be:
          • the sentence delimiter {S}
          • the stop marker {STOP} to delimit texts
          • a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}
          • a contiguous sequence of letters (from alphabet.txt)
          • one (and only one) non-letter character, e.g. a digit
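The token types listed above can be approximated with one regular expression. This is a rough sketch, not Unitex's tokenizer: it assumes that "letter" means any Unicode letter (Unitex reads the letters from alphabet.txt instead), and the names `TOKEN` and `tokenize` are illustrative.

```python
import re

TOKEN = re.compile(
    r"\{S\}"          # sentence delimiter
    r"|\{STOP\}"      # stop marker
    r"|\{[^{}]+\}"    # lexical tag such as {ЮУрГУ,.N+ORG+gen(M)}
    r"|[^\W\d_]+"     # contiguous sequence of letters
    r"|."             # exactly one non-letter character (digit, space, ...)
)


def tokenize(text: str) -> list:
    # Newlines are replaced by spaces before tokenizing, as on the slide.
    return TOKEN.findall(text.replace("\n", " "))


print(tokenize("Бук 12{S}"))
```

Note that each digit becomes its own token, which is why a number such as 12 is two tokens at this level.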
  • 10. Applying dictionaries
      • Consists of building the subset of the dictionaries that contains only the forms present in the text
      • The corpus becomes "tagged", i.e. every token is assigned all of its possible grammatical forms
      • E.g. семью is assigned these lexical tags:
          семью,семья.N+anim(j)+gen(F):aeF
          семью,.ADV
          семью,семь.NUM+plur:t
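The lexical tags above follow the shape inflected,lemma.CODES:features. A minimal parsing sketch follows, assuming that an empty lemma means "same as the inflected form" and that the form contains no escaped commas or dots; `Entry` and `parse_entry` are names invented for this example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    inflected: str
    lemma: str
    codes: List[str]      # grammatical and semantic codes, e.g. N, gen(F)
    features: List[str]   # inflectional codes, e.g. aeF


def parse_entry(line: str) -> Entry:
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes_part, *features = info.split(":")
    # An empty lemma is taken to mean that lemma and inflected form coincide.
    return Entry(form, lemma or form, codes_part.split("+"), features)


for tag in ("семью,семья.N+anim(j)+gen(F):aeF", "семью,.ADV", "семью,семь.NUM+plur:t"):
    print(parse_entry(tag))
```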
  • 11. Hyponyms and hypernyms “Unlike synonymy and antonymy, which are lexical relations between word forms, hyponymy/hypernymy is a semantic relation between word meanings: e.g., {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}. Much attention has been devoted to hyponymy/hypernymy (variously called subordination/superordination, subset/superset, or the ISA relation)...”
  • 12. Hyponyms and hypernyms “A concept represented by the synset {x, x′, ...} is said to be a hyponym of the concept represented by the synset {y, y′, ...} if native speakers of English accept sentences constructed from such frames as An x is a (kind of) y. The relation can be represented by including in {x, x′, ...} a pointer to its superordinate, and including in {y, y′, ...} pointers to its hyponyms.” (Miller 1993)
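The pointer representation described in the quotation can be sketched with two mappings. This is a toy illustration using the maple/tree/plant example from the previous slide; single words stand in for whole synsets, and the names `hypernym_of`, `hyponyms_of` and `is_a` are illustrative.

```python
from collections import defaultdict

# Pointer from each concept to its superordinate (hypernym).
hypernym_of = {"maple": "tree", "tree": "plant"}

# Inverse pointers from each concept to its hyponyms.
hyponyms_of = defaultdict(list)
for hypo, hyper in hypernym_of.items():
    hyponyms_of[hyper].append(hypo)


def is_a(x: str, y: str) -> bool:
    """True if y is reachable from x by following hypernym pointers."""
    while x in hypernym_of:
        x = hypernym_of[x]
        if x == y:
            return True
    return False


print(is_a("maple", "plant"), hyponyms_of["tree"])
```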
  • 13. Hyponym and hypernym mining from Russian texts Мамонты — вымерший род млекопитающих из семейства слоновых, живший в четвертичном периоде.{S} Достигали высоты 5,5 метров и массы тела 10—12 тонн.{S} Таким образом, мамонты были в два раза тяжелее самых крупных современных наземных млекопитающих — африканских слонов.
  • 14. Indicators
      Мамонты — вымерший род млекопитающих из семейства слоновых, живший в четвертичном периоде.{S}
      1. Text > Locate pattern...
      2. Type род into Regular expression
      3. Select Index all utterances in text in Search limitation
      4. Click Search
  • 15. Concordance
      • Hyponyms and hypernyms are nouns
      • вымерший (participle) and широколиственных (adjective) can be omitted
  • 16. Patterns in Unitex
      1. Text > Locate pattern...
      2. Regular expression: <N> — <V:S>* род (<A>+<!DIC>)* <N>
      3. Click Search
      2 matches:
          Мамонты — вымерший род млекопитающих из семейства слоновых
          Бук — род широколиственных деревьев семейства Буковые
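Outside Unitex, the idea behind this pattern can be imitated on tokens that already carry a part-of-speech code. The sketch below is only an approximation: the tokens are hand-tagged for the example (Unitex would take the codes from its dictionaries), each code is squeezed into one character so that an ordinary regex can run over the code string, and "P" and "U" are stand-ins chosen here for <V:S> and <!DIC>.

```python
import re

# Hypothetical hand-tagged tokens: (form, one-letter code).
BUK = [("Бук", "N"), ("—", "-"), ("род", "N"), ("широколиственных", "A"),
       ("деревьев", "N"), ("семейства", "N"), ("Буковые", "A")]
MAMMOTH = [("Мамонты", "N"), ("—", "-"), ("вымерший", "P"),
           ("род", "N"), ("млекопитающих", "N")]


def find_pair(tagged):
    # One character per token; the indicator word род gets its own symbol "r".
    codes = "".join("r" if form == "род" else code for form, code in tagged)
    # Counterpart of: <N> — <V:S>* род (<A>+<!DIC>)* <N>
    m = re.search(r"(N)-P*r[AU]*(N)", codes)
    if m is None:
        return None
    return tagged[m.start(1)][0], tagged[m.start(2)][0]


print(find_pair(BUK), find_pair(MAMMOTH))
```

Because every token maps to exactly one character, a match position in the code string is also the index of the token in the list.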
  • 17. Lexical masks
      • <род> : matches all the entries that have род as canonical form
      • <стать.V> : matches all entries having стать as canonical form and the grammatical code V
      • <V> : matches all entries having the grammatical code V
      • {стану,стать.V} or <стану,стать.V> : matches all the entries having стану as inflected form, стать as canonical form and the grammatical code V
  • 18. Lexical masks: special symbols
      • <E> : the empty word or epsilon. Matches the empty string
      • <TOKEN> : matches any token, except the space; used by default for morphological filters
      • <MOT> : matches any token that consists of letters
      • <MIN> : matches any lower-case token
      • <MAJ> : matches any upper-case token
      • <PRE> : matches any token that starts with a capital letter
  • 19. Lexical masks: special symbols
      • <DIC> : matches any word that is present in the dictionaries of the text
      • <SDIC> : matches any simple word in the text dictionaries
      • <CDIC> : matches any composed word in the dictionaries of the text
      • <TDIC> : matches any tagged token like {XXX,XXX.XXX}
      • <NB> : matches any contiguous sequence of digits (1234 is matched but not 1 234)
      • <#> : prohibits the presence of a space
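The case- and digit-based symbols can be pictured as simple predicates on a token. The sketch below is an assumption-laden illustration, not Unitex's implementation: it uses Python's Unicode notions of letter, case and digit rather than alphabet.txt, treats <MAJ> as the upper-case counterpart of <MIN>, and the function name `special_symbols` is invented.

```python
def special_symbols(token: str) -> set:
    """Return the subset of <MOT>, <MIN>, <MAJ>, <PRE>, <NB> that a token satisfies."""
    labels = set()
    if token.isalpha():
        labels.add("<MOT>")            # consists of letters
        if token.islower():
            labels.add("<MIN>")        # all lower case
        if token.isupper():
            labels.add("<MAJ>")        # all upper case
        if token[0].isupper():
            labels.add("<PRE>")        # starts with a capital letter
    elif token.isdigit():
        labels.add("<NB>")             # contiguous sequence of digits
    return labels


for tok in ("бук", "Бук", "ЮУрГУ", "1234", "1 234"):
    print(tok, sorted(special_symbols(tok)))
```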
  • 20. Graphs in Unitex
      • Can match text (Finite State Automata)
      • Can produce new output text (Finite State Transducers)
      • In MERGE mode, combine the matched input text and the output text (useful for tagging)
      • In REPLACE mode, convert the matched input text into the output text
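The difference between the two modes can be mimicked with a regular-expression substitution. This is a loose analogy only: a real graph attaches outputs to individual boxes, whereas this sketch has a single pattern and a single output string placed before the match, and `apply_rule` is a name made up for the example.

```python
import re


def apply_rule(text: str, pattern: str, output: str, mode: str) -> str:
    if mode == "MERGE":
        # Keep the matched input and put the output in front of it.
        return re.sub(pattern, lambda m: output + m.group(0), text)
    if mode == "REPLACE":
        # Drop the matched input and keep only the output.
        return re.sub(pattern, lambda m: output, text)
    raise ValueError(f"unknown mode: {mode}")


# MERGE: tag the start of the next sentence.
print(apply_rule("периоде. Достигали", "Достигали", "{S} ", "MERGE"))
# REPLACE: delete a soft hyphen (U+00AD) by replacing it with nothing.
print(apply_rule("пери\u00adоде", "\u00ad", "", "REPLACE"))
```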
  • 21.
      1. FSGraph > New
      2. Click on the initial state (arrow), then Ctrl-click on an empty place to create a new box connected to the initial state, type <N>, and press Enter
  • 22. A graph for matching text
      3. Create a — box, connected to the <N> box
      4. Create a род box, connected to the — box
      5. Create a <N> box, connected to the род box
      6. Click on the second <N> box, then click on the final state (a circle with a square inside) to connect them
      7. Create a <V:S> box between the — and род boxes
      8. Create a <A>+<!DIC> box between the род and <N> boxes
      9. Save the graph as Graphs/match-hyponyms.grf: FSGraph > Save
  • 23. A graph for matching text In Text > Locate pattern..., select Graph under Locate pattern in the form of, click Set and choose match-hyponyms.grf, then click Search
  • 24. Transducers in Unitex
      1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add {[ before the matched noun when the graph is applied in MERGE mode
      2. Click on the <N>/{[ box and then on the — box to disconnect these boxes
      3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes. It will add ]=HYPONYM} after the matched noun
      4. Modify the second <N> box in the same way to add a HYPERNYM tag to it
  • 25. Transducers in Unitex
      5. Save the graph as tag-hyponyms.grf
  • 26. Tagging hyponyms and hypernyms
      1. Text > Locate pattern...
      2. Set tag-hyponyms.grf
      3. Select Merge with input text in Grammar outputs
      4. Click Search
      5. Build concordance
      • The matched and tagged texts are stored in the concord.ind file in the corpus folder corpus-ru-dbpedia-short-dea-1000_snt
  • 27. Tagging hyponyms and hypernyms
          {[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых
          {[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые
      • We can then use some script to extract tagged hyponyms and hypernyms...
      • or mine them right in Unitex in the REPLACE mode
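One possible shape for such an extraction script is sketched below. It assumes the tagged lines look exactly like the two examples on this slide, with one HYPONYM and one HYPERNYM tag per line; reading the lines from concord.ind is left out because that file's layout is not described here, and `TAG` and `extract_pair` are illustrative names.

```python
import re

# Matches tags of the form {[word]=HYPONYM} or {[word]=HYPERNYM}.
TAG = re.compile(r"\{\[(.+?)\]=(HYPONYM|HYPERNYM)\}")


def extract_pair(line: str):
    """Return (hyponym, hypernym) from a tagged line, or None if either is missing."""
    found = {label: text for text, label in TAG.findall(line)}
    if "HYPONYM" in found and "HYPERNYM" in found:
        return found["HYPONYM"], found["HYPERNYM"]
    return None


lines = [
    "{[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых",
    "{[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые",
]
for line in lines:
    print(extract_pair(line))
```

The extracted words are still inflected forms (Мамонты, млекопитающих); getting lemmas requires dictionary information.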
  • 28. Mining hyponyms and hypernyms
      1. Open match-hyponyms.grf: FSGraph > Open...
      2. Click on the first <N> box, right-click on it and select Surround with > Morphological mode
      3. Click on the first <N> box and change it to <N>/$hyponym$ to store the matched noun with all morphological information in the $hyponym$ variable
  • 29. Mining hyponyms and hypernyms
      4. Modify the second <N> box to store the matched noun in the variable $hypernym$ in the morphological mode
      5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the final state
      6. Save this graph as mine-hyponyms.grf
      7. In Info > Preferences... > Morphological dictionaries add Dela/CISLEXru_igrok.bin
  • 30. Mining hyponyms and hypernyms
  • 31. Mining hyponyms and hypernyms
      1. Set this graph in Text > Locate pattern...
      2. Select Replace recognized sequences in Grammar outputs
      3. Click Search
      Output:
          млекопитающее: мамонт
          дерево: бук
          дерево: бука
          дерево: Бук
  • 32. Mining hyponyms and hypernyms
      1. Why so many Бук outputs? Let's look in the dictionary: DELA > Lookup..., select CISLEXru_igrok.bin and enter this word
          Бук,.N+FAMN+PN+anim(o)+gen(M):neM
          Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
          бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
          бук,.N+anim(j)+gen(M):neM:ajeM
  • 33. Mining hyponyms and hypernyms
      2. Let's modify mine-hyponyms.grf to remove ambiguous outputs: change the first <N> box to <N~PN:n>
      2 outputs:
          млекопитающее: мамонт
          дерево: бук
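The effect of the <N~PN:n> mask on the dictionary entries from the lookup can be sketched as a filter. The reading used here is an assumption: the code N must be present, the code PN must be absent, and at least one inflectional code must contain the character n (taken to mark the nominative). The function name `matches_mask` and its parameters are invented, and the truncated second Бук entry is left out.

```python
def matches_mask(entry: str, required: str = "N", forbidden: str = "PN",
                 inflection: str = "n") -> bool:
    """Approximate <N~PN:n> on a line of the form inflected,lemma.CODES:features."""
    info = entry.split(".", 1)[1]
    codes_part, *features = info.split(":")
    codes = codes_part.split("+")
    has_inflection = any(all(ch in feat for ch in inflection) for feat in features)
    return required in codes and forbidden not in codes and has_inflection


entries = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",        # proper noun: rejected by ~PN
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",  # no code containing n: rejected by :n
    "бук,.N+anim(j)+gen(M):neM:ajeM",           # common noun with neM: kept
]
for e in entries:
    print(matches_mask(e), e)
```

Only the last entry survives, which is consistent with the single remaining output дерево: бук.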
  • 34. References
      1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
      2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.
  • 35. References
      3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est Marne-la-Vallée. January 15, 2015. http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
      4. Sinclair, J. (2005). "Corpus and Text - Basic Principles". In M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books, 1-16. Available online from http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
  • 36. Text Processing in Unitex
      • PatternSim (github.com/cental/PatternSim): a tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns
      • Normatex (github.com/avlukanin/normatex): Russian text normalization for speech synthesis, machine translation and other natural language processing tasks
      • Unitext Tutorial (github.com/avlukanin/unitextutorial): the slides and source files used in this tutorial
  • 37. Text Processing with Finite State Transducers in Unitex. Artem Lukanin
      • about.me/alukanin
      • @avlukanin
      • artyom.lukanin@gmail.com
      Slides: artyom.ice-lc.com/slides/unitextutorial