ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
What is Unitex?
• An open-source corpus processor, based on automata-oriented
technology
• mainly developed by Sébastien Paumier at the Institut Gaspard-Monge
(IGM), University of Paris-Est Marne-la-Vallée (France)
• It works on Windows, Linux, Mac OS and other systems
• It has lexical resources for French, English, Greek, Portuguese, Russian,
Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more
• http://www-igm.univ-mlv.fr/~unitex/
What is a corpus?
A corpus is a collection of pieces of language text in electronic form, selected
according to external criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic research.
Sinclair 2005
What is a Finite State Transducer (FST)?
An FST is a type of finite automaton which maps between two sets of symbols.
We can visualize an FST as a two-tape automaton that recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc in
the finite-state machine with two symbol strings, one from each tape.
Jurafsky 2000
Simple sentence splitting FST
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде. {S} Достигали высоты ...
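The mapping above can be sketched as a small character-level transducer. This is only an illustration of the two-tape idea, not the Sentence.grf graph shipped with Unitex: the three states and the rule "period, space, capital letter" are assumptions chosen to reproduce the example.

```python
def split_sentences(text: str) -> str:
    """Run a three-state transducer over `text`, one input symbol at a time."""
    state = 0  # 0: normal, 1: just read ".", 2: just read ". "
    output = []
    for ch in text:
        if state == 2 and ch.isupper():
            # Output tape gets the delimiter plus the capital letter.
            output.append("{S} " + ch)
            state = 0
        elif ch == ".":
            output.append(ch)
            state = 1
        elif ch == " " and state == 1:
            output.append(ch)
            state = 2
        else:
            # Any other symbol is copied unchanged and resets the machine.
            output.append(ch)
            state = 0
    return "".join(output)


print(split_sentences("в четвертичном периоде. Достигали высоты"))
```

Every transition reads one input symbol and writes a (possibly longer) output string, which is what makes this a transducer rather than a plain recognizer.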
Get your corpus from a text file in Unitex
1. Run Unitex
• If you are working on Windows, the program will ask you to choose a
personal working directory, which you can change later in
Info>Preferences...>Directories .
2. Select Russian as your working language
• The first time you select a language, the program copies that
language's root directory, except the dictionaries, to your
personal directory.
3. Open corpus-ru-dbpedia-short-dea-1000.csv from the
Corpus subfolder: Text > Open...
4. Preprocess the text
• Apply Sentence.grf in MERGE mode
• Apply Replace.grf in REPLACE mode
• Tokenize the text
• Apply all default dictionaries
• Analyze unknown words as free compound words
Preprocessing
• Sentence.grf splits the text into sentences, adding an {S} tag before
the next sentence (language-dependent)
• Replace.grf removes ¬ (soft hyphen) and converts no-break spaces
to spaces
• The standard separators (the space, the tab and the newline characters)
are normalized
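A rough Python equivalent of these preprocessing steps, for readers who want to see the effect outside Unitex. The function name is hypothetical, and the normalization rule used here (a run of separators becomes one newline if it contains a newline, otherwise one space) is an assumption about what "normalized" means.

```python
import re


def preprocess(text: str) -> str:
    text = text.replace("\u00ad", "")   # drop soft hyphens
    text = text.replace("\u00a0", " ")  # no-break space -> ordinary space

    def collapse(match: re.Match) -> str:
        # Assumed rule: keep a single newline if the run had one, else a space.
        return "\n" if "\n" in match.group() else " "

    return re.sub(r"[ \t\n]+", collapse, text)


print(repr(preprocess("a\u00adb\u00a0 c \n d\t\te")))
```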
Tokenization
• is language (alphabet) dependent
• Newlines in a text are replaced by spaces
• A token can be:
• the sentence delimiter {S}
• the stop marker {STOP} to delimit texts
• a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}
• a contiguous sequence of letters (from alphabet.txt )
• one (and only one) non-letter character, e.g. a digit
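The token types listed above can be approximated with a single regular expression. This is a sketch under two assumptions: "letter" is taken to mean any Unicode letter instead of the characters listed in alphabet.txt, and every brace-delimited span ({S}, {STOP}, lexical tags) is treated as one token.

```python
import re

# Alternatives are tried in order: {...} tokens, letter runs, any single character.
TOKEN_RE = re.compile(r"\{[^{}]+\}|[^\W\d_]+|.", re.DOTALL)


def tokenize(text: str) -> list[str]:
    text = text.replace("\n", " ")  # newlines are replaced by spaces
    return TOKEN_RE.findall(text)


print(tokenize("Бук — род.{S} 12"))
print(tokenize("{ЮУрГУ,.N+ORG+gen(M)}"))
```

Note that each digit and each space comes out as its own token, matching the "one (and only one) non-letter character" rule.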
Applying dictionaries
• consists of building the subset of the dictionaries that contains only
the forms present in the text
• The corpus becomes "tagged", i.e. every token is assigned all possible
grammatical forms
• e.g. семью is assigned these lexical tags:
  семью,семья.N+anim(j)+gen(F):aeF
  семью,.ADV
  семью,семь.NUM+plur:t
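The dictionary lines above follow the pattern inflected form, comma, canonical form, dot, codes joined by "+", then optional inflectional codes after ":". A minimal parser for that shape might look as follows. The Entry class and parse_entry function are hypothetical names, and the sketch ignores escaped commas or dots, which real dictionary files may contain.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    inflected: str
    lemma: str
    codes: list[str]
    inflections: list[str]


def parse_entry(line: str) -> Entry:
    inflected, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    # An empty canonical form means it is identical to the inflected form.
    return Entry(inflected, lemma or inflected, codes.split("+"), inflections)


for line in ["семью,семья.N+anim(j)+gen(F):aeF", "семью,.ADV", "семью,семь.NUM+plur:t"]:
    print(parse_entry(line))
```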
Hyponyms and hypernyms
Unlike synonymy and antonymy, which are lexical relations between word
forms, hyponymy/hypernymy is a semantic relation between word meanings:
e.g., {maple} is a hyponym of {tree} , and {tree} is a hyponym of {plant} .
Much attention has been devoted to hyponymy/hypernymy (variously called
subordination/superordination, subset/superset, or the ISA relation)...
A concept represented by the synset {x, x′,...} is said to be a hyponym of the
concept represented by the synset {y, y′,...} if native speakers of English accept
sentences constructed from such frames as An x is a (kind of) y. The relation
can be represented by including in {x, x′,...} a pointer to its superordinate, and
including in {y, y′,...} pointers to its hyponyms.
Miller 1993
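The pointer representation described in the quotation can be sketched with a plain dictionary. The data comes from the maple/tree/plant example; the function names are hypothetical, and following the pointers transitively is an assumption of this sketch.

```python
# Each concept points to its superordinate (hypernym).
hypernym_of = {"maple": "tree", "tree": "plant"}


def is_hyponym(x: str, y: str) -> bool:
    """True if y is reachable from x by following hypernym pointers."""
    while x in hypernym_of:
        x = hypernym_of[x]
        if x == y:
            return True
    return False


def hyponyms_of(y: str) -> list[str]:
    """Direct hyponyms of y, recovered by inverting the pointers."""
    return sorted(x for x, parent in hypernym_of.items() if parent == y)


print(is_hyponym("maple", "plant"), hyponyms_of("tree"))
```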
Hyponym and hypernym mining from
Russian texts
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов .
Indicators
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
1. Text > Locate pattern...
2. Type род into Regular expression
3. Select Index all utterances in text in Search limitation
4. Click Search
Concordance
• hyponyms and hypernyms are nouns
• вымерший (participle) and широколиственных (adjective) can be
omitted
Patterns in Unitex
1. Text > Locate pattern...
2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
3. Click Search
2 matches
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
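To make the pattern concrete outside Unitex, here is a sketch that walks over tokens that already carry a part-of-speech tag. The tags below are assigned by hand for illustration and are assumptions, not dictionary output: "V" stands in for the optional verbal form, "A" for adjectives and "UNK" for words missing from the dictionaries.

```python
def find_genus_pairs(tagged):
    """tagged: list of (token, tag) pairs; returns (hyponym, hypernym) pairs."""
    pairs = []
    n = len(tagged)
    for start in range(n):
        i = start
        if tagged[i][1] != "N":                       # first noun
            continue
        i += 1
        if i >= n or tagged[i][0] != "—":             # the dash
            continue
        i += 1
        while i < n and tagged[i][1] == "V":          # optional verbal forms
            i += 1
        if i >= n or tagged[i][0] != "род":           # the indicator word
            continue
        i += 1
        while i < n and tagged[i][1] in ("A", "UNK"):  # adjectives or unknown words
            i += 1
        if i < n and tagged[i][1] == "N":             # second noun
            pairs.append((tagged[start][0], tagged[i][0]))
    return pairs


mammoths = [("Мамонты", "N"), ("—", "PUNCT"), ("вымерший", "V"),
            ("род", "N"), ("млекопитающих", "N")]
beech = [("Бук", "N"), ("—", "PUNCT"), ("род", "N"),
         ("широколиственных", "A"), ("деревьев", "N")]
print(find_genus_pairs(mammoths), find_genus_pairs(beech))
```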
Lexical masks
• <род> : matches all the entries that have род as canonical form
• <стать.V> : matches all entries having стать as canonical form and
the grammatical code V
• <V> : matches all entries having the grammatical code V
• {стану,стать.V} or <стану,стать.V> : matches all the entries
having стану as inflected form, стать as canonical form and the
grammatical code V
Lexical masks. Special symbols
• <E> : the empty word or epsilon. Matches the empty string
• <TOKEN> : matches any token, except the space; used by default for
morphological filters
• <MOT> : matches any token that consists of letters
• <MIN> : matches any lower-case token
• <MAJ> : matches any upper-case token
• <PRE> : matches any token that starts with a capital letter
• <DIC> : matches any word that is present in the dictionaries of the text
• <SDIC> : matches any simple word in the text dictionaries
• <CDIC> : matches any composed word in the dictionaries of the text
• <TDIC> : matches any tagged token like {XXX,XXX.XXX}
• <NB> : matches any contiguous sequence of digits (1234 is matched but
not 1 234)
• <#> : prohibits the presence of space
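The case-based and numeric symbols can be mimicked with ordinary string predicates. The sketch below is an approximation for illustration only: the function name is hypothetical, "letter" again means any Unicode letter, and the dictionary-based symbols (<DIC>, <SDIC>, <CDIC>, <TDIC>) are left out because they need the text dictionaries.

```python
import re


def special_masks(token: str) -> set[str]:
    masks = set()
    if token.isalpha():
        masks.add("MOT")              # consists of letters
        if token.islower():
            masks.add("MIN")          # lower-case token
        if token.isupper():
            masks.add("MAJ")          # upper-case token
        if token[0].isupper():
            masks.add("PRE")          # starts with a capital letter
    if re.fullmatch(r"[0-9]+", token):
        masks.add("NB")               # contiguous digits, no spaces
    return masks


for t in ["Бук", "БУК", "бук", "1234", "1 234"]:
    print(t, sorted(special_masks(t)))
```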
Graphs in Unitex
• can match text (Finite State Automata)
• can produce new output text (Finite State Transducers)
• in MERGE mode combine the matched input text and the output text
(useful for tagging)
• in REPLACE mode convert the matched input text into the output
text
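The difference between the two modes can be shown with a regular expression standing in for a graph. This is a simplified analogy, not how Unitex applies graphs: apply_output is a hypothetical helper, and the output is placed to the left of the match in merge mode.

```python
import re


def apply_output(text: str, pattern: str, output: str, mode: str) -> str:
    if mode == "merge":
        # Keep the matched input and put the output in front of it.
        return re.sub(pattern, lambda m: output + m.group(0), text)
    if mode == "replace":
        # Discard the matched input and keep only the output.
        return re.sub(pattern, lambda m: output, text)
    raise ValueError(f"unknown mode: {mode}")


sentence = "Бук — род деревьев"
print(apply_output(sentence, "род", "<GENUS>", "merge"))
print(apply_output(sentence, "род", "<GENUS>", "replace"))
```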
A graph for matching text
1. FSGraph > New
2. Click on the initial state (arrow), then Ctrl-click on an empty place
to create a new box connected to the initial state, type <N> ,
press Enter
3. Create a — box, connected to the <N> box
4. Create a род box, connected to the — box
5. Create a <N> box, connected to the род box
6. Click on the second <N> box, click on the final state (a circle with a
square inside) to connect these 2 boxes
7. Create a <V:S> box between the — and род boxes
8. Create a <A>+<!DIC> box between the род and <N> boxes
9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save
Text > Locate pattern..., under Locate pattern in the form of select Graph,
set match-hyponyms.grf, then click Search
Transducers in Unitex
1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add
{[ before the matched noun when the graph is applied in MERGE
mode
2. Click on the <N>/{[ box, then click on the — box to disconnect them
3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.
It will add ]=HYPONYM} after the matched noun
4. Modify the second <N> box in the same way to add a HYPERNYM tag to it
5. Save the graph as tag-hyponyms.grf
Tagging hyponyms and hypernyms
1. Text > Locate pattern...
2. Set tag-hyponyms.grf
3. Select Merge with input text in Grammar outputs
4. Click Search
5. Build concordance
• The matched and tagged texts are stored in the concord.ind
file in the corpus folder
corpus-ru-dbpedia-short-dea-1000_snt
Tagging hyponyms and hypernyms
{[Мамонты]=HYPONYM} — вымерший род
{[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных
{[деревьев]=HYPERNYM} семейства Буковые
• We can then use some script to extract tagged hyponyms and
hypernyms...
• or mine them right in Unitex in the REPLACE mode
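As an example of the script option, the following sketch pulls the tagged pairs out of the merged text with a regular expression. It assumes the tag format shown above and that each HYPONYM tag is followed by its HYPERNYM tag; the function name is hypothetical.

```python
import re

TAG_RE = re.compile(r"\{\[([^\]]+)\]=(HYPONYM|HYPERNYM)\}")


def extract_pairs(tagged_text: str) -> list[tuple[str, str]]:
    pairs = []
    hyponym = None
    for word, label in TAG_RE.findall(tagged_text):
        if label == "HYPONYM":
            hyponym = word                  # remember until its hypernym shows up
        elif hyponym is not None:
            pairs.append((hyponym, word))
            hyponym = None
    return pairs


tagged = (
    "{[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых\n"
    "{[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые"
)
print(extract_pairs(tagged))
```

The extracted words are still inflected forms; lemmatizing them needs a dictionary lookup, which the REPLACE-mode graph handles inside Unitex.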
Mining hyponyms and hypernyms
1. Open match-hyponyms.grf : FSGraph > Open...
2. Click on the first <N> box, right-click on it and select
Surround with > Morphological mode
3. Click on the first <N> box and change it to <N>/$hyponym$ to store
the matched noun with all morphological information in the
$hyponym$ variable
4. Modify the second <N> box to store the matched noun in variable
$hypernym$ in the morphological mode
5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the
final state
6. Save this graph as mine-hyponyms.grf
7. In Info > Preferences... > Morphological dictionaries add
Dela/CISLEXru_igrok.bin
Mining hyponyms and hypernyms
1. Set this graph in Text > Locate pattern...
2. Select Replace recognized sequences in Grammar outputs
3. Click Search
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
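The four lines can be reproduced by assuming that every dictionary analysis of an ambiguous form yields its own output line. The sketch below illustrates that reading; the lemma lists are copied from the output above, and the LEMMAS table and mined_outputs function are hypothetical.

```python
from itertools import product

# Lemmas per word form, as read off the mined output (illustrative data).
LEMMAS = {
    "Мамонты": ["мамонт"],
    "млекопитающих": ["млекопитающее"],
    "Бук": ["бук", "бука", "Бук"],
    "деревьев": ["дерево"],
}


def mined_outputs(hyponym: str, hypernym: str) -> list[str]:
    # One "hypernym lemma: hyponym lemma" line per combination of analyses.
    return [f"{hyper}: {hypo}"
            for hyper, hypo in product(LEMMAS[hypernym], LEMMAS[hyponym])]


print(mined_outputs("Мамонты", "млекопитающих"))
print(mined_outputs("Бук", "деревьев"))
```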
Mining hyponyms and hypernyms
1. Why so many Бук outputs? Let's see in the dictionary: DELA >
Lookup... , select CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:
change the first <N> box to <N~PN:n>
2 outputs
млекопитающее: мамонт
дерево: бук
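The effect of the refined mask can be checked against the dictionary entries by hand or with a few lines of code. This sketch uses only the three entries that are shown in full, and it assumes the mask means: code N present, code PN absent, and at least one inflectional code containing n (read here as nominative). The helper names are hypothetical.

```python
def parse(line: str):
    inflected, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    return (lemma or inflected), set(codes.split("+")), inflections


def passes_mask(line: str) -> bool:
    """Rough stand-in for the <N~PN:n> lexical mask (assumed semantics)."""
    _, codes, inflections = parse(line)
    return "N" in codes and "PN" not in codes and any("n" in c for c in inflections)


ENTRIES = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]
kept = [parse(entry)[0] for entry in ENTRIES if passes_mask(entry)]
print(kept)
```

Only the common noun with a nominative reading survives, which is consistent with the single Бук output shown above.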
References
1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing:
An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition.
2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).
Introduction to WordNet: An on-line lexical database. International
Journal of Lexicography, 3(4), 235-244.
3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est
Marne-la-Vallée. January 15, 2015,
http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing
Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow
Books: 1-16. Available online from
http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
Text Processing in Unitex
• PatternSim (github.com/cental/PatternSim): a tool for calculating
semantic similarity between words from a text corpus based on
lexico-syntactic patterns
• Normatex (github.com/avlukanin/normatex): Russian text normalization
for speech synthesis, machine translation and other natural language
processing tasks
• Unitex Tutorial (github.com/avlukanin/unitextutorial): the slides and
source files used in this tutorial
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
• about.me/alukanin
• @avlukanin
• artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/unitextutorial
Anton Korsakov - Determination of an unmanned mobile object orientation by na...Anton Korsakov - Determination of an unmanned mobile object orientation by na...
Anton Korsakov - Determination of an unmanned mobile object orientation by na...AIST
 
Artem Kruglov and Yurii Chiryshev - Detection and Tracking of the Objects in...
Artem Kruglov and  Yurii Chiryshev - Detection and Tracking of the Objects in...Artem Kruglov and  Yurii Chiryshev - Detection and Tracking of the Objects in...
Artem Kruglov and Yurii Chiryshev - Detection and Tracking of the Objects in...AIST
 
Dmitrii Tihonkih - The Iterative Closest Points Algorithm and Affine Transfo...
Dmitrii Tihonkih - The Iterative Closest Points Algorithm and  Affine Transfo...Dmitrii Tihonkih - The Iterative Closest Points Algorithm and  Affine Transfo...
Dmitrii Tihonkih - The Iterative Closest Points Algorithm and Affine Transfo...AIST
 

Artem Lukanin - Text Processing with Finite State Transducers in Unitex

  • 1. ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS, April 9-11, 2015, Yekaterinburg. Text Processing with Finite State Transducers in Unitex. Artem Lukanin
  • 2. What is Unitex? • An open-source corpus processor based on automata-oriented technology • Mainly developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France) • It works on Windows, Linux, Mac OS and other systems • It has lexical resources for French, English, Greek, Portuguese, Russian, Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more • http://www-igm.univ-mlv.fr/~unitex/
  • 3. What is a corpus? "A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research." (Sinclair 2005)
  • 4. What is a Finite State Transducer (FST)? An FST "is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton that recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape." (Jurafsky 2000)
  • 5. Simple sentence splitting FST. Input: ... в четвертичном периоде. Достигали высоты ... ("... in the Quaternary period. Reached a height of ..."). Output: ... в четвертичном периоде. {S} Достигали высоты ...
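The before/after pair on slide 5 can be approximated outside Unitex with a single substitution. This is a minimal sketch, not the Sentence.grf graph itself: the name `split_sentences` and the boundary rule (a sentence-final mark, whitespace, then an upper-case Latin or Cyrillic letter) are assumptions made for illustration, and the real graph deals with abbreviations and other cases that this sketch ignores.

```python
import re

# Assumed boundary rule: ".", "!" or "?" followed by whitespace and an
# upper-case Latin or Cyrillic letter. This is a simplification.
_BOUNDARY = re.compile(r"([.!?])(\s+)(?=[A-ZА-ЯЁ])")


def split_sentences(text: str) -> str:
    """Insert the {S} delimiter in front of each detected new sentence."""
    # Keep the matched input (mark + whitespace) and add the output "{S} ",
    # which mirrors what a transducer does in MERGE mode.
    return _BOUNDARY.sub(r"\1\2{S} ", text)


if __name__ == "__main__":
    print(split_sentences("в четвертичном периоде. Достигали высоты"))
```

Because the rule only fires before an upper-case letter, abbreviations followed by a lower-case word are left alone, but an abbreviation followed by a proper noun would still be split incorrectly.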
  • 6. Get your corpus from a text file in Unitex 1. Run Unitex • If you are working on Windows, the program will ask you to choose a personal working directory, which you can change later in Info > Preferences... > Directories. 2. Select Russian as your working language • The first time you use a language, the program copies that language's root directory, except the dictionaries, to your personal directory.
  • 7. Get your corpus from a text file in Unitex 3. Open corpus-ru-dbpedia-short-dea-1000.csv from the Corpus subfolder: Text > Open... 4. Preprocess the text • Apply Sentence.grf in MERGE mode • Apply Replace.grf in REPLACE mode • Tokenize the text • Apply all default dictionaries • Analyze unknown words as free compound words
  • 8. Preprocessing • Sentence.grf splits the text into sentences, adding an {S} tag before the next sentence (language dependent) • Replace.grf removes ¬ (soft hyphen) and converts no-break spaces to spaces • The standard separators (the space, the tab and the newline characters) are normalized
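The character-level clean-up described on slide 8 can be sketched in a few lines. This is an illustration under stated assumptions, not Unitex code: the function name `normalize` is hypothetical, the soft hyphen is taken to be U+00AD, and the separator rule (a run containing a newline becomes one newline, any other run becomes one space) is one reading of how separators are normalized; the slide only says that they are normalized.

```python
import re

SOFT_HYPHEN = "\u00ad"      # assumed code point for the soft hyphen
NO_BREAK_SPACE = "\u00a0"


def _collapse(match: re.Match) -> str:
    # Assumed rule: a run of separators containing a newline becomes a
    # single newline; any other run becomes a single space.
    return "\n" if "\n" in match.group(0) else " "


def normalize(text: str) -> str:
    """Remove soft hyphens, convert no-break spaces, collapse separators."""
    text = text.replace(SOFT_HYPHEN, "")
    text = text.replace(NO_BREAK_SPACE, " ")
    return re.sub(r"[ \t\n]+", _collapse, text)


if __name__ == "__main__":
    print(repr(normalize("пе\u00adри\u00adод\u00a0 5,5\t\tм")))
```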
  • 9. Tokenization • Is language (alphabet) dependent • Newlines in a text are replaced by spaces • A token can be: • the sentence delimiter {S} • the stop marker {STOP} to delimit texts • a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)} • a contiguous sequence of letters (from alphabet.txt) • one (and only one) non-letter character, e.g. a digit
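The token types listed on slide 9 map naturally onto one alternation in a regular expression. The sketch below is an approximation: the name `tokenize` is hypothetical, and "letter" is taken to mean any Unicode letter, whereas Unitex reads the letter set from the language's alphabet.txt.

```python
import re

# Alternatives are tried left to right, so the special tokens come first.
_TOKEN = re.compile(
    r"\{S\}|\{STOP\}"                # sentence delimiter, stop marker
    r"|\{[^{}]+,[^{}]*\.[^{}]+\}"    # lexical tag: {inflected,lemma.codes}
    r"|[^\W\d_]+"                    # contiguous letters (assumption: Unicode)
    r"|."                            # exactly one non-letter character
)


def tokenize(text: str) -> list:
    """Split text into Unitex-like tokens; newlines are treated as spaces."""
    return _TOKEN.findall(text.replace("\n", " "))


if __name__ == "__main__":
    print(tokenize("{S}Достигали 5,5 м"))
```

Note that the space is itself a token and that a number such as 5,5 yields three tokens, which is why masks like `<NB>` and `<#>` exist at the pattern level.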
  • 10. Applying dictionaries • Consists of building the subset of the dictionaries that contains only the forms present in the text • The corpus becomes "tagged", i.e. every token is assigned all of its possible grammatical forms • e.g. семью is assigned these lexical tags: семью,семья.N+anim(j)+gen(F):aeF; семью,.ADV; семью,семь.NUM+plur:t
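The three tags on slide 10 share one layout: inflected form, comma, canonical form, dot, grammatical codes joined by `+`, then optional inflectional codes introduced by `:`. A minimal parser for that layout is sketched below. The names `Entry` and `parse_entry` are hypothetical, an empty canonical form is assumed to mean "same as the inflected form", and escaped commas or dots inside forms are not handled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    inflected: str
    lemma: str
    codes: list
    inflections: list


def parse_entry(line: str) -> Entry:
    """Parse one 'inflected,lemma.CODE+CODE:infl:infl' dictionary line."""
    inflected, rest = line.split(",", 1)
    lemma, rest = rest.split(".", 1)
    codes, *inflections = rest.split(":")
    # Assumption: an omitted lemma is identical to the inflected form.
    return Entry(inflected, lemma or inflected, codes.split("+"), inflections)


if __name__ == "__main__":
    for tag in ("семью,семья.N+anim(j)+gen(F):aeF", "семью,.ADV", "семью,семь.NUM+plur:t"):
        print(parse_entry(tag))
```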
  • 11. Hyponyms and hypernyms "Unlike synonymy and antonymy, which are lexical relations between word forms, hyponymy/hypernymy is a semantic relation between word meanings: e.g., {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}. Much attention has been devoted to hyponymy/hypernymy (variously called subordination/superordination, subset/superset, or the ISA relation)..."
  • 12. Hyponyms and hypernyms "A concept represented by the synset {x, x′,...} is said to be a hyponym of the concept represented by the synset {y, y′,...} if native speakers of English accept sentences constructed from such frames as An x is a (kind of) y. The relation can be represented by including in {x, x′,...} a pointer to its superordinate, and including in {y, y′,...} pointers to its hyponyms." (Miller 1993)
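  • 13. Hyponym and hypernym mining from Russian texts Мамонты — вымерший род млекопитающих из семейства слоновых, живший в четвертичном периоде.{S} Достигали высоты 5,5 метров и массы тела 10—12 тонн.{S} Таким образом, мамонты были в два раза тяжелее самых крупных современных наземных млекопитающих — африканских слонов. (English gloss: "Mammoths are an extinct genus of mammals from the elephant family that lived in the Quaternary period. They reached a height of 5.5 metres and a body mass of 10-12 tonnes. Thus, mammoths were twice as heavy as the largest modern land mammals, African elephants.")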
  • 14. Indicators Мамонты — вымерший род млекопитающих из семейства слоновых, живший в четвертичном периоде.{S} 1. Text > Locate pattern... 2. Type род ("genus") into Regular expression 3. Select Index all utterances in text in Search limitation 4. Click Search
  • 15. Concordance • Hyponyms and hypernyms are nouns • вымерший ("extinct", participle) and широколиственных ("broad-leaved", adjective) can be omitted
  • 16. Patterns in Unitex 1. Text > Locate pattern... 2. Regular expression: <N> — <V:S>* род (<A>+<!DIC>)* <N> 3. Click Search. 2 matches: 01. Мамонты — вымерший род млекопитающих из семейства слоновых 02. Бук — род широколиственных деревьев семейства Буковые
  • 17. Lexical masks • <род>: matches all the entries that have род as canonical form • <стать.V>: matches all entries having стать as canonical form and the grammatical code V • <V>: matches all entries having the grammatical code V • {стану,стать.V} or <стану,стать.V>: matches all the entries having стану as inflected form, стать as canonical form and the grammatical code V
  • 18. Lexical masks. Special symbols • <E>: the empty word or epsilon. Matches the empty string • <TOKEN>: matches any token, except the space; used by default for morphological filters • <MOT>: matches any token that consists of letters • <MIN>: matches any lower-case token • <MAJ>: matches any upper-case token • <PRE>: matches any token that starts with a capital letter
  • 19. Lexical masks. Special symbols • <DIC>: matches any word that is present in the dictionaries of the text • <SDIC>: matches any simple word in the text dictionaries • <CDIC>: matches any compound word in the dictionaries of the text • <TDIC>: matches any tagged token like {XXX,XXX.XXX} • <NB>: matches any contiguous sequence of digits (1234 is matched but not 1 234) • <#>: prohibits the presence of a space
  • 20. Graphs in Unitex • Can match text (Finite State Automata) • Can produce new output text (Finite State Transducers) • In MERGE mode they combine the matched input text and the output text (useful for tagging) • In REPLACE mode they convert the matched input text into the output text
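The difference between the two output modes on slide 20 can be shown with a toy one-arc transducer. This is a sketch only: `apply_transducer` is a hypothetical helper, the input side is an ordinary Python regular expression rather than a Unitex graph, and the output is placed before the matched input, as happens when a box such as `<N>/{[` is applied in MERGE mode.

```python
import re


def apply_transducer(text: str, pattern: str, output: str, mode: str) -> str:
    """Apply a single input/output pair to text in MERGE or REPLACE mode."""
    if mode == "MERGE":
        # Keep the matched input and put the output in front of it.
        return re.sub(pattern, lambda m: output + m.group(0), text)
    if mode == "REPLACE":
        # Drop the matched input and emit only the output.
        return re.sub(pattern, lambda m: output, text)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(apply_transducer("периоде. Достигали", "Достигали", "{S} ", "MERGE"))
    print(apply_transducer("5\u00a0метров", "\u00a0", " ", "REPLACE"))
```

The two calls correspond to what slide 7 asks for: sentence delimiters are merged into the text, while no-break spaces are replaced.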
  • 21. 1. FSGraph > New 2. Click on the initial state (arrow), then click on an empty area while holding Ctrl to create a new box connected to the initial state, type <N>, press Enter
  • 22. A graph for matching text 3. Create a — box, connected to the <N> box 4. Create a род box, connected to the — box 5. Create a <N> box, connected to the род box 6. Click on the second <N> box, then click on the final state (a circle with a square inside) to connect them 7. Create a <V:S> box between the — and род boxes 8. Create a <A>+<!DIC> box between the род and <N> boxes 9. Save the graph as Graphs/match-hyponyms.grf: FSGraph > Save
  • 23. A graph for matching text Text > Locate Pattern..., Locate pattern in the form of: Graph, Set match-hyponyms.grf, Search
  • 24. Transducers in Unitex 1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add {[ before the matched noun when the graph is applied in MERGE mode 2. Click on the <N>/{[ box and then on the — box to disconnect these boxes 3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes. It will add ]=HYPONYM} after the matched noun 4. Modify the second <N> box in the same way to add a HYPERNYM tag to it
  • 25. Transducers in Unitex 5. Save the graph as tag-hyponyms.grf
  • 26. Tagging hyponyms and hypernyms 1. Text > Locate pattern... 2. Set tag-hyponyms.grf 3. Select Merge with input text in Grammar outputs 4. Click Search 5. Build concordance • The matched and tagged texts are stored in the concord.ind file in the corpus folder corpus-ru-dbpedia-short-dea-1000_snt
  • 27. Tagging hyponyms and hypernyms 01. {[Мамонты]=HYPONYM} — вымерший род {[млекопитающих]=HYPERNYM} из семейства слоновых 02. {[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые • We can then use some script to extract tagged hyponyms and hypernyms... • or mine them right in Unitex in the REPLACE mode
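The "some script" mentioned on slide 27 could look like the sketch below. It works on plain strings that carry the `{[...]=HYPONYM}` and `{[...]=HYPERNYM}` tags shown on the slide; it does not assume anything about the layout of concord.ind, and the name `extract_pairs` is hypothetical. Each hyponym is paired with the next hypernym that follows it.

```python
import re

_TAG = re.compile(r"\{\[([^\]]+)\]=(HYPONYM|HYPERNYM)\}")


def extract_pairs(text: str) -> list:
    """Return (hyponym, hypernym) pairs in the order they are tagged."""
    pairs = []
    hyponym = None
    for form, role in _TAG.findall(text):
        if role == "HYPONYM":
            hyponym = form
        elif hyponym is not None:
            pairs.append((hyponym, form))
            hyponym = None
    return pairs


if __name__ == "__main__":
    line = "{[Бук]=HYPONYM} — род широколиственных {[деревьев]=HYPERNYM} семейства Буковые"
    print(extract_pairs(line))
```

The extracted strings are inflected forms as they occur in the text (for example млекопитающих), not lemmas, which is one reason the slides go on to mine the pairs inside Unitex instead.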
  • 28. Mining hyponyms and hypernyms 1. Open match-hyponyms.grf: FSGraph > Open... 2. Click on the first <N> box, right-click on it and select Surround with > Morphological mode 3. Click on the first <N> box and change it to <N>/$hyponym$ to store the matched noun with all morphological information in the $hyponym$ variable
  • 29. Mining hyponyms and hypernyms 4. Modify the second <N> box to store the matched noun in the variable $hypernym$ in the morphological mode 5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the final state 6. Save this graph as mine-hyponyms.grf 7. In Info > Preferences... > Morphological dictionaries add Dela/CISLEXru_igrok.bin
  • 30. Mining hyponyms and hypernyms 30
  • 31. Mining hyponyms and hypernyms 1. Set this graph in Text > Locate pattern... 2. Select Replace recognized sequences in Grammar outputs 3. Click Search. Outputs: 01. млекопитающее: мамонт 02. дерево: бук 03. дерево: бука 04. дерево: Бук
  • 32. Mining hyponyms and hypernyms 1. Why so many Бук outputs? Let's look in the dictionary: DELA > Lookup..., select CISLEXru_igrok.bin and enter this word. Entries found: Бук,.N+FAMN+PN+anim(o)+gen(M):neM; Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d; бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom; бук,.N+anim(j)+gen(M):neM:ajeM
  • 33. Mining hyponyms and hypernyms 2. Let's modify mine-hyponyms.grf to remove ambiguous outputs: change the first <N> box to <N~PN:n>. 2 outputs: 01. млекопитающее: мамонт ("mammal: mammoth") 02. дерево: бук ("tree: beech")
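To see why `<N~PN:n>` leaves a single reading for Бук, the mask can be imitated over three of the dictionary entries from slide 32. This sketch rests on one reading of the mask, stated here as an assumption: the entry must carry the code N, must not carry PN, and at least one of its inflectional codes must contain n. The helper names are hypothetical, and the entry that appears truncated on the slide is left out.

```python
def parse(line: str) -> tuple:
    """Return (lemma, grammatical codes, inflectional codes) for one entry."""
    inflected, rest = line.split(",", 1)
    lemma, rest = rest.split(".", 1)
    codes, *inflections = rest.split(":")
    return lemma or inflected, codes.split("+"), inflections


def matches_n_not_pn_with_n(line: str) -> bool:
    """Imitate <N~PN:n>: code N, no code PN, some inflectional code with 'n'."""
    _, codes, inflections = parse(line)
    return "N" in codes and "PN" not in codes and any("n" in c for c in inflections)


ENTRIES = [
    "Бук,.N+FAMN+PN+anim(o)+gen(M):neM",
    "бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom",
    "бук,.N+anim(j)+gen(M):neM:ajeM",
]

if __name__ == "__main__":
    print([parse(e)[0] for e in ENTRIES if matches_n_not_pn_with_n(e)])
```

The proper-noun entry is rejected by `~PN`, and the бука entry is rejected because neither gm nor aom contains n, so only the lemma бук survives, which agrees with the two outputs listed on the slide.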
  • 34. References 1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.
  • 35. References 3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est Marne-la-Vallée. January 15, 2015, http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf 4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books: 1-16. Available online from http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
  • 36. Text Processing in Unitex • PatternSim (github.com/cental/PatternSim): a tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns • Normatex (github.com/avlukanin/normatex): Russian text normalization for speech synthesis, machine translation and other natural language processing tasks • Unitext Tutorial (github.com/avlukanin/unitextutorial): the slides and source files used in this tutorial
  • 37. Text Processing with Finite State Transducers in Unitex. Artem Lukanin • about.me/alukanin • @avlukanin • artyom.lukanin@gmail.com • Slides: artyom.ice-lc.com/slides/unitextutorial