Text Processing with Finite State Transducers in Unitex

ANALYSIS OF IMAGES, SOCIAL NETWORKS,AND TEXTS
April, 9-11th, 2015, Yekaterinburg
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.

What is Unitex?
• An open-source corpus processor, based on automata-oriented
technology
• mainly developed by Sébastien Paumier at the Institut Gaspard-Monge
(IGM), University of Paris-Est Marne-la-Vallée (France)
• It works on Windows, Linux, Mac OS and other systems
• It has lexical resources for French, English, Greek, Portuguese, Russian,
Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more
• http://www-igm.univ-mlv.fr/~unitex/
2

What is corpus?
A corpus is a collection of pieces of language text in electronic form, selected
according to external criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic research.
Sinclair 2005
“
3

What is Finite State Transducer (FST)?
FST, is a type of ﬁnite automaton which maps between two sets of symbols.
We can visualize an FST as a two-tape automaton that recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc in
the ﬁnite-state machine with two symbol strings, one from each tape.
Jurafsky 2000
“
4

Simple sentence splitting FST
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде. {S} Достигали высоты ...
5

Get your corpus from a text ﬁle in Unitex
1. Run Unitex
• If you are working on Windows, the program will ask you to choose a
personal working directory, which you can change later in
Info>Preferences...>Directories .
2. Select Russian as your working language
• For each language that you will be using, for the ﬁrst time the
program will copy the root directory of that language to your
personal directory, except the dictionaries.
6

Get your corpus from a text ﬁle in Unitex
3. Open corpus-ru-dbpedia-short-dea-1000.csv from the
Corpus subfolder: Text > Open...
4. Preprocess the text
• Apply Sentence.grf in MERGE mode
• Apply Replace.grf in REPLACE mode
• Tokenize the text
• Apply all default dictionaries
• Analyze unknown words as free compound words
7

Preprocessing
• Sentence.grf splits the text into sentences, adding {S} tag before
the next sentence (language dependent)
• Replace.grf removes ¬ (soft hyphen) and converts no-break spaces
to spaces
• The standard separators (the space, the tab and the newline characters)
are normalized
8

Tokenization
• is language (alphabet) dependent
• Newlines in a text are replaced by spaces
• A token can be:
• the sentence delimiter {S}
• the stop marker {STOP} to delimit texts
• a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}
• a contiguous sequence of letters (from alphabet.txt )
• one (and only one) non-letter character, e.g. a digit
9

Applying dictionaries
• consists of building the subset of dictionaries consisting only of forms
that are present in the text
• The corpus becomes "tagged", i.e. every token is assigned all possible
grammatical forms
• e.g. семью assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
10

Hyponyms and hypernyms
Unlike synonymy and antonymy, which are lexical relations between word
forms, hyponymy/hypernymy is a semantic relation between word meanings:
e.g., {maple} is a hyponym of {tree} , and {tree} is a hyponym of {plant} .
Much attention has been devoted to hyponymy/hypernymy (variously called
subordination/superordination, subset/superset, or the ISA relation)...
“
11

Hyponyms and hypernyms
A concept represented by the synset {x, x ,...} is said to be a hyponym of the
concept represented by the synset {y, y ,...} if native speakers of English accept
sentences constructed from such frames as An x is a (kind of) y. The relation
can be represented by including in {x, x ,...} a pointer to its superordinate, and
including in {y, y ,...} pointers to its hyponyms.
Miller 1993
“
12

Hyponym and hypernym mining from
Russian texts
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов .
13

Indicators
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
1. Text > Locate pattern...
2. Type род into Regular expression
3. Select Index all utterances in text in Search limitation
4. Click Search
14

Concordance
• hyponyms and hypernyms are nouns
• вымерший (participle) and широколиственных (adjective) can be
omitted
15

Patterns in Unitex
2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
3. Click Search
2 matches
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
01.
02.
16

Lexical masks
• <род> : matches all the entries that have род as canonical form
• <стать.V> : matches all entries having стать as canonical form and
the grammatical code V
• <V> : matches all entries having the grammatical code V
• {стану,стать.V} or <стану,стать.V> : matches all the entries
having стану as inﬂected form, стать as canonical form and the
grammatical code V
17

Lexical masks.Special symbols
• <E> : the empty word or epsilon. Matches the empty string
• <TOKEN> : matches any token, except the space; used by default for
morphological ﬁlters
• <MOT> : matches any token that consists of letters
• <MIN> : matches any lower-case token
• <MAJ> : matches any lower-case token
• <PRE> : matches any token that starts with a capital letter
18

Lexical masks.Special symbols
• <DIC> : matches any word that is present in the dictionaries of the text
• <SDIC> : matches any simple word in the text dictionaries
• <CDIC> : matches any composed word in the dictionaries of the text
• <TDIC> : matches any tagged token like {XXX,XXX.XXX}
• <NB> : matches any contiguous sequence of digit (1234 is matched but
not 1 234)
• <#> : prohibits the presence of space
19

Graphs in Unitex
• can match text (Finite State Automata)
• can produce new output text (Finite State Transducers)
• in MERGE mode combine the matched input text and the output text
(useful fot tagging)
• in REPLACE mode convert the matched input text into the output
text
20

1. FSGraph > New
2. Click on the initial state (arrow), click inside the empty place while
holding Ctrl to create a new box, connected to the initial state, type <N> ,
press Enter
21

A graph for matching text
3. Create a — box, connected to the <N> box
4. Create a род box, connected to the — box
5. Create a <N> box, connected to the род box
6. Click on the second <N> box, click on the ﬁnal state (a circle with a
square inside) to connect these 2 boxes
7. Create a <V:S> box between the — and род boxes
8. Create a <A>+<!DIC> box between the род and <N> boxes
9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save
22

A graph for matching text
Text > Locate Pattern... , Locate pattern in the form of: Graph, Set
match-hyponyms.grf , Search
23

1. Click on the ﬁrst <N> box (hyponym) and change it to <N>/{[ to add
{[ before the matched noun, when the graph is applied in the MERGE
mode
2. Click on the <N>/{[ and click on the — box to disconnect these boxes
3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.
It will add ]=HYPONYM} after the matched noun
4. Modify the second <N> box for adding a HYPERNYM tag to it
24

5. Save the graph as tag-hyponyms.grf
25

Tagging hyponyms and hypernyms
2. Set tag-hyponyms.grf
3. Select Merge with input text in Grammar outputs
4. Click Search
5. Build concordance
• The matched and tagged texts are stored in the concord.ind
ﬁle in the corpus folder
corpus-ru-dbpedia-short-dea-1000_snt
26

Tagging hyponyms and hypernyms
{[Мамонты]=HYPONYM} — вымерший род
{[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных
{[деревьев]=HYPERNYM} семейства Буковые
• We can then use some script to extract tagged hyponyms and
hypernyms...
• or mine them right in Unitex in the REPLACE mode
01.
02.
27

Mining hyponyms and hypernyms
1. Open match-hyponyms.grf : FSGraph > Open...
2. Click on the ﬁrst <N> box, right-click on it and select
Surround with > Morphological mode
3. Click on the ﬁrst <N> box and change it to <N>/$hyponym$ to store
the matched noun with all morphological information in the
$hyponym$ variable
28

4. Modify the second <N> box to store the matched noun in variable
$hypernym$ in the morphological mode
5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the
ﬁnal state
6. Save this graph as mine-hyponyms.grf
7. In Info > Preferences... > Morphological dictionaries add
Dela/CISLEXru_igrok.bin
29

30

1. Set this graph in Text > Locate pattern...
2. Select Replace recognized sequences in Grammar outputs
3. Click Search
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
01.
02.
03.
04.
31

1. Why so many Бук outputs? Let's see in the dictionary: DELA >
Lookup... , select CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
32

2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:
change the ﬁrst <N> box to <N~PN:n>
2 outputs
млекопитающее: мамонт
дерево: бук
01.
02.
33

References
1. Jurafsky, D., & James, H. (2000). Speech and language processing an
introduction to natural language processing, computational linguistics,
and speech.
2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).
Introduction to wordnet: An on-line lexical database*. International
journal of lexicography, 3(4), 235-244.
34

References
3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est
Marne-la-Vallée. January 15, 2015,
http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
4. Sinclair, J. (2005)."Corpus and Text - Basic Principles" in Developing
Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow
Books: 1-16. Available online from
http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
35

Text Processing in Unitex
• PatternSim (github.com/cental/PatternSim) — a tool for calculation
semantic similarity between words from a text corpus based on lexico-
syntactic patterns
• Normatex (github.com/avlukanin/normatex) — Russian text normalization
for speech synthesis, machine translation and other natural language
processing tasks
• Unitext Tutorial (github.com/avlukanin/unitextutorial) — the slides and
source ﬁles used in this tutorial
36

Text Processing with Finite State
Artem Lukanin
• about.me/alukanin
• @avlukanin
• artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/unitextutorial
37

Text Processing with Finite State Transducers in Unitex

Recommended

Recommended

More Related Content

Similar to Text Processing with Finite State Transducers in Unitex

Similar to Text Processing with Finite State Transducers in Unitex (20)

More from Artem Lukanin

More from Artem Lukanin (20)

Recently uploaded

Recently uploaded (20)

Text Processing with Finite State Transducers in Unitex