Text Processing with Finite State Transducers in Unitex
1. ANALYSIS OF IMAGES, SOCIAL NETWORKS,AND TEXTS
April, 9-11th, 2015, Yekaterinburg
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
This work is partially supported by the RFH grant #13-04-12020
“New open electronic thesaurus for Russian”.
2. What is Unitex?
• An open-source corpus processor, based on automata-oriented
technology
• mainly developed by Sébastien Paumier at the Institut Gaspard-Monge
(IGM), University of Paris-Est Marne-la-Vallée (France)
• It works on Windows, Linux, Mac OS and other systems
• It has lexical resources for French, English, Greek, Portuguese, Russian,
Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more
• http://www-igm.univ-mlv.fr/~unitex/
2
3. What is corpus?
A corpus is a collection of pieces of language text in electronic form, selected
according to external criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic research.
Sinclair 2005
“
3
4. What is Finite State Transducer (FST)?
FST, is a type of finite automaton which maps between two sets of symbols.
We can visualize an FST as a two-tape automaton that recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc in
the finite-state machine with two symbol strings, one from each tape.
Jurafsky 2000
“
4
5. Simple sentence splitting FST
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде. {S} Достигали высоты ...
5
6. Get your corpus from a text file in Unitex
1. Run Unitex
• If you are working on Windows, the program will ask you to choose a
personal working directory, which you can change later in
Info>Preferences...>Directories .
2. Select Russian as your working language
• For each language that you will be using, for the first time the
program will copy the root directory of that language to your
personal directory, except the dictionaries.
6
7. Get your corpus from a text file in Unitex
3. Open corpus-ru-dbpedia-short-dea-1000.csv from the
Corpus subfolder: Text > Open...
4. Preprocess the text
• Apply Sentence.grf in MERGE mode
• Apply Replace.grf in REPLACE mode
• Tokenize the text
• Apply all default dictionaries
• Analyze unknown words as free compound words
7
8. Preprocessing
• Sentence.grf splits the text into sentences, adding {S} tag before
the next sentence (language dependent)
• Replace.grf removes ¬ (soft hyphen) and converts no-break spaces
to spaces
• The standard separators (the space, the tab and the newline characters)
are normalized
8
9. Tokenization
• is language (alphabet) dependent
• Newlines in a text are replaced by spaces
• A token can be:
• the sentence delimiter {S}
• the stop marker {STOP} to delimit texts
• a lexical tag, e.g. {ЮУрГУ,.N+ORG+gen(M)}
• a contiguous sequence of letters (from alphabet.txt )
• one (and only one) non-letter character, e.g. a digit
9
10. Applying dictionaries
• consists of building the subset of dictionaries consisting only of forms
that are present in the text
• The corpus becomes "tagged", i.e. every token is assigned all possible
grammatical forms
• e.g. семью assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
10
11. Hyponyms and hypernyms
Unlike synonymy and antonymy, which are lexical relations between word
forms, hyponymy/hypernymy is a semantic relation between word meanings:
e.g., {maple} is a hyponym of {tree} , and {tree} is a hyponym of {plant} .
Much attention has been devoted to hyponymy/hypernymy (variously called
subordination/superordination, subset/superset, or the ISA relation)...
“
11
12. Hyponyms and hypernyms
A concept represented by the synset {x, x ,...} is said to be a hyponym of the
concept represented by the synset {y, y ,...} if native speakers of English accept
sentences constructed from such frames as An x is a (kind of) y. The relation
can be represented by including in {x, x ,...} a pointer to its superordinate, and
including in {y, y ,...} pointers to its hyponyms.
Miller 1993
“
12
13. Hyponym and hypernym mining from
Russian texts
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов .
13
14. Indicators
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
1. Text > Locate pattern...
2. Type род into Regular expression
3. Select Index all utterances in text in Search limitation
4. Click Search
14
15. Concordance
• hyponyms and hypernyms are nouns
• вымерший (participle) and широколиственных (adjective) can be
omitted
15
16. Patterns in Unitex
1. Text > Locate pattern...
2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
3. Click Search
2 matches
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
01.
02.
16
17. Lexical masks
• <род> : matches all the entries that have род as canonical form
• <стать.V> : matches all entries having стать as canonical form and
the grammatical code V
• <V> : matches all entries having the grammatical code V
• {стану,стать.V} or <стану,стать.V> : matches all the entries
having стану as inflected form, стать as canonical form and the
grammatical code V
17
18. Lexical masks.Special symbols
• <E> : the empty word or epsilon. Matches the empty string
• <TOKEN> : matches any token, except the space; used by default for
morphological filters
• <MOT> : matches any token that consists of letters
• <MIN> : matches any lower-case token
• <MAJ> : matches any lower-case token
• <PRE> : matches any token that starts with a capital letter
18
19. Lexical masks.Special symbols
• <DIC> : matches any word that is present in the dictionaries of the text
• <SDIC> : matches any simple word in the text dictionaries
• <CDIC> : matches any composed word in the dictionaries of the text
• <TDIC> : matches any tagged token like {XXX,XXX.XXX}
• <NB> : matches any contiguous sequence of digit (1234 is matched but
not 1 234)
• <#> : prohibits the presence of space
19
20. Graphs in Unitex
• can match text (Finite State Automata)
• can produce new output text (Finite State Transducers)
• in MERGE mode combine the matched input text and the output text
(useful fot tagging)
• in REPLACE mode convert the matched input text into the output
text
20
21. 1. FSGraph > New
2. Click on the initial state (arrow), click inside the empty place while
holding Ctrl to create a new box, connected to the initial state, type <N> ,
press Enter
21
22. A graph for matching text
3. Create a — box, connected to the <N> box
4. Create a род box, connected to the — box
5. Create a <N> box, connected to the род box
6. Click on the second <N> box, click on the final state (a circle with a
square inside) to connect these 2 boxes
7. Create a <V:S> box between the — and род boxes
8. Create a <A>+<!DIC> box between the род and <N> boxes
9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save
22
23. A graph for matching text
Text > Locate Pattern... , Locate pattern in the form of: Graph, Set
match-hyponyms.grf , Search
23
24. Transducers in Unitex
1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add
{[ before the matched noun, when the graph is applied in the MERGE
mode
2. Click on the <N>/{[ and click on the — box to disconnect these boxes
3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.
It will add ]=HYPONYM} after the matched noun
4. Modify the second <N> box for adding a HYPERNYM tag to it
24
26. Tagging hyponyms and hypernyms
1. Text > Locate pattern...
2. Set tag-hyponyms.grf
3. Select Merge with input text in Grammar outputs
4. Click Search
5. Build concordance
• The matched and tagged texts are stored in the concord.ind
file in the corpus folder
corpus-ru-dbpedia-short-dea-1000_snt
26
27. Tagging hyponyms and hypernyms
{[Мамонты]=HYPONYM} — вымерший род
{[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных
{[деревьев]=HYPERNYM} семейства Буковые
• We can then use some script to extract tagged hyponyms and
hypernyms...
• or mine them right in Unitex in the REPLACE mode
01.
02.
27
28. Mining hyponyms and hypernyms
1. Open match-hyponyms.grf : FSGraph > Open...
2. Click on the first <N> box, right-click on it and select
Surround with > Morphological mode
3. Click on the first <N> box and change it to <N>/$hyponym$ to store
the matched noun with all morphological information in the
$hyponym$ variable
28
29. Mining hyponyms and hypernyms
4. Modify the second <N> box to store the matched noun in variable
$hypernym$ in the morphological mode
5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the
final state
6. Save this graph as mine-hyponyms.grf
7. In Info > Preferences... > Morphological dictionaries add
Dela/CISLEXru_igrok.bin
29
31. Mining hyponyms and hypernyms
1. Set this graph in Text > Locate pattern...
2. Select Replace recognized sequences in Grammar outputs
3. Click Search
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
01.
02.
03.
04.
31
32. Mining hyponyms and hypernyms
1. Why so many Бук outputs? Let's see in the dictionary: DELA >
Lookup... , select CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
32
33. Mining hyponyms and hypernyms
2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:
change the first <N> box to <N~PN:n>
2 outputs
млекопитающее: мамонт
дерево: бук
01.
02.
33
34. References
1. Jurafsky, D., & James, H. (2000). Speech and language processing an
introduction to natural language processing, computational linguistics,
and speech.
2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).
Introduction to wordnet: An on-line lexical database*. International
journal of lexicography, 3(4), 235-244.
34
35. References
3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est
Marne-la-Vallée. January 15, 2015,
http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
4. Sinclair, J. (2005)."Corpus and Text - Basic Principles" in Developing
Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow
Books: 1-16. Available online from
http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
35
36. Text Processing in Unitex
• PatternSim (github.com/cental/PatternSim) — a tool for calculation
semantic similarity between words from a text corpus based on lexico-
syntactic patterns
• Normatex (github.com/avlukanin/normatex) — Russian text normalization
for speech synthesis, machine translation and other natural language
processing tasks
• Unitext Tutorial (github.com/avlukanin/unitextutorial) — the slides and
source files used in this tutorial
36
37. Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
• about.me/alukanin
• @avlukanin
• artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/unitextutorial
37