Lidia Grigorieva
The Institute of Informatics Problems of the Russian
Academy of Sciences (IPI RAN)
Root != Stem
из — prefix
бир — root
а, тель, ниц — suffixes
а — ending
избирательниц — stem
Dimension reduction
— Dimension reduction is the process of reducing the number of random variables in machine learning tasks. For text, it can be done by:
— Lemmatization: grouping together the inflected forms of a word. Tools: LemmaGen, morpha, pymorphy2, mystem...
— Stemming: reducing inflected words to their word stem. The stem need not be identical to the morphological root of the word. Tools: Snowball, Lovins, Porter, nltk.stem.* ...
— Root extraction: reducing derivatives to their root, i.e. to their shared meaning.
Lemmatization
Mapping from text-word to lemma
Text-word → Lemma
мыла → мыть (verb, 'to wash')
мыла → мыло (noun, 'soap')
Stemming
Mapping from text-word to stem (excluding endings)
лесистый ('wooded') → лесист
лесник ('forester') → лесник
лесничество ('forestry district') → лесничеств
лесничий ('forest warden') → леснич
лесной ('forest', adj.) → лесн
Root extraction
Mapping from lemma to meaning
лесистый → лес ('forest')
лесник → лес
лесничество → лес
лесничий → лес
лесной → лес
5 words to 1 root
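The effect of the two mappings on vocabulary size can be checked with a minimal sketch. The word, stem, and root pairs are copied from the slides above; the helper name `vocabulary_size` is an assumption made for illustration and is not part of the presented system.

```python
# Word-to-stem pairs copied from the stemming slide.
STEMS = {
    "лесистый": "лесист",
    "лесник": "лесник",
    "лесничество": "лесничеств",
    "лесничий": "леснич",
    "лесной": "лесн",
}
# Word-to-root pairs from the root extraction slide: every word maps to "лес".
ROOTS = {word: "лес" for word in STEMS}


def vocabulary_size(tokens, mapping=None):
    """Count distinct features, optionally after mapping each token."""
    if mapping is None:
        return len(set(tokens))
    return len({mapping.get(token, token) for token in tokens})


tokens = list(STEMS)
print(vocabulary_size(tokens))         # distinct surface words
print(vocabulary_size(tokens, STEMS))  # distinct stems
print(vocabulary_size(tokens, ROOTS))  # distinct roots
```

On this toy vocabulary, stemming leaves five distinct features, while root extraction collapses them into one.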
Implementation
— Algorithm: neural network
— Training data: 749 cases
— Cross-validation: 84 cases (10%)
— Test data: 93 cases
— Accuracy: ~0.7
Tasks
— plagiarism detection;
— paraphrase detection;
— textual similarity;
— semantic disambiguation;
— topic modeling;
— text classification;
— text clustering;
— question answering systems;
— building semantic graphs (entities, links and relationships between them).
References
— Рацибурская Л.В. Словарь уникальных морфем современного русского языка. М.: Флинта: Наука, 2009. — 160 с.
— Аванесов Р.И., Ожегов С.И. Морфемно-орфографический
словарь Около 100 000 слов / А. Н. Тихонов. — М.: АСТ:
Астрель, 2002. — 704 с.
— Тихонов А.Н. Морфемно-орфографический словарь русского
языка, 2002.
— Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского
языка Ок. 52000 слов. — М.: Рус. яз., 1986. — 1132 с.
— http://old.kpfu.ru/infres/slovar1/begall.htm
— http://snowball.tartarus.org/algorithms/russian/stemmer.html,
http://snowballstem.org/demo.html
Effective Paraphrase Expansion in Addressing
Lexical Variability
Vasily Konovalov, Meni Adler, Ido Dagan
Department of Computer Science
Bar-Ilan University, Israel
The 5th conference on Artificial Intelligence and Natural
Language
Problem
Lexical Variability
From the Negochat negotiation dialogue corpus:
‘Reject’: “I disagree”, “I reject your proposal”, “it’s not
accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the
programmer position”, “I propose you a pension of 10%”.
Solution
Translation-based paraphrase expansion
SENTENCE → MT1 (e.g. Google) → pivot language (PL) → MT2 (e.g. Yandex) → PARAPHRASE
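The translation-based scheme can be sketched as a round trip through a pivot language. This is a minimal illustration, not the authors' implementation: `expand_paraphrases`, `toy_mt1`, and `toy_mt2` are hypothetical names, and the two toy functions are lookup tables standing in for real MT engines.

```python
from typing import Callable, List

# A translator takes (text, source_language, target_language) and returns text.
Translator = Callable[[str, str, str], str]


def expand_paraphrases(sentence: str, pivots: List[str],
                       to_pivot: Translator, from_pivot: Translator,
                       source: str = "en") -> List[str]:
    """Translate to each pivot language with one engine and back with another."""
    paraphrases: List[str] = []
    for pivot in pivots:
        pivot_text = to_pivot(sentence, source, pivot)
        back = from_pivot(pivot_text, pivot, source)
        # Keep only round trips that actually changed the wording.
        if back.lower() != sentence.lower() and back not in paraphrases:
            paraphrases.append(back)
    return paraphrases


# Toy stand-ins for MT engines (assumptions, not real API calls).
def toy_mt1(text: str, src: str, tgt: str) -> str:
    table = {("I disagree", "en", "hu"): "Nem értek egyet"}
    return table.get((text, src, tgt), text)


def toy_mt2(text: str, src: str, tgt: str) -> str:
    table = {("Nem értek egyet", "hu", "en"): "I do not agree"}
    return table.get((text, src, tgt), text)


result = expand_paraphrases("I disagree", ["hu", "fr"], toy_mt1, toy_mt2)
print(result)
```

In this toy run only the Hungarian round trip yields a new wording; the French one returns the input unchanged and is dropped.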
Our research questions
◮ What is the ‘best’ performing language? Why is it actually
the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?
Our research settings
Languages: Portuguese, French, German, Hebrew, Russian,
Arabic, Finnish, Chinese, Hungarian.
MT engines: Google Translate API, Microsoft Translator Text
API, Yandex Translate API.
Our findings
◮ Among the tested languages, Hungarian is the ‘best’ performing one.
◮ The performance of a language correlates well with the
averaged smoothed BLEU.
◮ A language that generates the most lexically dissimilar
paraphrases is the ‘best’ performing language.
◮ The differences between MT engines are insignificant
according to the averaged smoothed BLEU and are not
reflected in evaluation.
◮ The language family relations are reflected in averaged
smoothed BLEU.
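As a rough illustration of measuring lexical dissimilarity with a smoothed sentence-level BLEU, here is a minimal sketch. The slides do not say which smoothing variant was used; the add-one smoothing for n-gram orders 2 and above is an assumption, as are the function names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n >= 2 (assumed variant)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if matches == 0:
                return 0.0
            precision = matches / total
        else:
            precision = (matches + 1) / (total + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


def average_bleu(pairs):
    """Mean smoothed BLEU over (original, paraphrase) pairs."""
    return sum(smoothed_bleu(ref, hyp) for ref, hyp in pairs) / len(pairs)


same = smoothed_bleu("I reject your proposal", "I reject your proposal")
different = smoothed_bleu("I reject your proposal", "I decline your offer")
print(same, different)
```

Under this reading, a lower average score between originals and their paraphrases means the pivot language produces more lexically dissimilar paraphrases.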
Come and see our poster
RESEARCHING
QUANTITATIVE
CHARACTERISTICS OF
SHORT TEXTS: SCIENTIFIC,
NEWS, USE WRITINGS
■ For data analysis, we used several text collections.
■ Scientific texts: a collection of papers from the Dialogue conference (2003-2006) and from Corpus Linguistics.
■ News: a collection of short mass-media articles from sources such as Lenta.ru, Rossiyskaya Gazeta, RBC, Nezavisimaya Gazeta, and Compulenta.
■ Unified State Examination (USE) essays: we created separate collections, a "reference" collection of essays written by experts and a second collection of essays written by students.
■ For the study we selected the most representative characteristics: entropy, readability, lexical diversity, verbal, autosem (all words except function words), and frequencies (the share of the words in a text that belong to the hundred most frequent words of Russian).
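Three of these characteristics can be sketched in a few lines. The slides do not give the exact formulas, so the definitions below (Shannon entropy over word frequencies, type-token ratio for lexical diversity, and the share of tokens found in a list of top-frequency words) are assumptions, and the English toy text only stands in for Russian data.

```python
import math
from collections import Counter


def entropy(tokens):
    """Shannon entropy (bits) of the word frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by all words."""
    return len(set(tokens)) / len(tokens)


def frequency_share(tokens, top_words):
    """Share of tokens that belong to a list of most frequent words."""
    return sum(1 for token in tokens if token in top_words) / len(tokens)


tokens = "the cat sat on the mat and the dog sat too".split()
top_words = {"the", "and", "on"}  # placeholder for a real top-100 list

print(entropy(tokens))
print(lexical_diversity(tokens))
print(frequency_share(tokens, top_words))
```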
[Bar charts comparing four collections (USE expert, USE students, News, Scientific) on six characteristics: Entropy (axis 0 to 14), Readability (0 to 0.25), Lexical Diversity (0 to 0.7), Verbal (0.136 to 0.154), Autosem (0.68 to 0.8), Frequencies (0 to 0.3).]
Building a Lexicon-Based
Lemmatizer for Old Irish
Oksana Dereza
oksana.dereza@gmail.com
Old Irish: Grammar
• Changes can occur in any part of the word
o beginning: mutations
o middle: infixed pronouns
o end: inflections
caraid ‘he / she / it loves’
rob-car-si ‘she has loved you’
• Forms within one paradigm can look very different (esp. verbal ones)
do-beir ‘gives, brings’
ní t(h)abair ‘does not give, bring’
Old Irish: Orthography
• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled either with or without a hyphen or a whitespace
• In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them
⇨ a great number of possible spellings for every form
Consonant | Mutated consonant
b | bh, mb
c | ch, gc, cc
d | dh, nd
f | fh, ḟ, ḟh, bhf
g | gh, ng
l | ll, l-l
m | mh, mm, m-m
n | nn
p | ph, bp
r | rr
s | sh, ṡ, ss, ts, s-s
t | th, dt
Data
• Dictionary of the Irish Language (DIL)
43,345 entries ⇨ 79,140 unique forms
• Corpus
125 texts, 831,280 tokens
• Gold standard
50 random sentences from the test corpus, 840 tokens
• Not only classical Old Irish
The corpus covers the 7th to 16th centuries
Problems
• DIL covers only ~ 41% of unique forms in the corpus
• Many contracted forms, but no unified system of contractions
• Inconsistent use of markup and punctuation
caraid
Cite this: eDIL s.v. caraid or dil.ie/8212
Forms: -carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, -charam, carait, charaíd, -carat, cartae, cardda, carda, carde, cartar, carad, caram, carid, -carid, -carad, carad, carthae, -chartais, carddais, cardáis, care, -charae, -carae, cara, -rochra, -chara, cara, -carat, -carad, -charad, cechar, -cechra, -cechra, cechras, -chechrat, -cechrainn, carais, carois, -cair, carsait, carsat, charus, rob-car-si, ro-car, arro-car, char, rondob-carsam-ni, charsat, charsad, ros-carsat, serc, carthain, carthi
weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402). Ind. pres. 1 s. -carim, Wb. 5c7. -cairim, 23c12. caraim, Thes. ii 293.16. -caraim, Ml. 79d1. -caru, Fél. Ep. 311. 2 s. -cari, Wb. 6c8. 3 s. carid, Wb. 25d5. caraid, Ml. 75c4. -cara, Wb. 27d9. With suff. pron. 3 s. m. carthai, Fráech 10. Rel. caras, Wb. 25c19. Ml. 91b17. charas, 30c3. caris, Thes. ii 247.4. Pass. rel. carthar, Ml. 75c4. Sg. 193b3. 196b4. <…>
(a) loves (persons): nád carad som Iudeiu, Wb. 4d17. carad uir mulierem, 22c19. carsus fiadhu, Snedg. u. Mac R. 11.5. rot charus ar th'airscélaib I have fallen in love with thee, LU 6084 (TBC). nít charadar nít tágedar, TBC 2032 = -chara, LU 5797. car do chomnesam amal no-t-cara fén = dilige proximum, PH 5837. gé no charfuinn fiche fear, KMMisc. 362.7. a fhir Chola charuid mná `beloved of women', Sc.G. St. iv 62 § 10. ní charabh bean tsean ná óg, Dánta Gr. 78.11. <…>
Lemmatizer
• Two methods for out-of-vocabulary (OOV) words
o Baseline: return the demutated form
o Predict a lemma using a modified Damerau-Levenshtein distance
• Disambiguation
o For homonymous forms, the lemma with the highest lexical probability is chosen
o The probability of a lemma is the sum of the probabilities of its forms, and the probability of a form is estimated from its frequency count in the corpus
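The disambiguation rule can be illustrated with a minimal sketch that scores each lemma by the summed corpus counts of its forms. The tiny lexicon, the second lemma `carae`, and all counts are invented for illustration; the function names are assumptions and do not come from the published source code.

```python
def lemma_scores(lexicon, form_counts):
    """Score each lemma by the summed corpus counts of its forms."""
    return {
        lemma: sum(form_counts.get(form, 0) for form in forms)
        for lemma, forms in lexicon.items()
    }


def disambiguate(form, lexicon, form_counts):
    """Pick the highest-scoring lemma among those that list the form."""
    scores = lemma_scores(lexicon, form_counts)
    candidates = [lemma for lemma, forms in lexicon.items() if form in forms]
    if not candidates:
        return None
    return max(candidates, key=lambda lemma: scores[lemma])


# Invented toy data: "cara" is a homonymous form shared by two lemmas.
lexicon = {
    "caraid": ["caraid", "carid", "cara", "carad"],
    "carae": ["cara", "carat"],
}
form_counts = {"caraid": 10, "carid": 4, "cara": 6, "carad": 3, "carat": 2}

print(lemma_scores(lexicon, form_counts))
print(disambiguate("cara", lexicon, form_counts))
```

Raw counts are used instead of normalized probabilities here; dividing every count by the corpus size would not change which lemma wins.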
Predicting lemmas for OOV-words
• Generate all possible strings at edit distance 1 and 2
• Look them up in the dictionary
• Add the real words to the candidate list
• Filter the candidates by the first character
“If the unknown word starts with a vowel, the candidate should also start with a vowel, and if the unknown word starts with a consonant, the candidate should start with the same consonant”
• The lemma of the candidate with the highest lexical probability (i.e. frequency count in the corpus) is taken as the lemma for the unknown word
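The steps above can be sketched as follows. This uses plain Damerau-Levenshtein style edits (deletion, transposition, replacement, insertion); the slides do not describe the authors' modification, so none is reproduced here. The alphabet, the toy dictionary, the counts, and the function names are assumptions.

```python
VOWELS = set("aeiouáéíóú")
ALPHABET = "abcdefghilmnoprstuáéíóú"  # assumed character inventory


def edits1(word, alphabet=ALPHABET):
    """All strings one edit away from the word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits_up_to_2(word, alphabet=ALPHABET):
    """Strings reachable by applying one or two single edits."""
    first = edits1(word, alphabet)
    second = {e2 for e1 in first for e2 in edits1(e1, alphabet)}
    return first | second


def same_onset(unknown, candidate):
    """First-character filter quoted on the slide."""
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]


def predict_lemma(unknown, form_to_lemma, form_counts):
    """Lemma of the most frequent dictionary form near the unknown word."""
    candidates = [
        form for form in edits_up_to_2(unknown)
        if form in form_to_lemma and same_onset(unknown, form)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda form: form_counts.get(form, 0))
    return form_to_lemma[best]


# Toy dictionary with invented counts; the forms appear on the slides.
form_to_lemma = {"eólas": "eólas", "ceist": "ceist"}
form_counts = {"eólas": 12, "ceist": 7}

print(predict_lemma("eólais", form_to_lemma, form_counts))
print(predict_lemma("cheast", form_to_lemma, form_counts))
```

Generating every string within two edits is simple but grows quickly with word length and alphabet size; searching the dictionary with a distance function is a common alternative.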
Evaluation
Lexicon | Forms | ‘Recall’
DIL forms only | 79,140 | 74.7 %
DIL + 1000 most frequent OOV-words | 80,206 | 80.0 %
! 4,889 homonymous forms

 | Baseline | Predicted lemmas
Lemmatized correctly | 483 / 840 | 552 / 840
Accuracy | 57.50 % | 65.71 %
Evaluation
Tokens | 840
Known words | 654
Unknown words | 186
Lemmatized correctly | 552
Lemmas predicted for unknown words | 157
Predicted correctly | 84
Predicted incorrectly | 68
Several lemmas predicted including the correct one, but the wrong one is chosen | 5
~ 60 % of lemmas are predicted correctly
Predicted lemmas
Token | Best candidate from closest dictionary forms | Best candidate’s lemma | Chosen lemma
+ eólais | eólas | eólas | eólas
+ fiarfaigid | fíarfaigid | fíarfaigid, íarmi-foich | íarmi-foich
+ cheast | ceist | ceist | ceist
* déa | dia | dá, de, do, día | de
+ bréithir | bréthir | bríathar | bríathar
– n-uaill | aill | aile, aill, all, aille | aile
– chuain | cain | cain, canaid, cani, caingen | canaid
– christ | ceist | ceist | ceist
– caeme | caíme | caíme | caíme
– chniss | cliss | cles | cles
Source Code & Corpora
Source code
https://github.com/ancatmara/old_irish_lemmatizer
Texts
https://github.com/ancatmara/old_irish_corpora
Extraction of Social
Networks from Literary Text
Tsygankova Viktoria,
National Research University
Higher School of Economics, Moscow
NovelGraphs
A tool for automatic annotation of texts and for extracting social networks of characters from them, where nodes represent characters and edges represent relations between them.
It can also analyze the structural balance of the resulting graphs.
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde. Nodes: prince paradox, duke de valentinois, henry wotton, narborough, borgia, filippo, hallward, louis xii, lady henry, erskine, adrian, gian maria visconti, romeo, gray, mercutio, ruxton]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle. Nodes: lestrade, gregson, murcher, rance, holmes, narrator, eph stangerson]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle, with sentiment]
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde, with sentiment]
Conclusions
• NovelGraphs, a tool for English-language literary fiction, uses a new approach to extracting characters and the connections between them.
• Nodes represent characters found in the text, and edges connect them to the other characters with whom they interact.
• At the moment, combinations of extractors and aggregators detect characters better than the interactions between them.
• Analysis of structural balance identifies key passages of the text, which correspond to the minima and maxima on the balance plot.
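Structural balance of a signed character graph can be illustrated with the classical triangle criterion: a triangle is balanced when the product of its edge signs is positive. The slides do not say how NovelGraphs computes balance, so this is only a sketch under that assumption; the edge signs below are invented, and `balanced_triangle_ratio` is a hypothetical name.

```python
from itertools import combinations


def balanced_triangle_ratio(signs):
    """Share of balanced triangles in a signed graph.

    signs maps frozenset({a, b}) to +1 (positive) or -1 (negative).
    Returns None when the graph has no triangles.
    """
    nodes = sorted({node for edge in signs for node in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in signs for edge in edges):
            total += 1
            product = signs[edges[0]] * signs[edges[1]] * signs[edges[2]]
            balanced += product > 0
    return balanced / total if total else None


# Invented sentiment signs between characters of "A Study in Scarlet".
signs = {
    frozenset(("holmes", "narrator")): 1,
    frozenset(("holmes", "lestrade")): 1,
    frozenset(("narrator", "lestrade")): 1,
    frozenset(("lestrade", "gregson")): -1,
    frozenset(("holmes", "gregson")): 1,
}

print(balanced_triangle_ratio(signs))
```

Computing this ratio over a sliding window of the text would give a balance curve whose minima and maxima could then be inspected.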
Thanks for watching!
Are the results of your corpus
research really reliable?
Getting automatic result analysis on
GICR.
Tatiana Shavrina, Daniil Selegey
AINL FRUCT, SPb, 12.11.2016
Big Corpora Problem:
1. Billions of words, mostly coming from social media
2. Instances per million (IPM) and search results in KWIC format alone don’t tell you whether the results are biased
3. Many metatext attributes (URLs, doc IDs, author IDs, region, gender, genre, etc.) are all potential sources of bias
Users need corpus tools that show full statistics for the search results, so that they can check them for homogeneity with the whole corpus.
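One simple way to check search results for homogeneity with the whole corpus is a chi-square goodness-of-fit statistic over a metatext attribute. This is a sketch of the general idea, not the method used in GICR; the counts are invented, and the function name is an assumption.

```python
def chi_square(sample_counts, corpus_counts):
    """Chi-square statistic of a sample against corpus proportions.

    Assumes every corpus category has a non-zero count; sample categories
    that are absent from the corpus are ignored.
    """
    n_sample = sum(sample_counts.values())
    n_corpus = sum(corpus_counts.values())
    statistic = 0.0
    for category, corpus_n in corpus_counts.items():
        expected = n_sample * corpus_n / n_corpus
        observed = sample_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts of documents by author gender.
corpus = {"female": 600, "male": 400}
search_results = {"female": 90, "male": 10}

statistic = chi_square(search_results, corpus)
print(statistic)
# With two categories there is one degree of freedom; values above 3.841
# reject homogeneity at the 0.05 level.
print(statistic > 3.841)
```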
Our solution:
Search results analysis right in the interface!
See you at our
Demo stand!

AINL 2016: Grigorieva
