Morphosyntactic analysis for stylometry

Silvie Cinková
cinkova@ufal.mff.cuni.cz
COST CA16204 Distant Reading for European Literary History
2018-04-18, Kraków

What does morphosyntax tell you?
Mary knew the fair
young man who looked
like a boy.

Universal Dependencies
• universaldependencies.org
• framework for cross-linguistically consistent
grammatical annotation
• 60+ languages
• 100+ treebanks (syntactically analyzed
corpora)
• parsers!
• all open source

Universal POS-tags
• Language-specific tagsets follow traditional grammars, so they differ
widely from language to language.
• UD maps language-specific tagsets to a common scheme.
English: Penn Treebank tagset
Polish: IPI PAN tagset
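
To make the mapping idea concrete, here is a minimal Python sketch that converts a few Penn Treebank tags to universal POS tags. The names `PENN_TO_UPOS` and `to_upos` are invented for this example, and the table is deliberately partial and simplified: several Penn tags (for example IN, TO or the verb tags) correspond to different universal tags depending on context, which a plain lookup cannot capture.

```python
# Partial, simplified lookup table (an illustration, not the official UD conversion).
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "CC": "CCONJ",
    "CD": "NUM",
    "UH": "INTJ",
}


def to_upos(penn_tag):
    """Return the universal tag for a Penn tag, or None if this partial table lacks it."""
    return PENN_TO_UPOS.get(penn_tag)


penn_tags = ["NNP", "JJ", "JJ", "NN", "VBD"]
# Context-dependent tags such as VBD are intentionally left unmapped (None).
print([to_upos(tag) for tag in penn_tags])
```

A real conversion needs context-sensitive rules on top of such a table, which is why it is usually left to the UD tools themselves.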

Raw parser output: CoNLL-U Format
http://universaldependencies.org/format.html
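
As a rough illustration of the format, the Python sketch below reads a tiny CoNLL-U fragment using only the standard library. The sample annotation is hand-written for this example (it is not parser output), and `read_conllu`, `FIELDS`, `ROWS` and `SAMPLE` are hypothetical names. The sketch assumes the ten tab-separated columns of the format and simply skips comment lines, multiword-token ranges and empty nodes.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        columns = line.split("\t")
        # Keep only plain word lines: IDs like "1-2" or "1.1" are skipped.
        if len(columns) != len(FIELDS) or not columns[0].isdigit():
            continue
        current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample; spaces are turned into tabs to build valid lines.
ROWS = [
    "1 Mary Mary PROPN NNP Number=Sing 2 nsubj _ _",
    "2 knew know VERB VBD Tense=Past 0 root _ _",
    "3 the the DET DT Definite=Def 4 det _ _",
    "4 man man NOUN NN Number=Sing 2 obj _ SpaceAfter=No",
    "5 . . PUNCT . _ 2 punct _ _",
]
SAMPLE = ("# text = Mary knew the man.\n"
          + "\n".join(row.replace(" ", "\t") for row in ROWS)
          + "\n\n")

for token in read_conllu(SAMPLE)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```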

Online services
http://clarin-pl.eu/en/services/
ws-test.clarin-pl.eu
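Example output (Polish lemmas): Egipt zdobyć się na taki armia
pracownik i on zawdzięczać swój wiekopomny dzieło.
(Word-by-word gloss: Egypt / muster / oneself / on / such / army /
worker / and / he / owe / one's own / monumental / work.)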

UDPipe (incl. MorphoDiTa)
lindat.mff.cuni.cz/en/services#UDPipe

UD traps
• coordinations
• copula predicates
• elided verbs

How to get your CoNLL-U file
• Mostly with these parameters:
– any model except Baseline UD
– Tag and Lemmatize
– Parse
– Advanced options
• Input: Tokenize plain text
• Tick nothing under Tokenizer if your input is plain text without
any tags.
• Your text has to be encoded in UTF-8.
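
A text stored in a legacy encoding has to be converted before upload. A minimal Python sketch, assuming you already know the source encoding (Windows-1250 appears here only as an example, and `to_utf8` is a hypothetical helper name):

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


# Simulate a legacy-encoded Polish snippet, then convert it.
legacy = "wiekopomne dzieło".encode("cp1250")
converted = to_utf8(legacy, "cp1250")
print(converted.decode("utf-8"))
```

For whole files, the same idea applies: read the file in binary mode, convert, and write the result back in binary mode.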

When you have syntactic suspicions...
– Compare occurrences of selected syntactic
phenomena
• number of verbal clauses in a sentence
• tree depth
• multiple attributes
• preference for prepositional noun modifiers or
compounds (cat admirer vs. admirer of cats)
– Extract those phenomena from your texts.
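
Some of these measures can be computed directly from the head and label columns of a parsed sentence. The Python sketch below does so for a hand-made analysis of the example sentence about Mary; the annotation and the function names are assumptions made for illustration, and counting tokens tagged VERB is only a rough proxy for the number of verbal clauses (copular clauses, for instance, are not counted).

```python
from collections import Counter

# Hand-made analysis of "Mary knew the fair young man who looked like a boy."
# Each tuple: (id, form, upos, head, deprel); head 0 marks the root.
SENTENCE = [
    (1, "Mary", "PROPN", 2, "nsubj"),
    (2, "knew", "VERB", 0, "root"),
    (3, "the", "DET", 6, "det"),
    (4, "fair", "ADJ", 6, "amod"),
    (5, "young", "ADJ", 6, "amod"),
    (6, "man", "NOUN", 2, "obj"),
    (7, "who", "PRON", 8, "nsubj"),
    (8, "looked", "VERB", 6, "acl:relcl"),
    (9, "like", "ADP", 11, "case"),
    (10, "a", "DET", 11, "det"),
    (11, "boy", "NOUN", 8, "obl"),
    (12, ".", "PUNCT", 2, "punct"),
]


def tree_depth(sentence):
    """Number of nodes on the longest path from the root down to a leaf."""
    head_of = {tok_id: head for tok_id, _, _, head, _ in sentence}

    def depth(tok_id):
        steps = 0
        while tok_id != 0:              # climb until we pass the root
            tok_id = head_of[tok_id]
            steps += 1
        return steps

    return max(depth(tok_id) for tok_id in head_of)


def count_verbs(sentence):
    """Rough proxy for verbal clauses: tokens tagged VERB."""
    return sum(1 for _, _, upos, _, _ in sentence if upos == "VERB")


def heads_with_multiple_amod(sentence):
    """Count heads that carry two or more adjectival modifiers (amod)."""
    amod_per_head = Counter(head for _, _, _, head, deprel in sentence
                            if deprel == "amod")
    return sum(1 for n in amod_per_head.values() if n >= 2)


print(tree_depth(SENTENCE), count_verbs(SENTENCE), heads_with_multiple_amod(SENTENCE))
# prints: 5 2 1
```

Averaging such values per text gives simple numeric features that can be compared across authors or works.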

Querying a UD-treebank
• https://lindat.mff.cuni.cz/services/pmltq/#!/home
• PMLTQ (Tree Query Language)
– "draw" the subtree you want to extract from the corpus
– view the results - tweak query - view - tweak - view...
– count them
– or group according to additional criteria and count
groupwise

Pitfalls of Tree Query online
• Tree Query Engine
– UD nodes are called a-node (and a-root)
– deprel => conll/deprel
– upos => conll/cpos
– xpos => conll/pos
• UD versions slightly differ in labeling.

Case: How are copula-verb
complements labeled?
• Follow query here:
http://hdl.handle.net/11346/PMLTQ-ROGZ

Query
http://hdl.handle.net/11346/PMLTQ-ROGZ
1. a-node [, , , ]
2. a-node $blabla := [, , , ]
... and now for real:
a-node $blabla := [,, a-node $blablas_child := [] , ]
a-node $copula_complement := [
  conll/cpos in {"NOUN", "ADJ"},
  a-node $copula_verb := [conll/deprel = "cop", conll/cpos = "AUX"]
]

Filter query
Learn which deprels the complements had, and
how many of each!
• The >> sign introduces the filter:
>> for $my_node.attribute give $1, count()
• give $1 = give me the first column of a table. Its values are the
values of the conll/deprel attribute, that is, the deprels. Mind the
dot between the node name and the attribute name!
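
For readers who prefer to check the logic outside PMLTQ, the same count can be approximated in plain Python. This is only a sketch of the idea, not a reimplementation of the query engine: the toy sentences are hand-made, and `copula_complement_deprels` is a hypothetical name. It finds every AUX token labeled cop whose head is a NOUN or ADJ and tallies the deprel of that head.

```python
from collections import Counter

# Hand-made toy sentences; each token: (id, form, upos, head, deprel).
SENTENCES = [
    [(1, "He", "PRON", 4, "nsubj"), (2, "is", "AUX", 4, "cop"),
     (3, "a", "DET", 4, "det"), (4, "teacher", "NOUN", 0, "root"),
     (5, ".", "PUNCT", 4, "punct")],
    [(1, "She", "PRON", 2, "nsubj"), (2, "said", "VERB", 0, "root"),
     (3, "he", "PRON", 5, "nsubj"), (4, "was", "AUX", 5, "cop"),
     (5, "happy", "ADJ", 2, "ccomp"), (6, ".", "PUNCT", 2, "punct")],
]


def copula_complement_deprels(sentences):
    """Count the deprels of NOUN/ADJ nodes that govern a copula (cop + AUX)."""
    counts = Counter()
    for sentence in sentences:
        by_id = {token[0]: token for token in sentence}
        for _, _, upos, head, deprel in sentence:
            if deprel == "cop" and upos == "AUX":
                parent = by_id.get(head)
                if parent is not None and parent[2] in {"NOUN", "ADJ"}:
                    counts[parent[4]] += 1   # group by the complement's deprel
    return counts


for label, n in copula_complement_deprels(SENTENCES).most_common():
    print(label, n)
```

As in the filter, the result is a two-column table: the deprel of the complement and the number of matches.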

Querying your own corpus with
PMLTQ locally with TrEd

Inner structure of a ud.node in TrEd
Attributes and values depend
on the language model you
have selected in UDPipe.

Query files
• Save your query file wherever you
want.
• Save it as the first option (PML)
and confirm the warning message
that appears.
• Or create a new file via File ->
New -> Based on the Current
File
- make sure the cursor is in the
query file
• Click the New query button.
• To see your previous queries,
press PageUp.

TrEd installation
• Download the installer for your OS here:
http://ufal.mff.cuni.cz/tools/tred
Some required libraries may have to be installed
manually, like this:
• In your command line, go to the tred directory:
cd C:\tred on Windows (put tred directly on C:, never into
Program Files!)
tred.bat
cpan -T <library name from the error message you've seen>
Repeat with each library. It takes time to install!

TrEd configuration (clickable in TrEd)
• Setup -> Manage Extensions
– install:

When you just want to load stuff into
Stylo
• A CoNLL-U file can be read into R as a data.frame.
• Select the rows and column(s) you want, and
convert them back to a text file or a plain text
string.
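
The select-and-convert step can be sketched as follows, here in Python and on hand-made tokens. The function name `column_as_text`, its `skip_upos` parameter and the sample data are assumptions for illustration; the idea is simply to pick one annotation column (lemmas, POS tags, ...) and join it back into running text.

```python
def column_as_text(sentences, column, skip_upos=("PUNCT",)):
    """Join one annotation column into plain text, one sentence per line."""
    lines = []
    for sentence in sentences:
        values = [token[column] for token in sentence
                  if token["upos"] not in skip_upos]
        lines.append(" ".join(values))
    return "\n".join(lines)


# Hand-made tokens with only the columns needed here.
SENTENCES = [
    [{"form": "Mary", "lemma": "Mary", "upos": "PROPN"},
     {"form": "knew", "lemma": "know", "upos": "VERB"},
     {"form": "the", "lemma": "the", "upos": "DET"},
     {"form": "man", "lemma": "man", "upos": "NOUN"},
     {"form": ".", "lemma": ".", "upos": "PUNCT"}],
    [{"form": "He", "lemma": "he", "upos": "PRON"},
     {"form": "looked", "lemma": "look", "upos": "VERB"},
     {"form": "like", "lemma": "like", "upos": "ADP"},
     {"form": "a", "lemma": "a", "upos": "DET"},
     {"form": "boy", "lemma": "boy", "upos": "NOUN"},
     {"form": ".", "lemma": ".", "upos": "PUNCT"}],
]

print(column_as_text(SENTENCES, "lemma"))
print(column_as_text(SENTENCES, "upos"))
```

The resulting string can be written to a UTF-8 text file, one file per text, and then treated like any other plain-text input.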

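Read the CoNLL-U file
bur_garden_df <- read.table(
  file = "burnett_garden_1911.conllu",
  sep = "\t",               # CoNLL-U columns are tab-separated
  fileEncoding = "UTF-8",
  header = FALSE,
  comment.char = "#",       # skip comment lines (also cuts any line at a literal "#")
  blank.lines.skip = TRUE,  # blank lines separate sentences
  quote = "",               # treat quote characters as ordinary text
  col.names = c("id", "form", "lemma", "upos", "xpos",
                "feats", "head", "deprel", "deps", "misc")
)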

UDPipe in R
• https://cran.r-project.org/web/packages/udpipe/
• have your files processed by UDPipe's API
• train your own model
