Morphosyntactic analysis
for stylometry
Silvie Cinková
cinkova@ufal.mff.cuni.cz
COST CA16204 Distant Reading for European Literary History
2018-04-18, Kraków
What does morphosyntax tell you?
Mary knew the fair
young man who looked
like a boy.
Universal Dependencies
• universaldependencies.org
• framework for cross-linguistically consistent
grammatical annotation
• 60+ languages
• 100+ treebanks (syntactically analyzed
corpora)
• parsers!
• all open source
Universal POS-tags
• Language-specific tagsets depend on traditional grammars: excessive
diversity
• UD maps language-specific tagsets to a common scheme.
English - Penn Treebank tagset
Polish - IPIPAN tagset
Raw parser output: CoNLL-U Format
http://universaldependencies.org/format.html
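As a rough illustration of the format, the sketch below splits one hand-written token line (not real parser output) into its ten tab-separated fields and unpacks the FEATS column. The column names follow the CoNLL-U specification; the example line and the variable names are made up for this illustration.

```r
# One hand-written CoNLL-U token line (illustrative, not real parser output).
line <- paste("3", "looked", "look", "VERB", "VBD",
              "Mood=Ind|Tense=Past|VerbForm=Fin", "2", "acl:relcl", "_", "_",
              sep = "\t")

# The ten fields of a CoNLL-U token line, in order.
col_names <- c("id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc")

# Split the line at tabs and label each field.
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
token <- setNames(fields, col_names)

# FEATS is a "|"-separated list of Name=Value pairs.
pairs <- strsplit(strsplit(token[["feats"]], "|", fixed = TRUE)[[1]],
                  "=", fixed = TRUE)
feats <- setNames(vapply(pairs, `[`, character(1), 2),
                  vapply(pairs, `[`, character(1), 1))

print(token[c("form", "lemma", "upos", "deprel")])
print(feats)
```

An underscore in a field means that the field is empty for this token.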
Online services
http://clarin-pl.eu/en/services/
ws-test.clarin-pl.eu
Egipt zdobyć się na taki armia
pracownik i on zawdzięczać swój
wiekopomny dzieło.
UDPipe (incl. MorphoDiTa)
lindat.mff.cuni.cz/en/services#UDPipe
UD traps
• coordinations
• copula predicates
• elided verbs
Look, how smart...
Find the problem
Apposition
How to get your conllu file
• Mostly with these parameters:
– any model except Baseline UD
– Tag and Lemmatize
– Parse
– Advanced options
• Input: Tokenize plain text
• Tick nothing under Tokenizer if your input is plain text without
any tags.
• Your text has to be encoded in UTF-8.
When you have syntactic suspicions...
– Compare occurrences of selected syntactic
phenomena
• number of verbal clauses in a sentence
• tree depth
• multiple attributes
• preference for prepositional noun modifiers or
compounds (cat admirer vs. admirer of cats)
– Extract those phenomena from your texts.
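Two of the measures listed above, tree depth and the number of verbs in a sentence, can be sketched in a few lines of R. This is only a minimal illustration on a made-up toy sentence: it assumes that token IDs are consecutive integers equal to the row number (no multiword-token ranges) and that the root has head 0. The data frame and the function name node_depth are hypothetical.

```r
# Toy sentence with the CoNLL-U columns needed here (values made up for illustration).
sent <- data.frame(
  id   = 1:5,
  form = c("Mary", "knew", "the", "man", "."),
  upos = c("PROPN", "VERB", "DET", "NOUN", "PUNCT"),
  head = c(2, 0, 4, 2, 2),
  stringsAsFactors = FALSE
)

# Depth of one node: number of edges between the node and the root token
# (the root token is the one whose head is 0, so its own depth is 0).
node_depth <- function(i, head) {
  depth <- 0
  while (head[i] != 0) {
    i <- head[i]        # climb to the parent node
    depth <- depth + 1
  }
  depth
}

depths <- vapply(sent$id, node_depth, numeric(1), head = sent$head)
tree_depth <- max(depths)               # depth of the deepest node
n_verbs <- sum(sent$upos == "VERB")     # crude proxy for verbal clauses

print(data.frame(form = sent$form, depth = depths))
print(c(tree_depth = tree_depth, n_verbs = n_verbs))
```

Counting VERB tags is only a crude proxy for the number of verbal clauses; a stricter count would also look at the dependency relations.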
Querying a UD-treebank
• https://lindat.mff.cuni.cz/services/pmltq/#!/home
• PMLTQ (Tree Query Language)
– "draw" the subtree you want to extract from the corpus
– view the results - tweak query - view - tweak - view...
– count them
– or group according to additional criteria and count
groupwise
Pitfalls of Tree Query online
• Tree Query Engine
– UD nodes are called a-node (and a-root)
– deprel => conll/deprel
– upos => conll/cpos
– xpos => conll/pos
• UD versions slightly differ in labeling.
Case: How are copula-verb
complements labeled?
• Follow query here:
http://hdl.handle.net/11346/PMLTQ-ROGZ
Query
http://hdl.handle.net/11346/PMLTQ-ROGZ
1. a-node [, , , ]
2. a-node $blabla := [, , , ]
... and now for real:
a-node $blabla := [,, a-node $blablas_child := [] , ]
a-node $copula_complement :=
[conll/cpos in {"NOUN", "ADJ"}, a-node $copula_verb :=
[conll/deprel = "cop", conll/cpos = "AUX"] ]
Filter query
Learn which deprels the complements had and
how many of which!
• >> introduces the filter:
>> for $my_node.attribute give $1, count()
• give $1 = give me the first column of a table. Here these are the
values of the conll/deprel attribute, that is, the deprels. count() adds
their frequencies as the second column. Mind the dot between node name
and attribute name!
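A comparable count can also be sketched outside PMLTQ, on a token table in R. The sketch below is an assumption-laden illustration, not the query engine's method: it uses two made-up sentences ("Mary is young." and "He was a boy."), a hypothetical sent_id column, and plain UD column names (upos, deprel) instead of the conll/ names of the web client.

```r
# Toy token table for two hand-annotated sentences (illustrative only).
tokens <- data.frame(
  sent_id = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  id      = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  form    = c("Mary", "is", "young", ".", "He", "was", "a", "boy", "."),
  upos    = c("PROPN", "AUX", "ADJ", "PUNCT",
              "PRON", "AUX", "DET", "NOUN", "PUNCT"),
  head    = c(3, 3, 0, 3, 4, 4, 4, 0, 4),
  deprel  = c("nsubj", "cop", "root", "punct",
              "nsubj", "cop", "det", "root", "punct"),
  stringsAsFactors = FALSE
)

# Child nodes of interest: copula verbs.
cop <- tokens[tokens$deprel == "cop" & tokens$upos == "AUX", ]

# Their parents are identified by sentence ID plus the head value.
cop_parent_keys <- paste(cop$sent_id, cop$head)

# Complements: nouns or adjectives that are the parent of a copula verb.
is_complement <- paste(tokens$sent_id, tokens$id) %in% cop_parent_keys &
  tokens$upos %in% c("NOUN", "ADJ")

# Deprels of the complements and their frequencies, most frequent first.
deprel_counts <- sort(table(tokens$deprel[is_complement]), decreasing = TRUE)
print(deprel_counts)
```

In both toy sentences the complement is the root of its sentence, so the table has a single label.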
Querying your own corpus with
PMLTQ locally with TrEd
Inner structure of a ud.node in TrEd
Attributes and values depend
on the language model you
have selected in UDPipe.
Query files
• Save your query file wherever you
want.
• Save it with the first option (PML)
and confirm the warning message
that appears.
• Or create a new file with File ->
New -> Based on the Current
File
- make sure the cursor is in the
query file
• Click the New query button.
• To see your previous queries,
press PageUp.
Suggest
[Screenshot: TrEd windows with numbered callouts 1 to 4]
TrEd installation
• download installer for your OS here:
http://ufal.mff.cuni.cz/tools/tred
Important libraries sometimes have to be installed
manually, like this:
• in your command line, go to the tred directory
cd C:\tred on Windows (put tred directly on C:\, never into
Program Files!)
tred.bat
cpan -T <library name from the error message you've seen>
Repeat for each missing library. Installation takes time!
TrEd configuration (clickable in TrEd)
• Setup -> Manage Extensions
– install:
Start PML Tree Query in TrEd
When you just want to load stuff into
Stylo
• The conllu file is a tab-separated table, so R can read it as a data.frame.
• Select the rows and column(s) you want, and
convert them back to a text file or a plain text
string.
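The two steps above could look roughly like this in R. It is a minimal sketch on a made-up toy table; the object names and the output file name are hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```r
# Toy token table with a few CoNLL-U columns (values made up for illustration).
tokens <- data.frame(
  form  = c("Mary", "knew", "the", "man", "."),
  lemma = c("Mary", "know", "the", "man", "."),
  upos  = c("PROPN", "VERB", "DET", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Select rows: here, everything except punctuation.
words <- tokens[tokens$upos != "PUNCT", ]

# Select one column and collapse it into a single space-separated string.
upos_text  <- paste(words$upos, collapse = " ")
lemma_text <- paste(words$lemma, collapse = " ")

# Write the tag sequence to a plain text file.
out_file <- file.path(tempdir(), "toy_upos.txt")
writeLines(upos_text, out_file)

print(upos_text)
print(lemma_text)
```

The resulting file contains a "text" made of POS tags instead of words, so tag sequences can be counted like ordinary word n-grams.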
Read the conllu file
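# Read a CoNLL-U file into a data frame: one row per token.
bur_garden_df <- read.table(
  file = "burnett_garden_1911.conllu",
  sep = "\t",               # columns are separated by tabs
  fileEncoding = "UTF-8",
  header = FALSE,
  # skips comment lines; note that it also truncates token lines containing "#"
  comment.char = "#",
  blank.lines.skip = TRUE,  # blank lines separate sentences
  quote = "",               # quotation marks are ordinary tokens
  # the CoNLL-U specification names the last two columns DEPS and MISC
  col.names = c("id", "form", "lemma", "upos", "xpos",
                "feats", "head", "deprel", "misc1", "misc2") )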
Should you need sentence IDs
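One possible way to recover sentence IDs, not necessarily the one shown in the talk, relies on the blank line that ends every sentence in a CoNLL-U file. The sketch below is a minimal illustration on two hand-written toy sentences; the helper tok and all object names are hypothetical.

```r
# Helper that joins the ten CoNLL-U fields of one token with tabs.
tok <- function(...) paste(..., sep = "\t")

# Two hand-written toy sentences (illustrative, not real parser output).
conllu_lines <- c(
  "# sent_id = 1",
  tok("1", "Mary", "Mary", "PROPN", "NNP", "_", "2", "nsubj", "_", "_"),
  tok("2", "slept", "sleep", "VERB", "VBD", "_", "0", "root", "_", "_"),
  "",
  "# sent_id = 2",
  tok("1", "Boys", "boy", "NOUN", "NNS", "_", "2", "nsubj", "_", "_"),
  tok("2", "laugh", "laugh", "VERB", "VBP", "_", "0", "root", "_", "_"),
  ""
)

is_blank <- !nzchar(conllu_lines)
is_token <- !is_blank & !grepl("^#", conllu_lines)

# Every blank line closes a sentence, so the running count of blank lines
# plus one gives the sentence number of each line.
sentence_of_line <- cumsum(is_blank) + 1

# Parse only the token lines, then attach the sentence number to each row.
tokens <- read.table(
  text = conllu_lines[is_token], sep = "\t", quote = "", comment.char = "",
  header = FALSE, stringsAsFactors = FALSE,
  col.names = c("id", "form", "lemma", "upos", "xpos",
                "feats", "head", "deprel", "deps", "misc")
)
tokens$sentence_id <- sentence_of_line[is_token]

print(tokens[, c("sentence_id", "id", "form", "deprel")])
```

For a real file, conllu_lines would come from readLines() on the conllu file with encoding = "UTF-8" instead of being typed in.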
UDPipe in R
• https://cran.r-project.org/web/packages/udpipe/
• have your files processed by UDPipe's API
• train your own model

Editor's Notes

  1. NLP tools have become much more accessible for us non-developers and they have also gained ground in the industry and business (data analytics, text mining); language professionals ought to be able to use them. domain knowledge: what information are these tools providing? Which tools for which use cases? technical knowledge: prepare input, process output. More advanced: adapt the tools to your domain
  2. This is a syntactic-dependency tree automatically created from plain text by a syntactic parser. Unit of description: sentence. For each token: form, lemma, diverse morphological categories, and syntactic relation to its parent node.
  3. Relations between nodes: constituency or dependency (and different modifications) Represent text as a mathematical graph -> formalized general structure, known from non-language applications; algorithms for graph processing available before linguistics
  4. Capturing a broad scale from very surface syntax to (formal) semantics Formal grammars (e.g. HPSG, LFG, Generative Grammar, Dependency Syntax, Minimal-Recursion Semantics, Abstract Meaning Representation, Underspecified Semantic Structures...)
  5. Both the tree structure and the tags are very similar. Differences: who / który - see the documentation of each treebank.
  6. IPIPAN: Instytut Podstaw Informatyki Polskiej Akademii Nauk. Syntactic structures: treebanks mostly agree across languages on which of the most common structures to capture (e.g. coordinations, copula constructions, complex verb forms). The approaches differ, but they can mostly be mapped onto one another. There is much less overlap in parts of speech, morphology, and the grey zone between morphology and syntax (e.g. the Czech past tense is either morphological or combined with an analytical form, depending on person, while the English perfect tense is always analytical...). UD: the POS tags are very coarse-grained, but on top of them a set of Universal Features captures the fine-grained information from the original tagsets.
  7. Zoom in on the data structure. Token = node. Relation between 2 nodes = edge. Dependency trees: parent node vs. child node. One parent per node. Where do xpostags come from? Most currently available UD treebanks are rule-based and manually post-edited conversions of language-specific treebanks. So the original tags were preserved, too.
  8. The Conference on Computational Natural Language Learning (CoNLL) organizes parsing competitions for which it provides data, hence the CoNLL format. "-U" stands for Universal. It is actually a tab-separated table, encoded in UTF-8. Each line contains the analysis of one word (aka token, node). Lines starting with # are comments, also used for sentence IDs and text. Columns:
     • ID - token ID (= word order in the sentence)
     • FORM - word form or punctuation symbol
     • LEMMA - dictionary word form (e.g. nominative singular or present active infinitive)
     • UPOS - universal part-of-speech tag
     • XPOS - language-specific part-of-speech tag (e.g. WSJ or BNC tags for English)
     • FEATS - more universal morphological information (e.g. tense or animacy): Universal Features
     • HEAD - ID of the parent node of the node in question
     • DEPREL - dependency relation
     • DEPS - enhanced dependency graph (often left empty as _)
     • MISC - any other information
  9. CLARIN - Common Language Resources and Technology Infrastructure - nodes across Europe. Different quality of services. Some just host data, but others (e.g. Polish and Czech) offer online services! Check out e.g. LEM. Most online services let you upload a file or a corpus and analyze it for you. You can usually also write a script and send your request to the service via its API.
  10. One of the morphological analyzers LEM offers is MorphoDiTa for Polish. However, MorphoDiTa has been trained on many languages and the UD formalism: you get many different languages analyzed in the same way! MorphoDiTa is a POS-tagger and lemmatizer. It is part of the UDPipe parser. To obtain the syntactic relations, you have to call UDPipe. If you only want lemmatization/POS-tagging, you can still use UDPipe and unselect parsing. Select your language model. Some languages have more than one. The models differ in the corpora they have been trained on and in the UD version used. To guess which is best for you, you have to learn more about the corpora. The UD versions differ in details of linguistic description (e.g. whether object should be divided into indirect and direct object). The corpora can differ in xpostags (the language-specific POS tags) and also in some more specific syntactic structures (e.g. (I've made it up!) the more the merrier or eager to please vs. difficult to please). You will get "some" result every time, but the quality will depend on how similar your text is to the training data of the parser. If you have e.g. very old English texts or poetry(!) and you run a parser trained on the Wall Street Journal on them, you are likely to get rubbish. In such a case, if you really need parsing, you have to collaborate with a computational linguist to train a model for your particular needs. And, mind you, it's your job to provide suitable training data. Maybe you will have to do some manual UD annotation on your texts, and the computer scientist will try to minimize the amount needed from you. However, nobody can promise you a safe number of manually annotated sentences to get acceptable results.
  11. These phenomena may be approached in a way different from what you are used to.
  12. The parser interpreted the coordination correctly, although it could easily have said Apples are [good and yesterday] I bought cherries. I didn't even help with a comma before "and". For fun: challenge the parser with Apples are good and Saturday/last Saturday/last weekend I bought cherries and then scale up to ...and last weekend cherries were on sale. You will always get some errors, and most of them are unpredictable, unlike these.
  13. It's the word French that messed it up entirely, as it was interpreted as a noun (Frenchman/Frenchmen). Cf. Germans, which is analyzed correctly again. Of course the parser cannot know that each country normally has only one prime minister, so the plural in ministers was no hint. However, the sentence could be ambiguous even for humans. How many Germans were there at the party, even though we know that Germany has only one prime minister?
  14. This comes from an earlier language description, where we used to have a-nodes and t-nodes.
  15. Why? One author may prefer them in the main clauses (then they will be roots), while another may not use them at all, or everywhere, or never as main clauses. You will compare the distributions of labels in these words in different authors. You want to extract all nodes that are nouns or adjectives and have a child node that is the copula verb. The UD documentation tells you that the copula is always marked by the dependency relation cop and that auxiliary verbs have the uPOS AUX. Maybe stating both is redundant, check. We will build such a tree in our query.
  16. Hopefully you will see this query run online... follow the URL handle on previous slide.
  17. The live query contains one more line saying sort by $2 desc. This sorts the labels according to the frequencies, in the descending order. The frequencies elicited by count() are the second column of the table. Check out the tutorial at ufal.mff.cuni.cz/pmltq/doc/pmltq_tutorial_web_client.html !
  18. Left: Inner structure of the node in TrEd. It corresponds to one line in the raw parser output. You elicit this window if you double-click on a node in the tree view. Live: a TrEd demo. Not captured here, really.
  19. Data scheme differences between PMLTQ in the web GUI and your TrEd installation: a-node => ud.node; conll/deprel => deprel; id => ord; conll/cpos => upostag.
  20. By default, the query file lives in a weird location. On Windows it is AppData/Roaming/... . It feels better to have more intuitive access to a file that costs you so much effort. Here is how to save it elsewhere or create a new file. It is useful to have one file per project, unless you run the same queries over and over! When you double-click on the blue id of the tree query, you can alter the part following the underscore.
  21. When you are not confident using PMLTQ, select a subtree you like in your corpus, transfer it into your query file and modify according to your needs. Click on the puzzle piece in the bottom right corner and select reslt:PMLTQ Results. Focus on the tree window. Select nodes you would like to have in your query by clicking on them and holding the Ctrl key. Double check that the puzzle piece is adjusted appropriately. If everything works, your selected nodes ought to swell and turn orange. Switch focus to the query window. Hit the Suggest button in the middle tool panel, upper left. A window pops up. Untick attribute value pairs you don't want to have in your query (typically word form and lemma if you are after a syntactic structure). Select one of the pasting options.
  22. Restart TrEd, just to be sure. Then open your conllu file with File -> Open. To start the tree query engine, select Macros -> Start Tree Query.
  23. This ought to appear. If not, press c. Select Files (local). Then select your conllu file(s). The corpus must not be too big, otherwise TrEd will crash.
  24. The conllu file has 10 columns. The CoNLL-U specification names the last two DEPS and MISC; I called them misc1 and misc2. You will lose the paragraphs. To keep them, you would have to exploit the comment lines and add another column with a sentence ID or similar.