CorpusStudio web application
Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International
E.Komen@ru.nl
1. Background
• Existing software:
• CorpusStudio – Windows
• Cesax – Windows
• Successfully used in linguistic research
• Web application version?
• Central location for corpora (‘latest’ version)
• Platform independent: MacOS/Linux/Windows
• Fast parallel processing
2. Formats
• FoLiA XML
• Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx XML
• English historical + SLA
• Caucasian: Chechen, Lak, Lezgi
• Old Welsh
• Dutch
• Additional formats
• Convert via ‘Cesax’ (Alpino, Negra, …)
• Add handler into CorpusStudio
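As a rough illustration of how a per-format handler could be selected, the sketch below picks a handler from the root element of an XML document. The registry, the handler labels, and the fallback label `convert` are hypothetical. The root tags `FoLiA` and `TEI` are assumptions about the two formats and are not taken from the CorpusStudio sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: root element name -> handler label.
# The root tags are assumptions; CorpusStudio's own handlers may differ.
HANDLERS = {
    "FoLiA": "folia",
    "TEI": "psdx",
}

def local_name(tag: str) -> str:
    """Strip an XML namespace written as '{uri}name'."""
    return tag.rsplit("}", 1)[-1]

def pick_handler(xml_text: str) -> str:
    """Return the handler label for a document, or 'convert' if unknown."""
    root = ET.fromstring(xml_text)
    return HANDLERS.get(local_name(root.tag), "convert")
```

In this sketch an unrecognised format falls through to `convert`, standing in for the route of converting the corpus first.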
4. Defining queries
• Definition editor
• Constants
• Functions (XQuery)
• Query editor
• Subcategorization (XQuery)
• Constructor editor
• Execution order
• Options (examples, output, complement)
• Result database feature editor
• XQuery user functions calculate the features
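The execution order set in the constructor editor can be pictured as a small pipeline: each line runs one query on the whole corpus, on the output of an earlier line, or on that line's complement. The sketch below is a simplified model of that idea, not CorpusStudio's implementation. `Step`, `run_pipeline`, and the use of Python predicates in place of XQuery are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Step:
    name: str
    query: Callable[[str], bool]      # stand-in for an XQuery query
    source: Optional[str] = None      # earlier step name; None = whole corpus
    use_complement: bool = False      # take the earlier step's non-matches

def run_pipeline(corpus: List[str], steps: List[Step]) -> Dict[str, List[str]]:
    """Run the steps in order and return the matching items per step."""
    output: Dict[str, List[str]] = {}
    complement: Dict[str, List[str]] = {}
    for step in steps:
        if step.source is None:
            items = corpus
        elif step.use_complement:
            items = complement[step.source]
        else:
            items = output[step.source]
        output[step.name] = [s for s in items if step.query(s)]
        complement[step.name] = [s for s in items if not step.query(s)]
    return output

# Toy corpus of plain strings instead of XML sentences.
corpus = ["the king rode", "rode the king", "she sang"]
steps = [
    Step("hasKing", lambda s: "king" in s),
    Step("kingFirst", lambda s: s.startswith("the king"), source="hasKing"),
    Step("sang", lambda s: "sang" in s, source="hasKing", use_complement=True),
]
result = run_pipeline(corpus, steps)
```

Here `kingFirst` only sees what `hasKing` matched, while `sang` only sees what `hasKing` rejected.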
6. Availability
• CorpusStudio sources (build your own version)
• https://github.com/ErwinKomen
• CLARIN-NL access
• http://www.clarin.nl/node/2095
7. References
Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition). W3C Recommendation. <http://www.w3.org/XML/Query>.
van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3:63-81.
Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12), Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian AS.
[Figure: diagram of the CorpusStudio web application. Labels: User information; Project information; Meta Data Editor; Definition Editor; Query Editor; Constructor Editor; Input Selector; Database feature editor; Definitions; Queries; Corpus Research Project (.crpx); Search service: crpp; Query Executor; Database Creator; Output Monitor; Status; Documents (.xml); Results (.xml); Corpus Research Database (.xml); Result Viewer; Table Viewer; Corpus Viewer; Result Grouping; Standard grouping (.json); Grouping Viewer; Result database; Result dbase Viewer; Result dbase Editor. Connections between components are labelled xml or json.]
3. Corpus Research Projects
• All information for one research project
• Meta information (author, dates, goal)
• Input (language, corpus, filter)
• All definition and query files used
• Execution order
• Optional: result database features
• Exchange
• Upload/download
• Compatible with Windows CorpusStudio
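The contents listed above suggest a simple record structure for a project. The sketch below models it with Python data classes. All class and field names are illustrative assumptions and do not reflect the actual `.crpx` XML schema; JSON is used for serialization only to keep the example short.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class QueryLine:
    order: int    # position in the execution order
    query: str    # name of the query to run at this position

@dataclass
class ResearchProject:
    name: str
    author: str
    goal: str
    language: str
    corpus: str
    definitions: List[str] = field(default_factory=list)    # definition file names
    queries: List[str] = field(default_factory=list)        # query file names
    execution: List[QueryLine] = field(default_factory=list)
    features: List[str] = field(default_factory=list)       # optional result database features

    def to_json(self) -> str:
        """Serialize the whole project, for example before an upload."""
        return json.dumps(asdict(self), indent=2)

# Hypothetical project with one query line.
project = ResearchProject(
    name="SubjectPosition", author="A. Researcher", goal="Find clause-initial subjects",
    language="eng_hist", corpus="sample",
    definitions=["general"], queries=["findSubject"],
    execution=[QueryLine(order=1, query="findSubject")],
)
data = json.loads(project.to_json())
```

Keeping everything in one record is what makes a project easy to exchange as a single file.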
CorpusStudio components
• Meta Data Editor
• Definition Editor
• Input Selector
• Query Editor
• Constructor Editor
• Output Monitor
• Query Executor
• Result Viewer
• Corpus Viewer
• Database feature editor
5. Future
• Grouping editor
• Group output over meta-data categories
• User-definable (XQuery)
• Query/project wizard
• Tabular input of principal components
• Relations, names, feature calculations
• Result database editor
• View and edit result database records
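Grouping output over meta-data categories amounts to counting hits per category value. A minimal sketch, assuming each hit carries a dictionary of meta-data; the hit layout and the field names `period` and `genre` are invented for illustration.

```python
from collections import Counter

def group_hits(hits, category):
    """Count hits per value of one meta-data category."""
    return Counter(hit["meta"].get(category, "unknown") for hit in hits)

# Hypothetical hits; a real result would come from the search service.
hits = [
    {"text": "se cyning rad", "meta": {"period": "OE", "genre": "chronicle"}},
    {"text": "he sang", "meta": {"period": "OE", "genre": "homily"}},
    {"text": "the king rood", "meta": {"period": "ME"}},
]
by_period = group_hits(hits, "period")
by_genre = group_hits(hits, "genre")
```

Hits that lack the chosen category are collected under `unknown` rather than dropped, so the group counts still add up to the total number of hits.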
