
Open sonar martinreynaert



poster at the CLARIAH 2016 day

Published in: Science


About OpenSoNaR-CGN

SoNaR-500 and CGN are made accessible through a web application, WhiteLab, which makes it possible to explore and search these collections using the information contained in the metadata and linguistic annotations.

WhiteLab
• Web application for exploring and searching large text collections
• Provides direct access to the texts, audio, transcriptions, and linguistic annotations
• Uses the CQP query language
• Offers user interfaces for novice, advanced, and expert users
• Developed by de Taalmonsters in collaboration with Tilburg University and INL; the current version (2.0) with Radboud University/CLST

Explore
• View the composition of a collection or corpus through the tree map view
• Retrieve statistics: frequency lists of (word) tokens, lemmas, parts of speech, and phonetic forms
• Retrieve n-grams (max. n=5): combinations of words, lemmas, parts of speech, and/or phonetic forms
• Retrieve specific samples (CGN) or documents (SoNaR)

Search
• Selection of a subcorpus by means of metadata filter(s)
• Specification of a search pattern or query involving
  - one or more word(s)
  - POS tag(s)
  - lemma(s)
• Queries make use of CQP; however, users can specify their queries without writing CQP themselves: search patterns formulated in the simple or extended version of the interface are interpreted and converted to CQP automatically

Presentation of results
• Concordance (KWIC), sorted on the basis of lexical information or metadata
• Link to the larger context in which a result was found
• (For CGN data) link to the aligned audio file
• Graphical display of frequencies and other statistics

Export of results
Retrieved lists of (meta)data may be exported in TSV format.

SoNaR-500
• Reference corpus of contemporary written Dutch as encountered in texts originating from the Dutch-speaking language area in the Netherlands and Flanders, as well as Dutch translations published in and targeted at this area
• Comprises 500+ M words (~2 M documents) and includes various genres and text types, including books, magazines, newspapers, discussion fora, web sites, autocues, and subtitles
• Comes with metadata relating to authors and texts
• Linguistic annotations available: POS tagging, lemmatisation

CGN
• Corpus of contemporary spoken standard Dutch as spoken by adults in the Netherlands and Flanders
• ~9 M words (800+ hours of speech), including various types of speech, ranging from prepared monologues to spontaneous conversations
• Audio recordings and orthographic transcriptions
• Phonetic transcriptions are also available for a subset of the data
• Comes with metadata relating to speakers (e.g. gender, age) and recordings
• Linguistic annotations include POS tagging and lemmatisation

OpenSoNaR-CGN was developed by de Taalmonsters in collaboration with Radboud University Nijmegen/CLST, Tilburg University, and INL. We gratefully acknowledge the feedback we received from our user group and the funding provided by CLARIN NL under grant number CLARIN-NL-15-005.

Figures: Tree map view of CGN; Metadata filters for specifying subcorpora; Query specification in “extended” mode; Results presented in the form of a concordance; Query in CQP
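The automatic conversion of simple search patterns to CQP, described under Search, can be illustrated with a minimal sketch. This is not WhiteLab's actual implementation: the function names `token_to_cqp` and `pattern_to_cqp` are hypothetical, and the attribute names `word`, `lemma`, and `pos` (as well as the example tag pattern `N.*`) are assumptions about how the annotation layers are exposed. The sketch only relies on standard CQP syntax, in which each token is written as a bracketed list of `attribute="regex"` constraints joined by `&`, and an empty `[]` matches any token.

```python
from typing import Dict, List


def token_to_cqp(constraints: Dict[str, str]) -> str:
    """Render the constraints on one token as a CQP token expression.

    Values are passed through as regular expressions; only double
    quotes are escaped so the generated query stays well-formed.
    """
    if not constraints:
        return "[]"  # in CQP, an empty bracket pair matches any single token
    parts = []
    for attribute, value in constraints.items():
        escaped = value.replace('"', '\\"')
        parts.append(f'{attribute}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def pattern_to_cqp(tokens: List[Dict[str, str]]) -> str:
    """Join per-token expressions into a CQP query for a token sequence."""
    return " ".join(token_to_cqp(token) for token in tokens)


# Hypothetical pattern: the lemma "groot", then any token, then a noun.
query = pattern_to_cqp([{"lemma": "groot"}, {}, {"pos": "N.*"}])
print(query)
```

A form-based interface could collect one dictionary per token slot and hand the resulting string to the search backend, so novice users never need to write the bracket syntax themselves.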