Corpus-écrits General Assembly (AG), 21 November
Consortium Corpus-écrits 
SIG 
TEI-CMC 
Open Resources and 
TOols for LANGuage 
http://comere.org 
http://hdl.handle.net/11403/comere 
Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, 
Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, 
Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
http://www.tei-c.org/Activities/SIG/CMC/ 
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
Our subject and goals 
Our subject:
• building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
This resource must therefore be freely accessible (open
access research data) so that it can be reused by research
communities.
We will come back to this point later.
Our subject and goals 
Computer-mediated communication (CMC): 
All genres of interpersonal communication mediated 
through computer networks (the internet) and used 
via personal computers and/or mobile devices: chats, 
online forums, instant messaging, tweets, comments 
on weblogs, discussions in wikis and on “social network”
sites, interactions in multimodal communication
environments such as Skype, MMORPGs or “virtual
worlds” (e.g., Second Life), SMS, WhatsApp, ...
Our subject and goals 
Our subject:
• building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
Our vision: These corpora shall be …
• interoperable (i) with each other and (ii) with other types of
linguistic corpora (text corpora, speech corpora)
• represented in conformance with established encoding standards in
the field of Digital Humanities
• linguistically annotated in order to allow for sophisticated
queries and language-focused research
Our subject and goals 
The problem / challenge:
• So far, there are no established standards for the
representation of CMC genres
• Established standards for the representation of text genres do
not include models for the representation of the peculiarities of
CMC
• “Off the shelf” NLP tools for automatic linguistic analysis and
annotation (tokenizers, part-of-speech taggers, lemmatizers,
normalizers, parsers) do not perform well on CMC data
(because they have usually been trained on edited text and
therefore can’t handle “non-standard” phenomena and
multimodal elements in CMC discourse)
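To illustrate the last point, here is a minimal Python sketch. It is not taken from the project: the example message and both regular expressions are assumptions made for illustration. It contrasts a naive word/punctuation tokenizer with a pattern that keeps a few typical CMC items (URLs, @mentions, hashtags, simple emoticons) together.

```python
import re

# Naive pattern in the style of tokenizers designed for edited text:
# runs of word characters, or any single non-space symbol.
NAIVE = re.compile(r"\w+|[^\w\s]")

# Illustrative CMC-aware pattern (an assumption, far from exhaustive):
# try URLs, mentions, hashtags and simple emoticons before falling back.
CMC_AWARE = re.compile(
    r"https?://\S+"          # URLs
    r"|[@#]\w+"              # @mentions and #hashtags
    r"|[:;=][-^']?[)(DPp]"   # a few Western-style emoticons
    r"|\w+"                  # ordinary words
    r"|[^\w\s]"              # any other single symbol
)


def tokenize(pattern: re.Pattern, text: str) -> list[str]:
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


# Invented example message, not drawn from any CoMeRe corpus.
message = "@anna c u 2nite :-) #party http://comere.org"
naive_tokens = tokenize(NAIVE, message)
cmc_tokens = tokenize(CMC_AWARE, message)
```

On this message the naive pattern produces 17 tokens and breaks `:-)` into three pieces, whereas the second pattern produces 7 tokens and keeps the mention, emoticon, hashtag and URL whole. Real CMC tokenizers need far richer rules than this.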
Our subject and goals 
Our goals:
• work on solutions for these desiderata
• develop suggestions for standards for
- packaging and sharing (mono- and multimodal) CMC
corpora,
- modeling these types of “texts” within a framework which is
conformant with the encoding framework of the Text
Encoding Initiative (TEI) and thus with a widely accepted de facto
standard in the field of Digital Humanities,
- processing and annotating these corpora (part-of-speech,
normalization, ...) with NLP tools.
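As a rough idea of what TEI-style modeling of a single CMC message could look like, the following Python sketch builds one XML element with the standard library. The `<post>` element with `who` and `when` attributes is an assumption loosely inspired by the SIG draft schema, the TEI namespace is omitted for brevity, and the identifiers and message text are invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the reserved "xml" prefix, needed to write xml:id.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_post(post_id: str, who: str, when: str, text: str) -> ET.Element:
    """Build one <post> element (element and attribute names are assumptions)."""
    post = ET.Element("post", {
        f"{{{XML_NS}}}id": post_id,  # serialized as xml:id
        "who": who,                  # pointer to a participant description
        "when": when,                # timestamp of the message
    })
    paragraph = ET.SubElement(post, "p")
    paragraph.text = text
    return post


# Hypothetical values, for illustration only.
post = make_post("example-p1", "#participant1", "2014-11-21T10:15:00", "slt ca va ?")
xml_string = ET.tostring(post, encoding="unicode")
```

Keeping author, time and content in separate, explicit slots is what makes such messages queryable alongside other TEI-encoded corpora.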
Who belongs to our community (so far)? 
Our kernel projects 
and founding members 
http://glottoweb.org/web2corpus/
http://hdl.handle.net/11403/comere 
French CMC corpora 
Infrastructure for languages 
National consortium on corpora 
National infrastructure 
for Digital Humanities 
Scientific network 
„Empirical research of CMC“ 
http://www.empirikom.net 
Dortmund Chat Corpus 
http://www.chatkorpus.tu-dortmund.de 
German Reference Corpus of CMC 
http://www.tinyurl.com/derik-llc 
Wikipedia corpus in DeReKo 
(Mannheim) 
German CMC corpora 
Dutch CMC corpora 
SoNaR 
(Stevin Nederlandstalig Referentiecorpus) 
Italian CMC pilot corpus
Activities and initiatives (past and future) 
2013, 2014
- European workshops on CMC corpora (Dortmund)
- special journal issue (JLCL)
Our pathway
- 2013: creation of the TEI-CMC SIG
- End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
- 2015: application to CLARIN-DE: transform existing German corpora into TEI-CMC
- October 2015: international CMC conference, Rennes (Ledegen)
- 2015: submission of the TEI-CMC model
- 2015: launch of a larger CMC-corpora community
- 2016: common system of basic CMC annotations (POS tagging)
Project supported by the national 
consortium Corpus-écrits, sub-part of 
Huma-Num, and Ortolang 
Consortium Corpus-écrits 
Objective: Kernel corpus assembling existing corpora of different CMC 
genres and new corpora built from data extracted from the Internet. These
heterogeneous corpora will be structured and processed in a uniform way, 
complemented with metadata. CoMeRe will be released as OpenData 
through the national infrastructure Ortolang, following constraints which will 
be reused for the forthcoming “Corpus de Référence du Français”. 
Variety + Standards + Open Access 
http://comere.org 
http://hdl.handle.net/11403/comere
Individual depositor
LRL local server
Engineer: Kun Jin
Quality group
Discussion with depositor
NLP annotation group: TEI-v2
TEI-v1
Funding: ORTOLANG > Corpus-écrits > LRL
Ref                              Tokens   Partici.        Posts                                           Envir.
(Antoniadis, 2014)               449 313  359             22 052                                          SMS
(Falaise, 2014)                  35 M     25 000          3 M                                             textchat
(Ledegen, 2014)                  357 000  850             22 000                                          SMS
(Reffay et al., 2014)            600 000  67 + 4 groups   textchat: 6 790; emails: 2 030; forums: 2 686   LMS
(Yun, Chanier, 2014)             77 605   31 + 2 courses  7 750                                           textchat
(Abendroth-Timmer et al., 2014)  273 546  26 + 4 groups   1 200                                           Blog
(Longhi, Marinica, 2014)         567 851  205             34 273                                          Tweet
Domain labels shown on the slide (row alignment lost): informal, business, informal, informal, education, education, education, politics
Verbal
- Mono-mode, mono-modality: textchat, forum, SMS, tweets, email, blogs (image not a means of interaction)
- Multi-modalities: LMS (email, forum, chat)
- Multi-modes: conference system (audiochat, textchat)
Verbal & non-verbal
- Conference system, 3D environment, etc.: audiochat, textchat, icons, Collec prod (whiteboard, word processor, semantic maps), avatars, …
Time(s) 
Interaction Space
Locations 
Course 
Session 
Channel 
Simultaneity 
Participants 
Environments 
Author 
Addressee(s)
Group 
Network
http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI 
New macro-level elements
1.5 min video
* Paper: (Wigham & Chanier, 2013), CALL journal
* Data: (Wigham, 2013), LETEC corpus
Modality interplay
Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
Multimodality: verbal and non-verbal
(Wigham & Chanier, 2013)
Context: Lyceum conferencing environment, 3 learners (English L2) working in
a word processor: one writing, the others helping
Collaborative word processor
Audio: clarification
Textchat: correction (with error)
Textchat: request for confirmation
Now in TEI-speech
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
• the user is authorized to download a copy of the corpus […]
• reuse (reproduction, distribution) of non-substantial parts of corpus XXX is authorized […]
• reuse is subject to the condition of citing in full, as credits: […]
• reuse (reproduction, distribution) of substantial parts of corpus XXX is not permitted under
this licence of use.
I agree to these terms of use (required in order to access the corpus)
Example of a corpus licence displayed on the National Infrastructure for Digital
Humanities and considered to be "open access"
Viewing but not re-using: is that OA?
Creation of the CoMeRe corpus bank: a partnership between Corpus-écrits, ORTOLANG and TEI-CMC

Editor's Notes

  1. I'll do a final edit of this slide in the next few days (in Rome, before the meeting...)
  2. Talk about citations / references
  3. Structured research log: created by the researcher, not by a documentalist. CoMeRe repository, Polititweets. OLAC: reduced metadata for CLARIN. Skip one level for Polititweets. Go into the detail of Polititweets: PDF manual. Then Simuligne: diversity with the LMS, participants.
  4. In the first case, one can correct by hand. Unfortunately, discussions are organized in very varied ways. Quite often authors do not follow these guidelines. Figure 3‑3 gives an illustration. One person explicitly types the characters "Réponse :" at the start of their text, then seems to sign using the indentation mark, for this signature only. Here the signature gives only an IP address and the date. It is hard to tell where the first author's text ends. The person replying appears to intervene twice, without respecting the formats, and seems to end with a signature indication, Curry (though not in the Wikipedia sense). If we examine the link attached to this last word, we find not an author page but a general Wikipedia page (cf. Figure 3‑4)! Processing such pages automatically is therefore problematic.
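     A minimal Python sketch of why such talk pages resist automatic processing. It is a simplified heuristic written for this note, not the project's actual pipeline: it assumes one turn per line, reads reply depth from leading colons, and recognizes a signature only when a `[[User:…]]` or `[[Utilisateur:…]]` link is present. The sample text is invented.

```python
import re

# Simplified signature detector (an assumption): a wiki link to a user page.
SIGNATURE = re.compile(r"\[\[(?:User|Utilisateur):([^|\]]+)")


def split_turns(wikitext: str) -> list[dict]:
    """Split talk-page text into turns, one per non-empty line."""
    turns = []
    for line in wikitext.splitlines():
        if not line.strip():
            continue
        # Reply depth = number of leading colons used for indentation.
        depth = len(line) - len(line.lstrip(":"))
        match = SIGNATURE.search(line)
        turns.append({
            "depth": depth,
            "author": match.group(1) if match else None,
            "text": line.lstrip(":").strip(),
        })
    return turns


# Invented sample: one well-formed turn, then two that ignore the conventions.
sample = """Is this date correct? [[Utilisateur:Alice|Alice]] 21 novembre 2014 à 10:15 (CET)
:Réponse : yes, see the source.
::192.0.2.1 21 novembre 2014
"""
turns = split_turns(sample)
unattributed = [t for t in turns if t["author"] is None]
```

     Only the first turn gets an author; the typed "Réponse :" and the bare IP address with a date are left unattributed, which is the kind of case that has to be corrected by hand.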
  5. An Interaction Space is an abstract concept, located in time (with a beginning and ending date in absolute time, hence a time frame), where interactions between a set of participants occur within an online location. The online location is defined by the properties of the set of environments used by the set of participants.
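     The definition above can be sketched as a small data structure. This is an illustrative Python model only: the class and field names are assumptions, not the names used in the TEI-CMC schema, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Environment:
    """One environment used by the participants (hypothetical model)."""
    name: str  # e.g. "textchat", "audiochat", "forum"


@dataclass
class InteractionSpace:
    """Time frame + participants + the environments defining the location."""
    begin: datetime
    end: datetime
    participants: set[str] = field(default_factory=set)
    environments: list[Environment] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A time frame needs a beginning that does not follow its end.
        if self.end < self.begin:
            raise ValueError("end must not precede begin")

    def contains(self, moment: datetime) -> bool:
        """True if the given absolute time falls inside the time frame."""
        return self.begin <= moment <= self.end

    def location(self) -> frozenset[str]:
        """The online location, described by the environments in use."""
        return frozenset(env.name for env in self.environments)


# Invented example session.
session = InteractionSpace(
    begin=datetime(2014, 11, 21, 10, 0),
    end=datetime(2014, 11, 21, 11, 0),
    participants={"tutor", "learner1", "learner2"},
    environments=[Environment("audiochat"), Environment("textchat")],
)
```

     Deriving the location from the environments, rather than storing it separately, mirrors the statement that the location is defined by the environments the participants use.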
  6. In one of our papers, which will appear in the CALL journal (the corresponding data are already online in Mulce), Ciara Wigham discusses the interplay between audio and text chat. Here is an extract from Archi21. The left column gives the transcription of the audio of one learner, who presents his feelings about the ongoing process of his architectural project. He is a native French speaker and speaks English as his L2. The three columns on the right contain text chat turns from the tutor and two other learners belonging to the same architectural project group. Let me show you a short video. **** In this example of conversation doubling, the acts in the text chat respond to the voice chat (blue arrows), but acts in the voice chat equally respond to the text chat (orange arrows), and text chat acts respond to interaction in both the voice chat and text chat modalities and prompt interaction in both modalities.
  7. http://88milsms.huma-num.fr/corpus.html
  8. There are 3 main criteria that research data should meet in order to be considered OpenData. Besides being available, the interesting point is that the data can be accessed in order to be reused and mixed with other data, and the licence should explicitly say so. The second interesting point is that the constraints on reuse should be reduced to a minimum: the definition stipulates that ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes, are not allowed.
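     These criteria can be turned into a simple checklist. The Python sketch below is only an illustration: the field names are hypothetical, and the encoding of the licence quoted on slide 33 is one reading of it (download allowed, reuse of substantial parts not permitted, mixing with other data not granted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenceTerms:
    """Hypothetical summary of what a data licence allows."""
    can_download: bool
    can_reuse_whole: bool
    can_mix_with_other_data: bool
    non_commercial_only: bool
    purpose_restricted: bool


def open_data_problems(terms: LicenceTerms) -> list[str]:
    """List the criteria from the note that the licence fails to meet."""
    problems = []
    if not terms.can_download:
        problems.append("data are not available")
    if not (terms.can_reuse_whole and terms.can_mix_with_other_data):
        problems.append("reuse and mixing are not explicitly allowed")
    if terms.non_commercial_only or terms.purpose_restricted:
        problems.append("use is restricted")
    return problems


# One reading of the licence quoted on slide 33 (an assumption).
slide_33_licence = LicenceTerms(
    can_download=True,
    can_reuse_whole=False,
    can_mix_with_other_data=False,
    non_commercial_only=False,
    purpose_restricted=False,
)
problems = open_data_problems(slide_33_licence)
```

     Under this reading the licence passes the availability test but fails on reuse, which is the point of the question asked on the slide.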