The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Presentation by Dieter Van Uytvanck (CLARIN) from 'The Perfect Swell' workshop on text and data mining, held on 27 September 2013.


  1. The Perfect Swell: Workshop on Text and Data Mining for Data Driven Innovation
     The research infrastructure perspective
     Dieter Van Uytvanck, Max Planck Institute for Psycholinguistics, Dieter.VanUytvanck@mpi.nl
     TDM workshop, London, 2013-09-27
  2. CLARIN?
     • Common Language Resources and Technology Infrastructure
     • aims at providing easy and sustainable access for scholars in the humanities and social sciences
        • to digital language data (in written, spoken, video or multimodal form)
        • to advanced tools to discover, explore, exploit, annotate, analyse or combine them
        • independent of where they are located: a shared distributed infrastructure
     • More information: www.clarin.eu
  3. Language resources: rich variety
     • Modality: written, spoken, signed
     • Additional channels: eye movements, gestures, neuroimaging data (EEG, fMRI, …), etc.
     [Figure labels: Annotations; Data: the basis for research]
  4. Language resources: rich variety
     • Location:
        • data from all over the world (including some very remote corners)
        • … and from the world wide web, smartphones, …
     • Time:
        • old historic collections (hieroglyphs, manuscripts, rock carvings, …), often OCR’ed, digitised and annotated
        • up to real-time data gathered from social networks
     • Origin:
        • elicited (experiments)
        • natural language use (“in the wild”)
  5. Data mining in CLARIN
     • very important paradigm in language resource processing
     • major shift from rule-based to data-driven systems
     • not only text, also multimedia
     • importance of
        • access to primary data for fellow researchers: TDM needs access to whole works, not only to snippets and sentences
           • replicating experiments is extremely important
        • technical support: virtual collections make it possible to refer to large online data sets
        • a safe legal setting for researchers (license signing does not scale to 500,000 texts that are automatically collected from thousands of websites)
  6. Data mining in CLARIN
     • some examples to demonstrate the variation and nature of data mining based on language resources
  7. Some examples (1)
     • Mass text analysis (Petersen et al., 2012): doi:10.1038/srep00313
  8. Some examples (2)
     • AUVIS face/hand tracking analysis: http://tla.mpi.nl/projects_info/auvis/
     [Figure: Head/Hands Tracking]
  9. Some examples (3)
     • Stylometry and plagiarism detection: http://www.clips.ua.ac.be/category/projects/stylometry
        • e.g. Mike Kestemont, http://www.mike-kestemont.org/?p=362
  10. Some examples (4)
      • Language evolution analysis with phylogenetic trees (Bouckaert et al., 2012), doi:10.1126/science.1219669
      [Excerpt from the paper as shown on the slide:]
      At the other extreme, we fit a “sailor” model with no reluctance to move into water and rapid movement across water. Consistent with the findings based on the RRW model, each of the landscape-based models supports the Anatolian farming theory of Indo-European origin (Table 1).
      Our results strongly support an Anatolian homeland for the Indo-European language family. The inferred location (Fig. 1) and timing [95% highest posterior density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin is congruent with the proposal that the family began to diverge with the spread of agriculture from
      Fig. 2. Map and maximum clade credibility tree showing the diversification of the major Indo-European subfamilies. The tree shows the timing of the emergence of the major branches and their subsequent diversification. The inferred location at the root of each subfamily is shown on the map, colored to match the corresponding branches on the tree. Albanian, Armenian, and Greek subfamilies are shown separately for clarity (inset). Contours represent the 95% (largest), 75%, and 50% HPD regions, based on kernel density estimates (15).
      Phylogeographic analysis | Bayes factor: Anatolian vs. steppe I | Bayes factor: Anatolian vs. steppe II
      RRW: All languages | 175.0 | 159.3
      RRW: Ancient languages only | 1404.2 | 1582.6
      RRW: Contemporary languages only | 12.0 | 11.4
      Landscape aware: Diffusion | 298.2 | 141.9
      Landscape aware: Migration from land into water less likely than from land to land by a factor of 10 | 197.7 | 92.3
      Landscape aware: Migration from land into water less likely than from land to land by a factor of 100 | 337.3 | 161.0
      Landscape aware: Sailor | 236.0 | 111.7
  11. The research infrastructure role
      • Data sets:
         • Long-term preservation (archiving)
         • Making them citable (persistent identifiers) and findable (metadata)
         • Making access easier with federated login
      • Lowering the threshold to use advanced software
         • offer web front-ends, web service chains
         • cooperation with computing centres for heavy tasks
      • Know-how building & support
         • about the nature of the resources and tools
         • technical matters
         • legal issues
  12. Legal perspective on resources
      • Rough classification of language resources available via the CLARIN centres:
         • Public
            • full access, no restrictions at all
            • e.g. parallel corpora from the EU Parliament
         • Academic
            • available for all academic users
            • e.g. the Spoken Dutch Corpus (radio recordings, …)
         • Restricted
            • everything more restricted than Academic, i.e. personalised access rules
            • e.g. video from doctor-patient interaction
  13. Legal perspective on resources
      • CLARIN recommends CC licenses for new resources, as this is the least problematic option for all in the long run. Such resources can be made publicly available.
      • For older material, we try to distribute it as freely as can be negotiated. For this material we offer two categories:
         • resources free for researchers
         • resources requiring individual permission by the owner
      • Note that not everything is about copyright.
         • We also have to deal with personal data, which can only be provided for a limited time to individual researchers unless it is anonymized.
         • Ethical perspectives should also be taken into account (e.g. asking participants at the time of recording whether they are OK with data mining/processing).
  14. Technical Perspective (1)
      • The above restrictions can be realized by requiring:
         • PUB: no identification of the user and no individual permission, i.e. the resources are free for all and publicly available.
         • ACA: identification of the user, but no individual permission, e.g. CLARIN-distributed resources for academic use.
         • RES: identification of the user and individual usage permission, i.e. the resources are restrictedly available to individual researchers, e.g. resources containing personal data.
  15. Technical Perspective (2)
      • Federated Identity Management (“Shibboleth”)
         • allows users to access resources at a remote server with their institutional credentials
         • makes it relatively straightforward to recognize academic users and grant them access to restricted resources
         • details: http://clarin.eu/node/3788
  16. Future perspective for legal exception framework
      • As we in CLARIN are capable of
         • identifying researchers and
         • protecting the resources from other users,
      • CLARIN already has all the technical prerequisites needed for implementing and supervising a broad research exception in the EU such as the one already in effect in the Netherlands.
  17. Conclusion
      • Data mining plays an increasingly important role in (language resource-based) research
      • Research infrastructures try to help academics make efficient use of the existing resources and tools
      • Many technical issues have been addressed already (e.g. authentication of researchers)
      • We hope the remaining legal (copyright) issues can be addressed by a research exception (or, likewise, a concept of fair use)
  18. Acknowledgement
      • Thanks to Krister Lindén and Erik Ketzan from the CLARIN legal issues committee for their valuable input!
      • Thank you for your attention!
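
The PUB / ACA / RES rules from the two "Technical Perspective" slides can be read as a small access decision. The following is a minimal sketch of that reading, not CLARIN's actual implementation. All names (`AccessClass`, `User`, `is_academic`, `may_access`) are hypothetical. It assumes that the federated login hands the service a semicolon-separated list of scoped affiliations such as `staff@example.org`, and that the set of roles treated as "academic" is a local policy choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional

# Assumption: which affiliation roles count as "academic" is a policy choice,
# not something the slides specify.
ACADEMIC_ROLES = frozenset({"faculty", "student", "staff", "employee", "member"})


class AccessClass(Enum):
    PUB = "public"      # no identification, no individual permission
    ACA = "academic"    # identification, no individual permission
    RES = "restricted"  # identification and individual permission


@dataclass(frozen=True)
class User:
    # user_id is None for an anonymous, unidentified visitor.
    user_id: Optional[str] = None
    # Assumed format: "role@scope;role@scope", as passed on after federated login.
    scoped_affiliations: str = ""


def is_academic(user: User) -> bool:
    """Return True if any scoped affiliation carries an academic role."""
    for value in user.scoped_affiliations.split(";"):
        role, sep, scope = value.strip().partition("@")
        if sep and scope and role.lower() in ACADEMIC_ROLES:
            return True
    return False


def may_access(
    access_class: AccessClass,
    user: User,
    permitted_user_ids: FrozenSet[str] = frozenset(),
) -> bool:
    """Decide access for one resource according to its access class."""
    if access_class is AccessClass.PUB:
        return True
    if user.user_id is None:
        # ACA and RES both require the user to be identified.
        return False
    if access_class is AccessClass.ACA:
        return is_academic(user)
    # RES: the resource owner must have granted this individual permission.
    return user.user_id in permitted_user_ids
```

In this sketch an anonymous visitor only reaches PUB resources, an identified academic user also reaches ACA resources, and RES resources additionally require the user's identifier to appear in a per-resource permission list.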