IVACS 2010

- Adolphs, S., Carter, R. and Knight, D.

- Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis.

  • SVENJA
  • SVENJA
  • DAWN
  • DAWN
  • DAWN
  • DAWN
  • DAWN
  • SVENJA
  • SVENJA
  • SVENJA
  • SVENJA
  • DAWN
  • DAWN
  • SVENJA and DAWN
  • SVENJA and DAWN
  • IVACS 2010

    1. Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis. Dawn Knight, Svenja Adolphs and Ronald Carter
    2. DRS: Digital Replay System (video demo)
    3. Current multi-modal corpora: key ‘types’
       • 1. Developed by researchers whose primary concern is ‘how to improve human-computer interaction’ (Gu, 2006: 132). Here corpus construction is seen as an exploratory exercise: to discover the breadth and depth of multi-modal datasets that can be assembled, and how this is best achieved
       • 2. Social-science-motivated studies, focusing on ‘multi-modal and multimedia studies of discourse’, where the primary concern is the behaviour of ‘human beings’ (Gu, 2006: 132)
    4. Multi-modal corpora: types (based on Oertel et al., 2010)
    5. Limitations of multi-modal corpora
       • Most are relatively small in size. The largest, the AMI (business meeting) corpus, contains 100 hours of video, but only a small number of these have been transcribed/annotated
       • The majority record in lab-type settings, with fixed cameras and up to 4-5 static, seated participants
       • Many are specialist and context/domain specific: the extent to which data is ‘naturalistic’ or ‘real-life’ can be challenged
       • The availability of data and software is generally limited to institutions/funding bodies responsible for constructing datasets
    6. Considerations for multi-modal corpora
       • Design and infrastructure: What the data in the corpus looks like; what sorts of recordings are included and the basic design methodology used to collect, compile and represent this data
       • Size and scope: Amount of data (in terms of hours and/or word count) and the variation in the types included (in terms of the range of speakers, the different contexts included and so on)
    7. Requirements for multi-modal corpora
       • Quality: The resolution of the video and audio recordings, and the level of detail in the annotations and/or metadata associated with these
       • Authenticity: How ‘natural’ or real the data is; whether it is scripted and/or structured or more spontaneous
       • Availability and (re)usability: Access rights to data; whether corpora are published and can be utilised and analysed by other researchers
    8. DReSS II: Research aims and objectives
       • To produce digital records, which combine familiar forms of data with computational recordings of interaction
          • Not only to record novel forms of data but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society
          • Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society
       • These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory?
    9. Ubiquitous (heterogeneous) corpora
    10. Data Types
       • ‘Data’ types include (in progress):
          • SMS/MMS messaging
          • Blogging
          • Chat room and message board discourse
          • Email usage
          • Face-to-face situated discourse
          • GPS or manual map-based tracking
          • Web browsing activity (automated logging of sites)
          • Phone calls (home and/or mobile)
          • Text messaging
          • Video calls (mobile or online, e.g. Skype)
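One way to picture how such heterogeneous data types might sit side by side in a single corpus is a common record shape. The following is only a minimal sketch under stated assumptions: the class name `CorpusRecord`, its fields and the type labels are hypothetical illustrations, not the DRS data model or anything specified on the slides. The slide's separate ‘Text messaging’ entry is folded into the `sms_mms` label here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical labels for the data types listed on the slide (not a DRS schema).
DATA_TYPES = {
    "sms_mms", "blog", "chat_or_message_board", "email", "face_to_face",
    "gps_or_map_tracking", "web_browsing", "phone_call", "video_call",
}


@dataclass
class CorpusRecord:
    """One item in a heterogeneous corpus (illustrative field names)."""
    data_type: str                  # one of DATA_TYPES
    participant: str                # anonymised participant identifier
    timestamp: datetime             # when the event took place
    location: Optional[str] = None  # e.g. a GPS fix or place name, if recorded
    text: str = ""                  # transcribed or text-based content, if any

    def __post_init__(self) -> None:
        # Reject labels outside the assumed inventory of data types.
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")


# Invented example values, for illustration only.
example = CorpusRecord("sms_mms", "P01", datetime(2010, 6, 1, 9, 5),
                       text="running late")
```

Keeping device/data type, time, location and participant on every record is what would later allow searches within and across those domains.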
    11. A day in the life of your language
    12. Software requirements
       • The ability to search data and metadata in a principled and specific way (encoded and/or transcribed text-based data), within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions.
       • Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing these).
       • Flexible provisions for transcription, including the ability to represent, for example, simultaneous speech and speaker overlaps.
       • Graphing tools for mapping the incidence of words or events, for example over time, and for comparing sub-corpora and domain-specific characteristics.
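The first two requirements above (domain-restricted search and frequency profiling with raw counts) can be sketched in a few lines. This is an illustrative toy, not how DRS implements either feature: the record layout, the function names `search` and `frequency_by`, and all sample data are invented for the example.

```python
from collections import Counter
from datetime import datetime

# Toy records; every value here is invented for illustration.
records = [
    {"data_type": "sms", "participant": "A",
     "time": datetime(2010, 6, 1, 9, 5), "text": "running late see you soon"},
    {"data_type": "email", "participant": "A",
     "time": datetime(2010, 6, 1, 9, 40), "text": "see attached agenda for the meeting"},
    {"data_type": "sms", "participant": "B",
     "time": datetime(2010, 6, 1, 14, 10), "text": "see you at the meeting"},
    {"data_type": "blog", "participant": "B",
     "time": datetime(2010, 6, 1, 21, 30), "text": "a long day but a good meeting"},
]


def search(records, term, data_type=None, participant=None):
    """Return records containing `term` as a whole word, optionally
    restricted to one data type and/or one participant."""
    hits = []
    for r in records:
        if data_type is not None and r["data_type"] != data_type:
            continue
        if participant is not None and r["participant"] != participant:
            continue
        if term.lower() in r["text"].lower().split():
            hits.append(r)
    return hits


def frequency_by(records, term, key):
    """Raw counts of `term`, grouped by whatever `key(record)` returns
    (for example the data type, or the hour of day)."""
    counts = Counter()
    for r in records:
        n = r["text"].lower().split().count(term.lower())
        if n:
            counts[key(r)] += n
    return counts


by_type = frequency_by(records, "meeting", lambda r: r["data_type"])
by_hour = frequency_by(records, "see", lambda r: r["time"].hour)
```

Grouping by a caller-supplied `key` function is a deliberate simplification: the same profile can then be taken per data type, per participant or per time slot, and the resulting counts could be handed to any plotting tool to chart incidence over time.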
    13. DRS Demo: some new developments
    14. Open questions
       • How can we ensure that the data is sufficient, informative, useful? How much is ‘enough’?
       • How can we model context in a way that is meaningful to humans as well as to computers?
       • How will the development of heterogeneous corpora impact on:
          • Theories
          • Concepts
          • Methods
          • Guidelines
    15. Acknowledgements
       • Research team
       • The Digital Records for e-Social Science Project is funded by the ESRC.
