IVACS 2010



- Adolphs, S., Carter, R. and Knight, D.

- Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis.





IVACS 2010 Presentation Transcript

  • Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis Dawn Knight, Svenja Adolphs and Ronald Carter
  • DRS: Digital Replay System (video demo)
  • Current multi-modal corpora- key ‘types’
    • 1- Developed by researchers whose primary concern is ‘how to improve human-computer interaction’ (Gu, 2006: 132). Here corpus construction is seen as an exploratory exercise: to discover the breadth and depth of multi-modal datasets that can be assembled, and how this is best achieved
    • 2- Social science motivated studies, focusing on ‘multi-modal and multimedia studies of discourse’, where the primary concern is the behaviour of ‘human beings’ (Gu, 2006: 132)
  • Multi-modal corpora- types (based on Oertel et al., 2010)
  • Limitations of multi-modal corpora
    • Most are relatively small in size. The largest, the AMI (business meeting) corpus, contains 100 hours of video, but only a small number of these hours have been transcribed/annotated
    • The majority record in lab-type settings, with fixed cameras and up to 4-5 static, seated participants
    • Many are specialist and context/domain-specific; the extent to which the data is ‘naturalistic’ or ‘real-life’ can be challenged
    • The availability of data and software is generally limited to institutions/funding bodies responsible for constructing datasets
  • Considerations for multi-modal corpora
    • Design and infrastructure: Concerning what the data in the corpus looks like; what sorts of recordings are included and the basic design methodology used to collect, compile and represent this data
    • Size and scope : Amount of data (in terms of hours and/or word count) and the variation in the types included (in terms of the range of speakers or different contexts included and so on)
  • Requirements for multi-modal corpora
    • Quality: The resolution of the video and audio recordings, and the level of detail in the annotations and/or metadata associated with these
    • Authenticity : How ‘natural’ or real is the data; is it scripted and/or structured or more spontaneous
    • Availability and (re)usability : Access rights to data, whether corpora are published and can be utilised and analysed by other researchers
  • DReSS II: Research aims and objectives
    • To produce digital records, which combine familiar forms of data with computational recordings of interaction
      • Not only to record novel forms of data but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society
      • Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society
    • These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory ?
  • Ubiquitous (heterogeneous) corpora
  • Data Types
    • ‘Data’ types include (in progress):
      • SMS/ MMS messaging
      • Blogging
      • Chat room and message board discourse
      • Email Usage
      • Face-to-face situated discourse
      • GPS or manual map-based tracking
      • Web browsing activity (automated logging of sites)
      • Phone calls (home and/or mobile)
      • Text messaging
      • Video calls (mobile or online, e.g. Skype)
  • A day in the life of your language
  • Software requirements
    • The ability to search data and metadata (encoded and/or transcribed text-based data) in a principled and specific way, within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions.
    • Tools that allow for the frequency profiling of events/ elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing such).
    • Flexible provisions for transcription, including, for example, the ability to represent simultaneous speech and speaker overlaps.
    • Graphing tools for mapping the incidence of words or events, for example, over time and for comparing sub-corpora and domain specific characteristics.
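
The search, frequency-profiling and incidence-over-time requirements above can be illustrated with a minimal Python sketch. This is not DRS code: the `Record` fields, the function names and the sample messages are all hypothetical, chosen only to mirror the three domains named above (data type, time/location, participants).

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """One item in a hypothetical heterogeneous corpus (assumed schema)."""
    data_type: str        # e.g. "sms", "email", "face-to-face"
    timestamp: datetime   # when the contribution was made
    location: str         # coarse label; real data might hold GPS coordinates
    participant: str      # anonymised speaker/writer id
    text: str             # transcribed or encoded text


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def search(records, term, data_type=None, participant=None, start=None, end=None):
    """Return records containing `term`, optionally restricted by domain."""
    hits = []
    for r in records:
        if term.lower() not in tokenize(r.text):
            continue
        if data_type is not None and r.data_type != data_type:
            continue
        if participant is not None and r.participant != participant:
            continue
        if start is not None and r.timestamp < start:
            continue
        if end is not None and r.timestamp >= end:
            continue
        hits.append(r)
    return hits


def frequency_profile(records, key):
    """Raw token counts grouped by one domain attribute, e.g. 'data_type'."""
    profile = {}
    for r in records:
        profile.setdefault(getattr(r, key), Counter()).update(tokenize(r.text))
    return profile


def incidence_by_hour(records, term):
    """Count occurrences of `term` per hour of day, ready for graphing."""
    counts = Counter()
    for r in records:
        n = tokenize(r.text).count(term.lower())
        if n:
            counts[r.timestamp.hour] += n
    return counts


# Invented sample data, for illustration only.
records = [
    Record("sms", datetime(2010, 6, 1, 8, 15), "home", "P1", "Running late, see you soon"),
    Record("email", datetime(2010, 6, 1, 9, 30), "office", "P1", "See the attached agenda"),
    Record("face-to-face", datetime(2010, 6, 1, 9, 45), "office", "P2", "Yeah, I see what you mean"),
    Record("sms", datetime(2010, 6, 1, 18, 5), "home", "P2", "See you tomorrow"),
]

print(len(search(records, "see", data_type="sms")))
print(frequency_profile(records, "data_type")["sms"].most_common(3))
print(sorted(incidence_by_hour(records, "see").items()))
```

Keeping every record in one flat schema is only one possible design; it makes cross-domain queries simple, but a real tool would also need to handle time-aligned video and audio, which this sketch ignores.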
  • DRS Demo- some new developments
  • Open questions
    • How can we ensure that the data is sufficient, informative, useful... how much is ‘enough’?
    • How can we model context in a way that is meaningful to humans as well as to computers?
    • How will the development of heterogeneous corpora impact on:
        • Theories
        • Concepts
        • Methods
        • Guidelines
  • Acknowledgements: Research team. The Digital Records for e-Social Science Project is funded by the ESRC.