Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis
Dawn Knight, Svenja Adolphs and Ronald Carter
DRS: Digital Replay System (video demo)
Current multi-modal corpora: key ‘types’
1- Developed by researchers whose primary concern is ‘how to improve human-computer interaction’ (Gu, 2006: 132). Here corpus construction is seen as an exploratory exercise: to discover the breadth and depth of multi-modal datasets that can be assembled, and how this is best achieved
2- Social science motivated studies, focusing on ‘multi-modal and multimedia studies of discourse’, where the primary concern is the behaviour of ‘human beings’ (Gu, 2006: 132)
Multi-modal corpora- types  (based on Oertel et al., 2010)
Limitations of multi-modal corpora
  • Most are relatively small in size. The largest, the AMI (business meeting) corpus, contains 100 hours of video, but only a small proportion of this has been transcribed/annotated
  • The majority were recorded in lab-type settings, with fixed cameras and up to 4-5 static, seated participants
  • Many are specialist and context/domain specific, so the extent to which the data is ‘naturalistic’ or ‘real-life’ can be challenged
  • The availability of data and software is generally limited to the institutions/funding bodies responsible for constructing the datasets
Considerations for multi-modal corpora
  • Design and infrastructure: What the data in the corpus looks like; what sorts of recordings are included, and the basic design methodology used to collect, compile and represent this data
  • Size and scope: The amount of data (in terms of hours and/or word count) and the variation in the types included (in terms of the range of speakers, the different contexts included and so on)
Requirements for multi-modal corpora
  • Quality: The resolution of the video and audio recordings, and the level of detail in the annotations and/or metadata associated with them
  • Authenticity: How ‘natural’ or real the data is; whether it is scripted and/or structured, or more spontaneous
  • Availability and (re)usability: Access rights to the data; whether corpora are published and can be utilised and analysed by other researchers
DReSS II: Research aims and objectives
  • To produce digital records which combine familiar forms of data with computational recordings of interaction
  • Not only to record novel forms of data, but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society
  • Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society
  • These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory
Ubiquitous (heterogeneous) corpora
Data types
‘Data’ types include (in progress):
  • SMS/MMS messaging
  • Blogging
  • Chat room and message board discourse
  • Email usage
  • Face-to-face situated discourse
  • GPS or manual map-based tracking
  • Web browsing activity (automated logging of sites)
  • Phone calls (home and/or mobile)
  • Text messaging
  • Video calls (mobile or online, e.g. Skype)
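One way to picture how such heterogeneous sources might be brought together is a single time-ordered record of a participant's day. The sketch below is illustrative only and is not how DRS stores data: the `Event` type, the `merge_streams` helper and all sample values are hypothetical, and it assumes each source is already sorted by time.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """One logged item from any source (hypothetical schema)."""
    timestamp: datetime
    data_type: str   # e.g. "sms", "email", "gps"
    content: str


def merge_streams(*streams):
    """Merge per-source event lists, each already sorted by time,
    into one chronological timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


# Invented sample data for a single day.
sms = [
    Event(datetime(2010, 6, 1, 8, 15), "sms", "running late"),
    Event(datetime(2010, 6, 1, 17, 40), "sms", "home soon"),
]
email = [Event(datetime(2010, 6, 1, 9, 5), "email", "agenda attached")]
gps = [Event(datetime(2010, 6, 1, 8, 30), "gps", "52.94,-1.19")]

timeline = merge_streams(sms, email, gps)
for event in timeline:
    print(event.timestamp.isoformat(), event.data_type, event.content)
```

Keeping the data type on every event means the merged timeline can still be split back into its sources, or filtered by device, later on.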
A day in the life of your language
Software requirements
  • The ability to search data and metadata (encoded and/or transcribed text-based data) in a principled and specific way, within and/or across the three global domains of data: devices/data type(s); time and/or location; and participants/given contributions
  • Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing these)
  • Flexible provisions for transcription, including the ability to represent, for example, simultaneous speech and speaker overlaps
  • Graphing tools for mapping the incidence of words or events (for example, over time) and for comparing sub-corpora and domain-specific characteristics
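The first two requirements can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the DRS implementation: the `Record` schema, the `search` and `frequency_profile` functions, the simple regex tokeniser and all sample data are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """One contribution, tagged for the three global domains (hypothetical schema)."""
    participant: str
    data_type: str        # device / data type, e.g. "sms", "email"
    timestamp: datetime   # time
    location: str         # location
    text: str             # transcribed or text-based content


def tokenize(text):
    """Very simple lower-case word tokeniser (an assumption, not a standard)."""
    return re.findall(r"[a-z']+", text.lower())


def search(records, term, *, data_type=None, participant=None, location=None):
    """Return records containing `term`, optionally restricted by domain."""
    term = term.lower()
    hits = []
    for r in records:
        if data_type is not None and r.data_type != data_type:
            continue
        if participant is not None and r.participant != participant:
            continue
        if location is not None and r.location != location:
            continue
        if term in tokenize(r.text):
            hits.append(r)
    return hits


def frequency_profile(records, term, key):
    """Raw counts of `term`, grouped by whatever `key(record)` returns."""
    term = term.lower()
    counts = Counter()
    for r in records:
        counts[key(r)] += tokenize(r.text).count(term)
    return counts


# Invented sample data.
records = [
    Record("P1", "sms", datetime(2010, 1, 1, 9, 0), "home", "ok see you later"),
    Record("P1", "email", datetime(2010, 1, 1, 11, 30), "office",
           "OK, the meeting is at two. Ok?"),
    Record("P2", "face-to-face", datetime(2010, 1, 1, 13, 0), "cafe",
           "yeah that sounds fine"),
]

by_type = frequency_profile(records, "ok", key=lambda r: r.data_type)
by_hour = frequency_profile(records, "ok", key=lambda r: r.timestamp.hour)
office_hits = search(records, "ok", location="office")
```

Passing the grouping as a `key` function lets the same profile be taken by data type, by hour of the day or by participant, and the resulting raw counts could then feed the statistical and graphing tools listed above.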
DRS Demo- some new developments
Open questions
  • How can we ensure that the data is sufficient, informative and useful? How much is ‘enough’?
  • How can we model context in a way that is meaningful to humans as well as to computers?
  • How will the development of heterogeneous corpora impact on theories, concepts, methods and guidelines?
Acknowledgements
Research team
The Digital Records for e-Social Science Project is funded by the ESRC.

IVACS 2010

