IVACS 2010

Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis
Dawn Knight, Svenja Adolphs and Ronald Carter

DRS: Digital Replay System (video demo)

Current multi-modal corpora- key ‘types’
1- Developed by researchers whose primary concern is ‘how to improve human-computer interaction’ (Gu, 2006: 132). Here corpus construction is seen as an exploratory exercise: to discover the breadth and depth of multi-modal datasets that can be assembled, and how this is best achieved.
2- Social-science-motivated studies, focusing on ‘multi-modal and multimedia studies of discourse’, where the primary concern is the behaviour of ‘human beings’ (Gu, 2006: 132).

Multi-modal corpora- types (based on Oertel et al., 2010)

Limitations of multi-modal corpora
Most are relatively small in size. The largest, the AMI (business meeting) corpus, contains 100 hours of video, but only a small number of these have been transcribed/annotated.
The majority record in lab-type settings, with fixed cameras and up to 4-5 static, seated participants.
Many are specialist and context/domain specific; the extent to which the data is ‘naturalistic’ or ‘real-life’ can be challenged.
The availability of data and software is generally limited to the institutions/funding bodies responsible for constructing the datasets.

Considerations for multi-modal corpora
Design and infrastructure: What the data in the corpus looks like; what sorts of recordings are included, and the basic design methodology used to collect, compile and represent this data.
Size and scope: The amount of data (in hours and/or word count) and the variation in the types included (for example, the range of speakers or of different contexts).

Requirements for multi-modal corpora
Quality: The resolution of the video and audio recordings, and the level of detail in the annotations and/or metadata associated with them.
Authenticity: How ‘natural’ or real the data is; whether it is scripted and/or structured, or more spontaneous.
Availability and (re)usability: Access rights to the data; whether corpora are published and can be utilised and analysed by other researchers.

DReSS II: Research aims and objectives
To produce digital records that combine familiar forms of data with computational recordings of interaction.
Not only to record novel forms of data, but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society.
Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society.
These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory.

Ubiquitous (heterogeneous) corpora

Data Types
‘Data’ types include (in progress):
SMS/MMS messaging
Blogging
Chat room and message board discourse
Email usage
Face-to-face situated discourse
GPS or manual map-based tracking
Web browsing activity (automated logging of sites)
Phone calls (home and/or mobile)
Text messaging
Video calls (mobile or online, e.g. Skype)

A day in the life of your language

Software requirements
The ability to search data and metadata (encoded and/or transcribed text-based data) in a principled and specific way, within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions.
Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing them).
Variability in the provisions for transcription, including the ability to represent, for example, simultaneous speech and speaker overlaps.
Graphing tools for mapping the incidence of words or events (for example, over time) and for comparing sub-corpora and domain-specific characteristics.
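To make the first two requirements concrete, here is a minimal Python sketch of searching across the three domains and profiling frequencies. It is not DRS code: the `Record` fields, the `search` and `frequency_profile` functions, and the sample data are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Hashable, Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """One logged event in a hypothetical heterogeneous corpus."""
    data_type: str       # domain 1: device / data type, e.g. "sms"
    timestamp: datetime  # domain 2: time
    location: str        # domain 2: location
    participant: str     # domain 3: participant / contributor
    text: str            # transcribed or encoded text


def search(records: Iterable[Record],
           term: Optional[str] = None,
           data_type: Optional[str] = None,
           participant: Optional[str] = None,
           start: Optional[datetime] = None,
           end: Optional[datetime] = None) -> List[Record]:
    """Return records matching every filter that is given (None = no filter)."""
    hits = []
    for r in records:
        # Whole-word, case-insensitive match on whitespace-separated tokens.
        if term is not None and term.lower() not in r.text.lower().split():
            continue
        if data_type is not None and r.data_type != data_type:
            continue
        if participant is not None and r.participant != participant:
            continue
        if start is not None and r.timestamp < start:
            continue
        if end is not None and r.timestamp > end:
            continue
        hits.append(r)
    return hits


def frequency_profile(records: Iterable[Record],
                      key: Callable[[Record], Hashable]) -> Counter:
    """Raw counts of records grouped by any domain attribute."""
    return Counter(key(r) for r in records)


# Invented sample data, for illustration only.
SAMPLE = [
    Record("sms", datetime(2010, 6, 1, 9, 15), "home", "A",
           "running late see you soon"),
    Record("email", datetime(2010, 6, 1, 10, 30), "office", "A",
           "see attached agenda"),
    Record("face-to-face", datetime(2010, 6, 1, 13, 0), "office", "B",
           "shall we see the demo"),
    Record("sms", datetime(2010, 6, 1, 18, 45), "home", "B",
           "home soon"),
]

if __name__ == "__main__":
    hits = search(SAMPLE, term="see")
    # Incidence of the word "see" by data type and by hour of day.
    print(frequency_profile(hits, key=lambda r: r.data_type))
    print(frequency_profile(hits, key=lambda r: r.timestamp.hour))
```

Passing the grouping attribute as a `key` function means the same counting routine can profile by data type, time, location or participant, and the resulting counts could be handed to any plotting tool to graph incidence over time.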

DRS Demo- some new developments

Open questions
How can we ensure that the data is sufficient, informative, useful... how much is ‘enough’?
How can we model context in a way that is meaningful to humans as well as to computers?
How will the development of heterogeneous corpora impact on theories, concepts, methods and guidelines?

Acknowledgements
Research team
The Digital Records for e-Social Science Project is funded by the ESRC.
