IVACS 2010


- Adolphs, S., Carter, R. and Knight, D.

- Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis.

  1. Second phase multi-modal corpora: Heterogeneous datasets for linguistic analysis. Dawn Knight, Svenja Adolphs and Ronald Carter
  2. DRS: Digital Replay System (video demo)
  3. Current multi-modal corpora: key ‘types’
     - Type 1: Developed by researchers whose primary concern is ‘how to improve human-computer interaction’ (Gu, 2006: 132). Here corpus construction is seen as an explorative exercise: to discover the breadth and depth of multi-modal datasets that can be assembled, and how this is best achieved.
     - Type 2: Social science motivated studies, focusing on ‘multi-modal and multimedia studies of discourse’, where the primary concern is the behaviour of ‘human beings’ (Gu, 2006: 132).
  4. Multi-modal corpora: types (based on Oertel et al., 2010)
  5. Limitations of multi-modal corpora
     - Most are relatively small in size. The largest, the AMI (business meeting) corpus, contains 100 hours of video, but only a small number of these hours have been transcribed/annotated.
     - The majority are recorded in lab-type settings, with fixed cameras and up to 4-5 static, seated participants.
     - Many are specialist and context/domain specific: the extent to which the data is ‘naturalistic’ or ‘real-life’ can be challenged.
     - The availability of data and software is generally limited to the institutions/funding bodies responsible for constructing the datasets.
  6. Considerations for multi-modal corpora
     - Design and infrastructure: What the data in the corpus looks like; what sorts of recordings are included; and the basic design methodology used to collect, compile and represent this data.
     - Size and scope: The amount of data (in terms of hours and/or word count) and the variation in the types included (in terms of the range of speakers, the different contexts included, and so on).
  7. Requirements for multi-modal corpora
     - Quality: The resolution of the video and audio recordings, and the level of detail in the annotations and/or metadata associated with them.
     - Authenticity: How ‘natural’ or real the data is; whether it is scripted and/or structured, or more spontaneous.
     - Availability and (re)usability: Access rights to the data; whether corpora are published and can be utilised and analysed by other researchers.
  8. DReSS II: Research aims and objectives
     - To produce digital records which combine familiar forms of data with computational recordings of interaction.
       - Not only to record novel forms of data, but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society.
       - Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society.
     - These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory.
  9. Ubiquitous (heterogeneous) corpora
  10. Data types
      - ‘Data’ types include (in progress):
        - SMS/MMS messaging
        - Blogging
        - Chat room and message board discourse
        - Email usage
        - Face-to-face situated discourse
        - GPS or manual map-based tracking
        - Web browsing activity (automated logging of sites)
        - Phone calls (home and/or mobile)
        - Text messaging
        - Video calls (mobile or online, e.g. Skype)
  11. A day in the life of your language
  12. Software requirements
      - The ability to search data and metadata (encoded and/or transcribed text-based data) in a principled and specific way, within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions.
      - Tools for frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing these).
      - Flexible provision for transcription, including the ability to represent, for example, simultaneous speech and speaker overlaps.
      - Graphing tools for mapping the incidence of words or events (for example, over time) and for comparing sub-corpora and domain-specific characteristics.
  13. DRS demo: some new developments
  14. Open questions
      - How can we ensure that the data is sufficient, informative and useful: how much is ‘enough’?
      - How can we model context in a way that is meaningful to humans as well as to computers?
      - How will the development of heterogeneous corpora impact on:
        - Theories
        - Concepts
        - Methods
        - Guidelines
  15. Acknowledgements. Research team. The Digital Records for e-Social Science Project is funded by the ESRC.
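
The search and frequency-profiling requirements listed under "Software requirements" can be illustrated with a minimal Python sketch. This is not DRS code and does not reflect how DRS is implemented: the `Record` fields, the function names and the sample data are all hypothetical, chosen only to mirror the three domains named on that slide (data type, time/location, participant).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
import re


@dataclass(frozen=True)
class Record:
    """One item in a hypothetical heterogeneous corpus (illustrative fields only)."""
    data_type: str       # e.g. "sms", "email", "face-to-face"
    timestamp: datetime  # when the contribution was made
    location: str        # where it was made
    participant: str     # who made it
    text: str            # transcribed or encoded text


def tokens(text):
    """Lower-case a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def search(records, term, data_type=None, participant=None):
    """Return records containing `term`, optionally restricted by domain."""
    term = term.lower()
    return [
        r for r in records
        if (data_type is None or r.data_type == data_type)
        and (participant is None or r.participant == participant)
        and term in tokens(r.text)
    ]


def frequency_profile(records, term, by="data_type"):
    """Raw counts of `term`, grouped by one Record attribute (a domain)."""
    term = term.lower()
    counts = Counter()
    for r in records:
        n = tokens(r.text).count(term)
        if n:
            counts[getattr(r, by)] += n
    return counts


def incidence_by_hour(records, term):
    """Raw counts of `term` per hour of day, ready to pass to a plotting tool."""
    term = term.lower()
    counts = Counter()
    for r in records:
        n = tokens(r.text).count(term)
        if n:
            counts[r.timestamp.hour] += n
    return counts


# Invented sample data, for illustration only.
SAMPLE = [
    Record("sms", datetime(2010, 6, 1, 8, 15), "home", "A", "Okay, see you later"),
    Record("email", datetime(2010, 6, 1, 9, 30), "office", "A",
           "Okay. The meeting is okay for Friday."),
    Record("face-to-face", datetime(2010, 6, 1, 13, 5), "cafe", "B",
           "Yeah okay, that works"),
    Record("sms", datetime(2010, 6, 1, 13, 40), "cafe", "A", "Running late"),
]

if __name__ == "__main__":
    print(frequency_profile(SAMPLE, "okay"))
    print(frequency_profile(SAMPLE, "okay", by="participant"))
    print(incidence_by_hour(SAMPLE, "okay"))
```

Grouping by an attribute name keeps the same function usable for each domain (`by="data_type"`, `by="location"`, `by="participant"`); a real system would also need the statistical and graphing layers that the slide mentions, which this sketch leaves out.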