- Knight, D., Carter, R. and Adolphs, S.

- Corpora and context: A discussion of ‘Thrill’.

  • Click the image to play a 2 minute(ish) Thrill video


  • 1. Corpora and context: A discussion of ‘Thrill’
    • Dawn Knight, Ronald Carter and Svenja Adolphs
  • 2. Corpus development at Nottingham
    • Mono-modal corpora
        • CANCODE
        • CANBEC
        • The Health Communication Corpus
    • Multi-modal corpora
        • The Nottingham Multi-Modal Corpus
        • The Nottingham Learner Corpora
    • Heterogeneous corpora
        • The Nottingham eLanguage Corpus (NeLC)
        • Feasibility corpora
  • 3. DRS: Digital Replay System (video demo)
  • 4. DReSS II: Research aims and objectives
    • To produce digital records that combine familiar forms of data with computational recordings of interaction
      • Not only to record novel forms of data but to develop means whereby social scientists can inspect the opaque character of social interaction and communication in the digital society
      • Development will be driven by an experimental project that seeks to explore a day in the life of a member of the digital society
    • These studies will be complemented by new forms of corpus analysis that go beyond existing techniques to ‘pump prime’ the development of the population observatory
  • 5. Nottingham eLanguage Corpus (NeLC)
    • ‘Data’ types include (in progress):
      • SMS/MMS messaging
      • Blogging
      • Chat room and message board discourse
      • Email Usage
      • Face-to-face situated discourse
      • GPS or manual map-based tracking
      • Web browsing activity (automated logging of sites)
      • Phone calls (home and/or mobile)
      • Text messaging
      • Video calls (mobile or online, e.g. Skype)
  • 6. Challenges
    • Being able to model the dynamic spatial and temporal context of social activity and to use the knowledge gained for predictive support of user applications is one of the deepest longstanding challenges within ubiquitous computing.
    • Within the field of (Corpus) Linguistics a parallel challenge is to account systematically for how our language varies from one context to another according to dynamic changes in the environment, in channels of communication and the social context of human interaction.
  • 7. Capturing extra-contextual information
    • Brown notes that the following aspects of a discursive situation are important to consider in order to help conceptualise context (1989: 98):
      • Features of the external situation of utterance: speaker/hearer, place/time, etc.
      • Background knowledge (particularly socio-cultural knowledge) which the learner brings to the interpretation of the discourse
      • Linguistic features of the discourse, particularly the range of vocabulary and syntactic structures, cohesiveness and rhetorical structures
  • 8. NeLC: SMS component
    • 10 logs so far (3M, 7F), totalling over 6,000 messages and 117,000 words; each log covers between 2 and 60 consecutive days.
      • Date and time sent/received
      • Identity of sender/receiver (age, occupation, basic relationship to participant)
      • Location when sent/received
      • Activity in location
      • Content of message
      • Linguistic function of the message
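The fields listed on this slide could be held in a simple record type. The following is a minimal sketch only, not the project's actual schema: the class names, field names and the example message are all hypothetical, chosen to mirror the slide's list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Correspondent:
    """The other party to a message (hypothetical structure)."""
    age: Optional[int]
    occupation: Optional[str]
    relationship: Optional[str]  # basic relationship to the participant


@dataclass
class SmsRecord:
    """One logged message plus the contextual fields named on the slide."""
    timestamp: datetime           # date and time sent/received
    direction: str                # "sent" or "received"
    correspondent: Correspondent  # identity of sender/receiver
    location: Optional[str]       # location when sent/received
    activity: Optional[str]       # activity in location
    content: str                  # content of message
    function: Optional[str]       # linguistic function of the message

    def word_count(self) -> int:
        # Whitespace tokenisation; a real corpus tool would likely do more.
        return len(self.content.split())


# Invented example record, not taken from the NeLC data.
example = SmsRecord(
    timestamp=datetime(2009, 1, 1, 18, 30),
    direction="sent",
    correspondent=Correspondent(age=None, occupation=None, relationship="friend"),
    location="bus",
    activity="travelling",
    content="running late see you at 7",
    function="informing",
)
print(example.word_count())
```

Keeping the contextual fields optional reflects the fact that participants may not supply every item for every message.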
  • 9.  
  • 10. Ubiquitous (heterogeneous) corpora
  • 11. A day in the life of your language
  • 12. Key issues and Challenges
    • A. Refine the processes of data collection and collation.
        • Automated extraction of content and/or metadata information (whole or part)
        • Create methods for organising information (primary and secondary, from raw video to codes) within and across media types and specific datasets
    • B. Develop methods for dataset interrogation and mining
    • C. Compose system(s) for (re)presenting data
  • 13. Software requirements- 1
    • The ability to search data and metadata (encoded and/or transcribed text-based data) in a principled and specific way, within and/or across the three global domains of data: devices/data type(s); time and/or location; and participants/given contributions.
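As an illustration of this requirement, a search over text that can be narrowed by each of the three domains might look like the sketch below. It is an assumption-laden toy, not the DRS implementation: the `Item` fields, the `search` function and the sample items are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class Item:
    """One corpus item; all field names are illustrative assumptions."""
    data_type: str       # device / data type, e.g. "sms" or "email"
    timestamp: datetime  # time domain
    location: str        # location domain
    participant: str     # participant domain
    text: str            # transcribed or encoded text


def search(items: Iterable[Item], term: str,
           data_type: Optional[str] = None,
           participant: Optional[str] = None,
           location: Optional[str] = None,
           start: Optional[datetime] = None,
           end: Optional[datetime] = None) -> List[Item]:
    """Return items containing `term`, restricted by any supplied domain filter."""
    needle = term.lower()
    hits = []
    for item in items:
        # Plain substring match: a simplification ("see" would also hit "seen").
        if needle not in item.text.lower():
            continue
        if data_type is not None and item.data_type != data_type:
            continue
        if participant is not None and item.participant != participant:
            continue
        if location is not None and item.location != location:
            continue
        if start is not None and item.timestamp < start:
            continue
        if end is not None and item.timestamp > end:
            continue
        hits.append(item)
    return hits


# Invented sample items for demonstration only.
items = [
    Item("sms", datetime(2009, 3, 1, 9, 0), "home", "P01", "See you at the station"),
    Item("email", datetime(2009, 3, 1, 14, 0), "office", "P01", "Please see the attached notes"),
    Item("sms", datetime(2009, 3, 2, 20, 0), "home", "P02", "did you see that"),
]
across_all = search(items, "see")
within_sms_p01 = search(items, "see", data_type="sms", participant="P01")
print(len(across_all), len(within_sms_p01))
```

Leaving every filter optional is what allows the same call to work within one domain or across all three.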
  • 14. Software requirements- 2
    • Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods for graphing the results).
    • New methods for drilling into the data, through mining specific relationships within and between domain(s). This may be comparable to current social networking software, mind maps or more topologically based methods.
    • Graphing tools for mapping the incidence of words or events, for example, over time and for comparing sub-corpora and domain specific characteristics.
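A frequency profile of the kind described here can be sketched in a few lines. The example below reports a raw count and a rate per 1,000 tokens for one word in each sub-corpus, so that sub-corpora of different sizes can be compared. The function names, the tokenisation rule and the toy texts are assumptions made for illustration; they are not part of the project's software.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Lower-case and keep runs of letters/apostrophes; a deliberately crude tokeniser.
    return re.findall(r"[a-z']+", text.lower())


def frequency_profile(subcorpora: Dict[str, List[str]], word: str) -> Dict[str, dict]:
    """Raw count and per-1,000-token rate of `word` in each named sub-corpus."""
    profile = {}
    for name, texts in subcorpora.items():
        tokens = [tok for text in texts for tok in tokenize(text)]
        raw = Counter(tokens)[word.lower()]
        rate = 1000 * raw / len(tokens) if tokens else 0.0
        profile[name] = {"raw": raw, "tokens": len(tokens), "per_1000": rate}
    return profile


# Invented toy sub-corpora, not project data.
toy = {
    "sms": ["see you later", "ok see you"],
    "email": ["Please see the attached file"],
}
profile = frequency_profile(toy, "see")
for name, row in profile.items():
    print(name, row["raw"], round(row["per_1000"], 1))
```

The per-1,000 rates are the values one would pass to a plotting tool to graph incidence across sub-corpora or over time.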
  • 15. DRS Demo- some new developments
  • 16. Feasibility Corpora: Thrill
    • A 55,000-word corpus of fairground discourse, comprising synchronised records of audio, video and sensory (i.e. heart rate) data.
      • 55 participants (mainly recorded in pairs)
      • 19 women, 26 men
      • Ages range from teens to late 50s
      • Over 11 hours video
  • 17. Feasibility Corpora: Thrill
    • Data has been transcribed and divided into 4 key phases:
      • Pre-ride phase (i.e. walking around the theme park)
      • The elevation of the ride
      • Start of the ride
      • Ride terminus
    • Aims:
      • To examine whether any patterns emerge in specific language used within/ across these phases.
      • To outline and test an appropriate approach to the analysis of heterogeneous data sets for linguistic enquiry.
  • 18. Feasibility Corpora: Thrill
  • 19.  
  • 20.  
  • 21.  
  • 22.  
  • 23.  
  • 24. Observations of Thrill: (oh) my god
    • (Oh) my god is used 85 times by 21 different speakers in the corpus.
    • It occurs most often at phases 2 and 3 of the ride (ride elevation and movement).
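Counts of this kind, occurrences per ride phase and the number of distinct speakers, can be obtained with a short script. The sketch below is illustrative only: the utterance tuples are invented rather than drawn from the Thrill corpus, and the `(speaker, phase, text)` layout and the function name are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple

# Matches "my god" with an optional preceding "oh", ignoring case.
PATTERN = re.compile(r"\b(?:oh\s+)?my\s+god\b", re.IGNORECASE)


def phrase_by_phase(utterances: Iterable[Tuple[str, int, str]]) -> Tuple[Counter, Set[str]]:
    """Count matches per phase and collect the speakers who produce them."""
    counts: Counter = Counter()
    speakers: Set[str] = set()
    for speaker, phase, text in utterances:
        n = len(PATTERN.findall(text))
        if n:
            counts[phase] += n
            speakers.add(speaker)
    return counts, speakers


# Invented utterances: (speaker id, phase number 1-4, transcribed text).
sample = [
    ("S1", 1, "shall we go on this one"),
    ("S1", 2, "oh my god oh my god"),
    ("S2", 2, "my god it's high"),
    ("S2", 3, "Oh my God!"),
    ("S1", 4, "that was brilliant"),
]
counts, speakers = phrase_by_phase(sample)
print(dict(counts), len(speakers))
```

Treating "oh" as optional in the pattern mirrors the slide's "(oh) my god" notation, so both variants are counted as one item.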
  • 25. Thrill- summary comments
    • The analysis of this data provides a good, albeit crude, starting point for outlining an approach to the analysis of word use/linguistic patterning across different forms of media and in terms of time, space and place.
    • We plan to look in more detail at the use of deictic markers within/across each phase of the ride data. We also intend to conduct a more in-depth analysis of the relationship between linguistic behaviour and biometric responses.
  • 26. Open questions
    • How can we ensure that the data is sufficient, informative and useful? How much is ‘enough’?
    • How can we model context in a way that is meaningful to humans as well as to computers?
    • How will the development of heterogeneous corpora impact on:
        • Theories
        • Concepts
        • Methods
        • Guidelines
  • 27. Acknowledgements
    • Research team
    • The Digital Records for e-Social Science Project is funded by the ESRC.