IVACS Symposium 2010
- Knight, D.

- Language, Corpora and Context: A 'Thrilling' Case Study.

  • SVENJA
  • Click the image to play a 2 minute(ish) Thrill video
  • Transcript

    • 1. Language, Corpora and Context: A ‘Thrilling’ Case Study. Dawn Knight, The University of Nottingham
    • 2. Corpus development at Nottingham
      • Mono-modal corpora
        • CANCODE
        • CANBEC
        • The Health Communication Corpus
      • Multi-modal corpora
        • The Nottingham Multi-Modal Corpus
        • The Nottingham Learner Corpora
      • Heterogeneous corpora
        • The Nottingham eLanguage Corpus (NeLC)
        • Feasibility corpora
    • 3. Challenges
      • Being able to model the dynamic spatial and temporal context of social activity, and to use the knowledge gained for predictive support of user applications, is one of the deepest longstanding challenges within ubiquitous computing.
      • Within the field of (Corpus) Linguistics, a parallel challenge is to account systematically for how our language varies from one context to another according to dynamic changes in the environment, in channels of communication and in the social context of human interaction.
    • 4. NeLC: SMS component
      • 5 logs so far (3692 SMSs / circa 74,000 words), currently being anonymised:
        • F1: 1870 SMSs, 37,772 words over 50 days
        • F2: 1505 SMSs, 32,137 words over 55 days
        • F3: 100 SMSs, 1879 words over 20 days
        • F4: 52 SMSs, 732 words over 2 days
        • M1: 165 SMSs, 1674 words over 50 days
      • Logs detail the following:
        • Content, date, time, sender and receiver details (age, gender, occupation, relationship), location and activity in location
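
The fields that each SMS log records could be modelled roughly as below. This is only a sketch: the class and field names (`Participant`, `SmsRecord`, `sent_at` and so on) are hypothetical and are not taken from the NeLC project, and only the per-log figures come from the slide. The last lines check that those figures add up to the stated totals.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Participant:
    # Sender/receiver details named on the slide; field names are illustrative.
    age: Optional[int] = None
    gender: Optional[str] = None
    occupation: Optional[str] = None
    relationship: Optional[str] = None  # relationship to the log owner


@dataclass
class SmsRecord:
    content: str                     # the message text
    sent_at: datetime                # date and time
    sender: Participant
    receiver: Participant
    location: Optional[str] = None
    activity: Optional[str] = None   # activity in that location


# Per-log figures as reported on the slide: (SMS count, word count, days).
LOGS = {
    "F1": (1870, 37772, 50),
    "F2": (1505, 32137, 55),
    "F3": (100, 1879, 20),
    "F4": (52, 732, 2),
    "M1": (165, 1674, 50),
}

total_sms = sum(sms for sms, _, _ in LOGS.values())
total_words = sum(words for _, words, _ in LOGS.values())
print(total_sms, total_words)  # 3692 74194
```

The summed word count (74,194) agrees with the "circa 74,000 words" quoted for the five logs.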
    • 5. NeLC: SNSs
      • 2589 updates from 167 contributors in 1 user log (circa 31,300 words over 50 days)
      • Content from 43 ‘celebrity’ updates extracted over a 50-day period: 16,857 in total (more than 250,000 words)
      • Both forms of logs include:
        • Time and date submitted, contributor information (gender, age, occupation), network/location and content (i.e. the text)
    • 6. NeLC: Other text-based data
      • Blogs and feedback forums taken from newspaper and broadcasting websites: data has been extracted but not yet processed.
      • Possible plans for collecting email data, in collaboration with the extended ‘a day in the life of your language’ data collection.
      • Texts require anonymising and (possibly) standardising? Also ‘separating’?
    • 7. A Day in the Life of your Language
      • Associated metadata also recorded
      • Data requires transcription
    • 8. Representing data: requirements
      • The ability to search data and metadata in a principled and specific way (encoded and/or transcribed text-based data), within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions (i.e. the production of ‘sub-corpora’)
      • Tools that allow for the frequency profiling of events/elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing such)
      • New methods for drilling into the data, through mining specific relationships within and between domain(s). This may be comparable to current social networking software, mind maps or more topologically based methods
      • Graphing tools for mapping the incidence of words or events, for example, over time, and for comparing sub-corpora and domain-specific characteristics
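
A minimal sketch of the first, second and fourth requirements above, assuming each corpus item carries flat metadata fields. The `Item` class, its field names and the helper functions are hypothetical, and the three sample texts are invented for illustration; none of this is taken from DRS or NeLC.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text: str
    data_type: str    # device/data type, e.g. "sms" or "sns"
    date: str         # ISO date string
    location: str
    contributor: str


def tokenize(text):
    """Lower-case a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sub_corpus(items, **criteria):
    """Return the items whose metadata match every keyword criterion."""
    return [item for item in items
            if all(getattr(item, key) == value for key, value in criteria.items())]


def frequency_profile(items):
    """Raw token counts over the given items."""
    counts = Counter()
    for item in items:
        counts.update(tokenize(item.text))
    return counts


def incidence_over_time(items, word):
    """Occurrences of one word per date, ready to pass to a graphing tool."""
    per_day = Counter()
    for item in items:
        per_day[item.date] += tokenize(item.text).count(word)
    return dict(sorted(per_day.items()))


# Invented sample items, NOT real corpus data.
items = [
    Item("see you at the station", "sms", "2010-01-05", "Nottingham", "F1"),
    Item("running late see you soon", "sms", "2010-01-06", "Nottingham", "F1"),
    Item("great day at the park", "sns", "2010-01-06", "Derby", "M1"),
]

sms = sub_corpus(items, data_type="sms")
print(frequency_profile(sms).most_common(2))
print(incidence_over_time(items, "see"))
```

Keyword criteria can be combined, for example `sub_corpus(items, data_type="sms", contributor="F1")`, to narrow a sub-corpus across more than one domain at once.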
    • 9. DRS: Digital Replay System (video demo)
    • 10. Future developments in DRS: mock-up 1
    • 11. Future developments in DRS: mock-up 1
    • 12. Feasibility Corpora: Thrill
      • A 55,000-word corpus of fairground discourse, comprised of synchronised records of audio, video and sensory (i.e. heart rate) data.
        • 55 participants (mainly recorded in pairs)
        • 19 women, 26 men
        • Ages range from teens to late 50s
        • Over 11 hours video
    • 13. Feasibility Corpora: Thrill
      • Data has been transcribed and divided into 4 key phases:
        • Pre-ride phase (i.e. walking around the theme park)
        • The elevation of the ride
        • Start of the ride
        • Ride terminus
      • Aims:
        • To examine whether any patterns emerge in the specific language used within/across these phases.
        • To outline and test an appropriate approach to the analysis of heterogeneous data sets for linguistic enquiry.
    • 14. Feasibility Corpora: Thrill
    • 16. Frequency based word clouds
      • Phase 1: Frequency based word clouds
      • Phase 2: Frequency based word clouds
      • Phase 3: Frequency based word clouds
      • Phase 4: Frequency based word clouds
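
The word lists behind frequency-based clouds like these could be produced roughly as follows. This is a sketch under stated assumptions, not the method used in the talk: the phase labels paraphrase the four phases in the transcript, the function names are hypothetical, and the tokens are invented rather than drawn from the Thrill corpus. Counts are normalised per 1,000 tokens so that phases of different lengths can be compared.

```python
from collections import Counter

# Labels paraphrase the four phases listed in the transcript.
PHASES = ("pre-ride", "elevation", "start", "terminus")


def relative_frequencies(tokens, per=1000):
    """Frequency of each token per `per` tokens of the phase."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * per / total for word, count in counts.items()}


def top_words(phase_tokens, n=3):
    """The n most frequent words in each phase, ties broken alphabetically."""
    result = {}
    for phase, tokens in phase_tokens.items():
        freqs = relative_frequencies(tokens)
        ranked = sorted(freqs.items(), key=lambda pair: (-pair[1], pair[0]))
        result[phase] = ranked[:n]
    return result


# Invented tokens, NOT taken from the Thrill corpus.
toy = {
    "pre-ride": ["shall", "we", "go", "on", "that", "one", "we", "could"],
    "elevation": ["oh", "no", "oh", "no", "here", "we", "go"],
    "start": ["oh", "oh", "oh", "wow"],
    "terminus": ["that", "was", "great", "that", "was", "fast"],
}

ranked = top_words(toy, n=2)
for phase in PHASES:
    print(phase, ranked[phase])
```

The ranked pairs can then be handed to any word-cloud or charting tool, with the relative frequency used as the size of each word.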
    • 17. Next Steps
      • More detailed analyses/comparisons of the data.
      • Extend the analyses, exploring more varied datasets from NeLC and ‘A day in the life of your language’.
      • Collect more data (using a wider cross section of different participants), using the mobile toolkit.
    • 18. Acknowledgements
      • Research team
      • The Digital Records for e-Social Science Project is funded by the ESRC.
