Discourse on Disk: Tales from the Scottish Corpus, David Beavan, University of Glasgow, 2007

Discourse on Disk: Tales from the Scottish Corpus. The story of building the Scottish Corpus of Texts and Speech, from 2007.

Published in: Education, Technology

  • 1. Dave Beavan, Department of English Language, University of Glasgow
  • 2. Discourse on disk: Tales from the Scottish Corpus
  • 3. Scottish Corpus
      • Scottish Corpus of Texts & Speech (SCOTS)
      • Modern day (1945 to present)
      • Scottish English and Scots
      • Written and spoken documents
      • Wide range of genres
      • 4,000,000 word corpus
      • 20% (800,000 word) spoken
  • 4. Scottish Corpus
      • Externally funded
          • 2 years Engineering & Physical Sciences Research Council (EPSRC) in collaboration with Language Technology Group, University of Edinburgh
          • 3 years Arts & Humanities Research Council (AHRC)
      • Project completed June 2007
  • 5. The team
  • 6. Distinctive features
      • Free – no cost
      • Free – no access limitations
      • Entirely internet delivered
      • Must solicit all corpus contents
      • Legal concerns
          • Copyright
          • Data protection
  • 7. Comparison (SCOTS, ICE-GB, BNC)
      • Cost: Free, £750, £500
      • Words: 4m, 1m, 100m
      • Regional: ✓
      • Spoken: ✓ ✓ ✓
      • Audio: ✓ ✓
      • CD-ROM: ✓ ✓
      • Internet: ✓ *
      • * 3rd party tools available
  • 8. Corpus building
      • Contact management
      • Tracking of paperwork
      • Mail merges
      • Analysis and ad-hoc reports
      • Ensure adherence to legal issues
      • Entry of corpus data
      • All under one roof
  • 9. Website
      • Access to corpus data
      • Search and browse functions
      • Common analysis features
      • Focus on usability
      • Employ new web technologies
      • Mash it up
  • 10. Transcriptions
      • Orthographic transcription
      • Multiple speakers
          • Overlap
          • Interruption
      • Synchronise
          • Orthographic transcription
          • Audio/video footage
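The transcription slide mentions multiple speakers, overlap, and synchronising text with audio/video. One way this could be modelled is sketched below. This is only an illustration, not the SCOTS project's actual data format: the `Utterance` class, its field names, and the sample timings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One stretch of transcribed speech (hypothetical model, not the SCOTS schema)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def utterances_at(utterances, t):
    """Return the utterances being spoken at playback time t (in seconds)."""
    return [u for u in utterances if u.start <= t < u.end]


def overlaps(utterances):
    """Return pairs of utterances by different speakers whose time spans intersect."""
    ordered = sorted(utterances, key=lambda u: u.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start time, so nothing later can overlap a
            if a.speaker != b.speaker:
                pairs.append((a, b))
    return pairs


# Made-up sample: speaker M1 starts talking before F1 has finished.
sample = [
    Utterance("F1", 0.0, 2.5, "(first utterance)"),
    Utterance("M1", 2.0, 4.0, "(second utterance)"),
    Utterance("F1", 4.0, 6.0, "(third utterance)"),
]
```

With time spans stored per utterance, a player could call `utterances_at` on each playback tick to highlight the current text, and `overlaps` to mark where speakers talk over each other.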
  • 11. Maps
      • Geographic data held
          • Place of birth/residence
          • Place of birth for mother/father
      • Normalise locations
      • Source map imagery
      • Incorporate navigation
      • Google Maps mashup
  • 12. Mashup +
  • 13. Mashup + =
  • 14. Is this Glasgow? Flat x/x xx Dumbarton Road xx Fergus Drive Partick North Kelvinside North Lanarkshire Glasgow Gxx xxx Strathclyde Gxx xxx xx Great Western Road Anniesland Glasgow Gxx xxx
  • 15. Maps
      • Locations
          • Postcodes too local
          • Counties too general
          • Cities/towns/villages good compromise
      • Ordnance Survey Gazetteer
          • Automatically find equivalents
          • Manually assign ‘awkward’ cases
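The two-step approach on this slide (automatic gazetteer lookup, then manual assignment of awkward cases) might look roughly like the sketch below. Everything here is an assumption for illustration: the tiny `GAZETTEER` set stands in for the Ordnance Survey data, `MANUAL_ASSIGNMENTS` is a hypothetical override table, and the "prefer the last match" rule is one possible heuristic, not the project's documented method.

```python
import re

# Hypothetical stand-in for a gazetteer of cities, towns and villages.
GAZETTEER = {"aberdeen", "dumbarton", "edinburgh", "glasgow"}

# Hypothetical manual assignments for 'awkward' cases, e.g. city districts.
MANUAL_ASSIGNMENTS = {
    "partick": "glasgow",
    "anniesland": "glasgow",
    "north kelvinside": "glasgow",
}


def _last_position(name, text):
    """Start index of the last whole-word occurrence of name in text, or -1."""
    matches = list(re.finditer(rf"\b{re.escape(name)}\b", text))
    return matches[-1].start() if matches else -1


def normalise_location(raw):
    """Map a free-text address to a town-level place name, or None if unmatched."""
    # Lowercase, drop digits and punctuation, collapse whitespace.
    text = " ".join(re.sub(r"[^a-z\s]", " ", raw.lower()).split())

    # Manual assignments take priority over automatic matching.
    for alias, place in MANUAL_ASSIGNMENTS.items():
        if _last_position(alias, text) >= 0:
            return place

    # Addresses usually run from specific to general, so prefer the gazetteer
    # name that appears latest (a street called "Dumbarton Road" loses to the
    # town named after it).
    best, best_pos = None, -1
    for place in GAZETTEER:
        pos = _last_position(place, text)
        if pos > best_pos:
            best, best_pos = place, pos
    return best
```

For example, `normalise_location("xx Dumbarton Road, Glasgow Gxx xxx")` resolves to `"glasgow"` rather than `"dumbarton"`, while an address with no recognised name returns `None` and would be left for manual review.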
  • 16. Public awareness
      • Media
          • Press releases
          • Magazine articles
          • Radio interviews
          • Online articles
      • Web
          • Open up entire corpus to search engine
          • Submit site in directories and catalogues
  • 17. Popular authors
      • burns 998
      • 52 66
      • begg 22
      • corbett 15
      • leonard 15
      • the driver 15
      • blackhall 14
      • gibbon 14
      • scott 14
      • harrison 12
  • 18. Popular documents
      • SKARRS 1599
      • een fur d pyoorists 1490
      • The Annals of Arbuthnott Part One 763
      • A Blether an wee Bevvy 703
      • Buchan Words and Ways 626
      • A Bit of a Wally 554
      • BBC Voices Recording: Glasgow 542
      • Learning Through Reading and Writing: Informational Texts 535
      • A Sair Fecht 508
      • Katie Morag and the Two Grandmothers 476
  • 19. Web hits (chart: vertical axis 0 to 1,000,000; horizontal axis Aug-04 to Nov-07)
  • 20. Emails
      • “I'm curious as to why none of my Scots work is included in the corpus. I'm fairly well known as a Scots-writing poet and anthologist. I recorded poems for <person> some years ago and assumed they would be here. Maybe I'm not searching properly but <surname> seems to be a non-person here.”
      • “<long list of books> … Which one of us is missing these wonderful works? I hope it's me for if it's you this corpus is sorely under developed.”
  • 21. Emails
      • “I don't know how to spell the word - it sounds like "bahouchie" and I think it means "behind" or rudely - "bum". What is the correct spelling and is the meaning correct?”
      • “This is just a note to thank the people who worked so very hard to bring this online resource to the world. It's a fabulous help, and I, for one, am incredibly grateful! Thank you.”
      • “Think I'm going to have to download the whole thing very very soon...”