Discourse on Disk: Tales from the Scottish Corpus, David Beavan, University of Glasgow, 2007


Discourse on Disk: Tales from the Scottish Corpus. The story of building the Scottish Corpus of Texts and Speech (http://www.scottishcorpus.ac.uk/), presented in 2007.
Web site: http://www.scottishcorpus.ac.uk/

Published in: Education, Technology

  1. Dave Beavan, Department of English Language, University of Glasgow, www.scottishcorpus.ac.uk
  2. Discourse on Disk: Tales from the Scottish Corpus
  3. Scottish Corpus
     • Scottish Corpus of Texts & Speech (SCOTS)
     • Modern day (1945 to present)
     • Scottish English and Scots
     • Written and spoken documents
     • Wide range of genres
     • 4,000,000 word corpus
     • 20% (800,000 words) spoken
  4. Scottish Corpus
     • Externally funded
         • 2 years Engineering & Physical Sciences Research Council (EPSRC), in collaboration with the Language Technology Group, University of Edinburgh
         • 3 years Arts & Humanities Research Council (AHRC)
     • Project completed June 2007
  5. The team
  6. Distinctive features
     • Free: no cost
     • Free: no access limitations
     • Entirely internet delivered
     • Must solicit all corpus contents
     • Legal concerns
         • Copyright
         • Data protection
  7. Comparison (values listed in slide order; columns: SCOTS, ICE-GB, BNC)
     • Cost: Free, £750, £500
     • Words: 4m, 1m, 100m
     • Regional: ✓
     • Spoken: ✓, ✓, ✓
     • Audio: ✓, ✓
     • CD-ROM: ✓, ✓
     • Internet: ✓, *
     * 3rd party tools available
  8. Corpus building
     • Contact management
     • Tracking of paperwork
     • Mail merges
     • Analysis and ad-hoc reports
     • Ensure adherence to legal issues
     • Entry of corpus data
     • All under one roof
  9. Website
     • Access to corpus data
     • Search and browse functions
     • Common analysis features
     • Focus on usability
     • Employ new web technologies
     • Mash it up
  10. Transcriptions
     • Orthographic transcription
     • Multiple speakers
         • Overlap
         • Interruption
     • Synchronise
         • Orthographic transcription
         • Audio/video footage
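The synchronisation idea on this slide can be sketched as a list of time-stamped utterances. This is a minimal illustration, not the SCOTS implementation: the `Utterance` record, the speaker labels, the timings, and the placeholder texts are all invented for the example. It shows how overlapping speech can be detected and how the text for a given playback time can be looked up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds; the utterance covers start <= t < end
    text: str


def active_at(utterances: List[Utterance], t: float) -> List[Utterance]:
    """Return the utterances being spoken at playback time t."""
    return [u for u in utterances if u.start <= t < u.end]


def overlaps(utterances: List[Utterance]) -> List[Tuple[Utterance, Utterance]]:
    """Return pairs of utterances by different speakers whose time spans intersect."""
    ordered = sorted(utterances, key=lambda u: u.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # every later utterance starts later still
            if a.speaker != b.speaker:
                pairs.append((a, b))
    return pairs


# Invented sample data: M1 starts speaking before F1 has finished.
transcript = [
    Utterance("F1", 0.0, 2.5, "first turn"),
    Utterance("M1", 2.0, 4.0, "second turn"),
    Utterance("F1", 4.0, 5.0, "third turn"),
]

for a, b in overlaps(transcript):
    print(f"{b.speaker} overlaps {a.speaker} from {b.start}s to {min(a.end, b.end)}s")
```

A media player could call `active_at` with its current playback position to highlight the matching lines of the transcription.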
  11. Maps
     • Geographic data held
         • Place of birth/residence
         • Place of birth for mother/father
     • Normalise locations
     • Source map imagery
     • Incorporate navigation
     • Google Maps mashup
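One way to prepare such geographic data for a map overlay is to count records per normalised place and emit the result as GeoJSON points. The sketch below is only an assumption about how this could be done: the record layout, the `birthplace` field name, and the `PLACE_COORDS` table (with rough, illustrative coordinates) are hypothetical, and loading the output into a mapping library is left out.

```python
import json
from collections import Counter

# Approximate (latitude, longitude) pairs, for illustration only.
PLACE_COORDS = {
    "Glasgow": (55.86, -4.25),
    "Aberdeen": (57.15, -2.10),
}


def markers_geojson(records, field="birthplace"):
    """Count records per known place and return a GeoJSON FeatureCollection."""
    counts = Counter(r[field] for r in records if r.get(field) in PLACE_COORDS)
    features = []
    for place, n in sorted(counts.items()):
        lat, lon = PLACE_COORDS[place]
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": place, "count": n},
        })
    return {"type": "FeatureCollection", "features": features}


# Invented participant records; unknown or missing places are skipped.
records = [
    {"id": 1, "birthplace": "Glasgow"},
    {"id": 2, "birthplace": "Aberdeen"},
    {"id": 3, "birthplace": "Glasgow"},
    {"id": 4, "birthplace": None},
]

print(json.dumps(markers_geojson(records), indent=2))
```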
  12. Mashup +
  13. Mashup + =
  14. Is this Glasgow?
     • Flat x/x xx Dumbarton Road
     • xx Fergus Drive Partick
     • North Kelvinside North Lanarkshire
     • Glasgow Gxx xxx
     • Strathclyde
     • Gxx xxx xx Great Western Road Anniesland Glasgow Gxx xxx
  15. Maps
     • Locations
         • Postcodes too local
         • Counties too general
         • Cities/towns/villages good compromise
     • Ordnance Survey Gazetteer
         • Automatically find equivalents
         • Manually assign ‘awkward’ cases
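The two-stage approach on this slide (automatic gazetteer matching, then manual assignment of awkward cases) might look roughly like the sketch below. It is not the project's code: `GAZETTEER` is a tiny hypothetical stand-in for the Ordnance Survey Gazetteer, `MANUAL_OVERRIDES` is an invented table of hand-assigned districts, and the sample addresses are made up in the style of the previous slide.

```python
import re

# Hypothetical stand-in for a gazetteer of cities, towns and villages.
GAZETTEER = {"glasgow": "Glasgow", "aberdeen": "Aberdeen", "edinburgh": "Edinburgh"}

# Hypothetical hand-assigned cases: districts mapped to their city.
MANUAL_OVERRIDES = {
    "partick": "Glasgow",
    "anniesland": "Glasgow",
    "north kelvinside": "Glasgow",
}


def find_name(text, names):
    """Return the value for the first name that occurs as a whole phrase in text."""
    for name, place in names.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return place
    return None


def normalise(address):
    """Map a free-text address to a city/town/village, or None if unresolved."""
    text = address.lower()
    # Stage 1: automatic match against the gazetteer.
    place = find_name(text, GAZETTEER)
    if place is None:
        # Stage 2: fall back to the manually assigned awkward cases.
        place = find_name(text, MANUAL_OVERRIDES)
    return place


for sample in ["xx Dumbarton Road, Partick",
               "xx Great Western Road, Anniesland, Glasgow",
               "North Kelvinside"]:
    print(sample, "->", normalise(sample))
```

Addresses that resolve to `None` would be the ones queued for manual assignment. A street named after a town (for example a "Glasgow Road" elsewhere) would defeat this simple matching, which is one reason a manual stage is needed.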
  16. Public awareness
     • Media
         • Press releases
         • Magazine articles
         • Radio interviews
         • Online articles
     • Web
         • Open up entire corpus to search engines
         • Submit site to directories and catalogues
  17. Popular authors
     • burns 99852 66
     • begg 22
     • corbett 15
     • leonard 15
     • the driver 15
     • blackhall 14
     • gibbon 14
     • scott 14
     • harrison 12
  18. Popular documents
     • SKARRS 1599
     • een fur d pyoorists 1490
     • The Annals of Arbuthnott Part One 763
     • A Blether an wee Bevvy 703
     • Buchan Words and Ways 626
     • A Bit of a Wally 554
     • BBC Voices Recording: Glasgow 542
     • Learning Through Reading and Writing: Informational Texts 535
     • A Sair Fecht 508
     • Katie Morag and the Two Grandmothers 476
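Rankings like the two above can be produced by tallying entries in an access log. The sketch below assumes a hypothetical log in which each entry is simply the title of a viewed document. The real log format is not described in the slides, and the sample entries are invented apart from reusing three titles from the list.

```python
from collections import Counter


def top_items(log_entries, n=10):
    """Return the n most frequent entries as (item, count) pairs, highest first."""
    return Counter(log_entries).most_common(n)


# Invented sample log: one entry per document view.
views = [
    "SKARRS",
    "A Sair Fecht",
    "SKARRS",
    "Buchan Words and Ways",
    "SKARRS",
    "A Sair Fecht",
]

for title, count in top_items(views, n=3):
    print(f"{title} {count}")
```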
  19. Web hits [chart: vertical axis from 0 to 1,000,000 hits; horizontal axis from Aug-04 to Nov-07]
  20. Emails
     “I'm curious as to why none of my Scots work is included in the corpus. I'm fairly well known as a Scots-writing poet and anthologist. I recorded poems for <person> some years ago and assumed they would be here. Maybe I'm not searching properly but <surname> seems to be a non-person here.”
     “<long list of books> … Which one of us is missing these wonderful works? I hope it's me for if it's you this corpus is sorely under developed.”
  21. Emails
     “I don't know how to spell the word - it sounds like "bahouchie" and I think it means "behind" or rudely - "bum". What is the correct spelling and is the meaning correct?”
     “This is just a note to thank the people who worked so very hard to bring this online resource to the world. It's a fabulous help, and I, for one, am incredibly grateful! Thank you.”
     “Think I'm going to have to download the whole thing very very soon...”