Discourse on Disk: tales from the scottish corpus, David Beavan, University of Glasgow, 2007

Discourse on Disk: tales from the scottish corpus. The story of building the Scottish Corpus of Texts and Speech (http://www.scottishcorpus.ac.uk/), from 2007.
Web site: http://www.scottishcorpus.ac.uk/

Published in: Education, Technology

Transcript

  • 1. Dave Beavan, Department of English Language, University of Glasgow, www.scottishcorpus.ac.uk
  • 2. Discourse on disk: Tales from the Scottish Corpus
  • 3. Scottish Corpus
      § Scottish Corpus of Texts & Speech (SCOTS)
      § Modern day (1945 to present)
      § Scottish English and Scots
      § Written and spoken documents
      § Wide range of genres
      § 4,000,000 word corpus
      § 20% (800,000 word) spoken
  • 4. Scottish Corpus
      § Externally funded
          • 2 years Engineering & Physical Sciences Research Council (EPSRC) in collaboration with Language Technology Group, University of Edinburgh
          • 3 years Arts & Humanities Research Council (AHRC)
      § Project completed June 2007
  • 5. The team
  • 6. Distinctive features
      § Free – no cost
      § Free – no access limitations
      § Entirely internet delivered
      § Must solicit all corpus contents
      § Legal concerns
          • Copyright
          • Data protection
  • 7. Comparison
      Corpus: SCOTS | ICE-GB | BNC
      Cost: Free | £750 | £500
      Words: 4m | 1m | 100m
      Regional: ✓
      Spoken: ✓ ✓ ✓
      Audio: ✓ ✓
      CD-ROM: ✓ ✓
      Internet: ✓ *
      * 3rd party tools available
  • 8. Corpus building
      § Contact management
      § Tracking of paperwork
      § Mail merges
      § Analysis and ad-hoc reports
      § Ensure adherence to legal issues
      § Entry of corpus data
      § All under one roof
  • 9. Website
      § Access to corpus data
      § Search and browse functions
      § Common analysis features
      § Focus on usability
      § Employ new web technologies
      § Mash it up
  • 10. Transcriptions
      § Orthographic transcription
      § Multiple speakers
          • Overlap
          • Interruption
      § Synchronise
          • Orthographic transcription
          • Audio/video footage
  • 11. Maps
      § Geographic data held
          • Place of birth/residence
          • Place of birth for mother/father
      § Normalise locations
      § Source map imagery
      § Incorporate navigation
      § Google Maps mashup
  • 12. Mashup +
  • 13. Mashup + =
  • 14. Is this Glasgow?
      Flat x/x xx Dumbarton Road; xx Fergus Drive; Partick; North Kelvinside; North Lanarkshire; Glasgow Gxx xxx; Strathclyde; Gxx xxx; xx Great Western Road Anniesland Glasgow Gxx xxx
  • 15. Maps
      § Locations
          • Postcodes too local
          • Counties too general
          • Cities/towns/villages good compromise
      § Ordnance Survey Gazetteer
          • Automatically find equivalents
          • Manually assign ‘awkward’ cases
  • 16. Public awareness
      § Media
          • Press releases
          • Magazine articles
          • Radio interviews
          • Online articles
      § Web
          • Open up entire corpus to search engine
          • Submit site in directories and catalogues
  • 17. Popular authors
      burns 99852 66
      begg 22
      corbett 15
      leonard 15
      the driver 15
      blackhall 14
      gibbon 14
      scott 14
      harrison 12
  • 18. Popular documents
      SKARRS 1599
      een fur d pyoorists 1490
      The Annals of Arbuthnott Part One 763
      A Blether an wee Bevvy 703
      Buchan Words and Ways 626
      A Bit of a Wally 554
      BBC Voices Recording: Glasgow 542
      Learning Through Reading and Writing: Informational Texts 535
      A Sair Fecht 508
      Katie Morag and the Two Grandmothers 476
  • 19. Web hits [chart: hits from 0 to 1,000,000 on the vertical axis; Aug-04 to Nov-07 on the horizontal axis]
  • 20. Emails
      “I'm curious as to why none of my Scots work is included in the corpus. I'm fairly well known as a Scots-writing poet and anthologist. I recorded poems for <person> some years ago and assumed they would be here. Maybe I'm not searching properly but <surname> seems to be a non-person here.”
      “<long list of books> … Which one of us is missing these wonderful works? I hope it's me for if it's you this corpus is sorely under developed.”
  • 21. Emails
      “I don't know how to spell the word - it sounds like "bahouchie" and I think it means "behind" or rudely - "bum". What is the correct spelling and is the meaning correct?”
      “This is just a note to thank the people who worked so very hard to bring this online resource to the world. It's a fabulous help, and I, for one, am incredibly grateful! Thank you.”
      “Think I'm going to have to download the whole thing very very soon...”
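
The location normalisation outlined on slide 15 (match each contributor's free-text place against a gazetteer automatically, and assign the 'awkward' cases by hand) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the SCOTS project's actual code: the tiny `GAZETTEER`, the `MANUAL_OVERRIDES` table, the `normalise` function and the 0.85 similarity cutoff are all hypothetical stand-ins, and the real Ordnance Survey Gazetteer is far larger.

```python
import re
from difflib import get_close_matches

# Hypothetical mini-gazetteer of settlements (lower-cased key -> canonical name).
# A stand-in for the Ordnance Survey Gazetteer named on the slide.
GAZETTEER = {
    "glasgow": "Glasgow",
    "edinburgh": "Edinburgh",
    "aberdeen": "Aberdeen",
}

# Hypothetical manually assigned 'awkward' cases, e.g. districts within a city.
MANUAL_OVERRIDES = {
    "partick": "Glasgow",
    "anniesland": "Glasgow",
}


def normalise(raw_address):
    """Return a canonical city/town/village name, or None if unresolved."""
    parts = [p.strip().lower() for p in re.split(r"[,\n]", raw_address) if p.strip()]
    # Scan from the end of the address, where the town usually appears.
    for part in reversed(parts):
        if part in MANUAL_OVERRIDES:
            return MANUAL_OVERRIDES[part]
        if part in GAZETTEER:
            return GAZETTEER[part]
        # Tolerate small spelling differences (assumed cutoff of 0.85).
        close = get_close_matches(part, list(GAZETTEER), n=1, cutoff=0.85)
        if close:
            return GAZETTEER[close[0]]
    return None  # left for manual assignment


print(normalise("Flat 1/2, 10 Example Road, Partick"))
print(normalise("Edinburg"))
print(normalise("Atlantis"))
```

Returning `None` instead of guessing keeps unresolved addresses visible, so they can be added to the manual table rather than silently mapped to the wrong place.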