
香港六合彩 » SlideShare


Published in: Economy & Finance

  1. 1. Working Together: A Collaborative Approach to DIY Corpora Lynne Bowker University of Ottawa, Canada [email_address]
  2. 2. Overview <ul><li>Background </li></ul><ul><li>Experiment </li></ul><ul><li>Results </li></ul><ul><li>Discussion </li></ul><ul><li>Observations about WWW as corpus resource </li></ul><ul><li>Concluding remarks </li></ul>
  3. 3. Background <ul><li>Previous experience with corpus use in translation classroom: </li></ul><ul><ul><li>A single large “multipurpose” corpus </li></ul></ul><ul><ul><ul><li>One size doesn’t fit all… </li></ul></ul></ul><ul><ul><li>Individual corpora built by trainer </li></ul></ul><ul><ul><ul><li>Students not learning about corpus design and building; total exhaustion of trainer… </li></ul></ul></ul><ul><ul><li>Individual corpora built by students </li></ul></ul><ul><ul><ul><li>Small and poorly designed… </li></ul></ul></ul>
  4. 4. There has to be a better way… <ul><li>Inspiration from </li></ul><ul><ul><li>those who have shown that group work can work </li></ul></ul><ul><ul><ul><li>e.g. Maia, Varantola </li></ul></ul></ul><ul><ul><li>advocates of a “learning-centred” approach </li></ul></ul><ul><ul><ul><li>e.g. Kiraly, Yuste </li></ul></ul></ul>
  5. 5. Collaborative approach <ul><li>Build one corpus per text/group of related texts </li></ul><ul><li>Entire class would contribute to each corpus </li></ul><ul><li>Parameters: </li></ul><ul><ul><li>a) coordinators, b) number of texts contributed, c) quality of texts, d) time frame, e) file format </li></ul></ul>
  6. 6. a) coordinators <ul><li>2 students per corpus </li></ul><ul><ul><li>Coordinators don’t have to contribute </li></ul></ul><ul><li>Receive corpus submissions from other students via email </li></ul><ul><ul><li>Special email account set up </li></ul></ul><ul><li>Act as “clearing house” </li></ul><ul><ul><li>evaluate relevance of texts and eliminate duplicates </li></ul></ul><ul><li>Collate remaining texts into corpus for posting on class website </li></ul>
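The slide does not say how coordinators spotted duplicates. As a minimal sketch, assuming submissions arrive as plain text files, exact and near-exact copies could be filtered by hashing a normalized version of each text. The function names and the filename-to-text dictionary are hypothetical, and judging relevance would still be a manual step.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not hide an otherwise identical text."""
    return " ".join(text.lower().split())


def drop_duplicates(submissions: dict[str, str]) -> dict[str, str]:
    """Keep the first copy of each distinct text.

    `submissions` maps a (hypothetical) filename to the text content.
    """
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for name, text in submissions.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already submitted by another student
        seen.add(digest)
        kept[name] = text
    return kept
```

This only catches copies that differ in case or spacing; two pages with the same article but different navigation text would still need a human decision.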
  7. 7. b) number of texts contributed by each student per corpus <ul><li>Class size of 20-30 students </li></ul><ul><ul><li>22 students for this experiment </li></ul></ul><ul><li>Each student tries to contribute 3 relevant texts </li></ul><ul><ul><li>Can submit more if they find more </li></ul></ul><ul><li>3 x 22 = 66 – some duplicates = reasonable size corpus </li></ul><ul><ul><li>Likely to be larger than those previously created by individuals </li></ul></ul>
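The size estimate on this slide is simple arithmetic, sketched here for illustration. The function names are hypothetical, and the duplicate count in the usage note is an assumed value, not a figure from the experiment.

```python
def max_submissions(students: int, texts_per_student: int) -> int:
    """Upper bound on submissions if every student meets the target."""
    return students * texts_per_student


def corpus_size(students: int, texts_per_student: int, duplicates: int) -> int:
    """Texts remaining after duplicates are removed (never below zero)."""
    return max(0, max_submissions(students, texts_per_student) - duplicates)
```

With 22 students and 3 texts each, `max_submissions(22, 3)` gives the 66 on the slide; `corpus_size(22, 3, duplicates=10)` would leave 56 texts under the assumed duplicate count.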
  8. 8. c) quality of texts <ul><li>Students must put time and care into text selection </li></ul><ul><ul><li>If everyone simply sends the first 3 hits found using Alta Vista: </li></ul></ul><ul><ul><ul><li>texts may not be relevant </li></ul></ul></ul><ul><ul><ul><li>there would be too much duplication </li></ul></ul></ul>
  9. 9. d) time frame <ul><li>Everyone needs a reasonable amount of time to do their job </li></ul><ul><ul><li>Trainer provides source text 3 weeks in advance </li></ul></ul><ul><ul><li>Students have 1 week to submit texts </li></ul></ul><ul><ul><li>Coordinators have 1 week to evaluate and collate texts </li></ul></ul><ul><ul><li>Everyone has 1 week to consult corpus </li></ul></ul>
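The three one-week phases can be laid out as dates. This is a sketch under one assumption not stated on the slide: that the consultation week ends on the day the translation is due. The function and key names are hypothetical.

```python
from datetime import date, timedelta


def corpus_schedule(translation_due: date) -> dict[str, date]:
    """Work backwards from the due date in one-week steps."""
    week = timedelta(weeks=1)
    return {
        "source_text_distributed": translation_due - 3 * week,
        "submissions_due": translation_due - 2 * week,  # students: 1 week
        "corpus_posted": translation_due - week,        # coordinators: 1 week
        "translation_due": translation_due,             # everyone: 1 week to consult
    }
```

For example, `corpus_schedule(date(2024, 3, 22))` places the source text on 1 March, submissions on 8 March, and the posted corpus on 15 March.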
  10. 10. e) file format <ul><li>Texts to be submitted as plain text files </li></ul><ul><ul><li>Easier for coordinators (don’t need to convert files or have access to different software packages) </li></ul></ul><ul><ul><li>Resulting corpus can be manipulated using a variety of corpus processing tools </li></ul></ul><ul><ul><li>Lower risk of spreading viruses </li></ul></ul>
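  11. 11. Resulting corpora
<table>
<tr><th>Subject</th><th>Text type</th><th>Texts submitted</th><th>Texts rejected</th><th>Number of texts / words in corpus</th></tr>
<tr><td>Passwords</td><td>FAQ Web page</td><td>58</td><td>35</td><td>23 texts / 40,600 words</td></tr>
<tr><td>Antivirus programs</td><td>Instructional</td><td>78</td><td>22</td><td>56 texts / 170,919 words</td></tr>
<tr><td>Encryption</td><td>Informative/popularized</td><td>74</td><td>19</td><td>55 texts / 216,522 words</td></tr>
<tr><td>Firewalls</td><td>Buyer’s guide</td><td>63</td><td>18</td><td>45 texts / 136,017 words</td></tr>
<tr><td>Steganography</td><td>Product description</td><td>35</td><td>21</td><td>14 texts / 7,401 words</td></tr>
<tr><td>Biometrics</td><td>Research article</td><td>29</td><td>17</td><td>12 texts / 69,651 words</td></tr>
<tr><td>Cookies</td><td>Technical encyclopedia entry</td><td>41</td><td>19</td><td>22 texts / 11,754 words</td></tr>
</table>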
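The figures on this slide can be cross-checked with a few lines of code. The sketch below copies the numbers from the slide, confirms that submitted minus rejected equals the number of texts in each corpus, and derives two values the slide does not list: rejection rate and mean words per text. The variable and function names are hypothetical.

```python
# Rows copied from the "Resulting corpora" slide:
# (subject, texts submitted, texts rejected, texts in corpus, words in corpus)
CORPORA = [
    ("Passwords", 58, 35, 23, 40_600),
    ("Antivirus programs", 78, 22, 56, 170_919),
    ("Encryption", 74, 19, 55, 216_522),
    ("Firewalls", 63, 18, 45, 136_017),
    ("Steganography", 35, 21, 14, 7_401),
    ("Biometrics", 29, 17, 12, 69_651),
    ("Cookies", 41, 19, 22, 11_754),
]


def summarize(rows):
    """Return per-corpus consistency flag, rejection rate and mean text length."""
    summary = {}
    for subject, submitted, rejected, kept, words in rows:
        summary[subject] = {
            "consistent": submitted - rejected == kept,
            "rejection_rate": rejected / submitted,
            "mean_words_per_text": words / kept,
        }
    return summary


if __name__ == "__main__":
    for subject, stats in summarize(CORPORA).items():
        print(f"{subject:20s} rejected {stats['rejection_rate']:.0%}, "
              f"{stats['mean_words_per_text']:.0f} words/text")
```

All seven rows are internally consistent. The derived averages match the later discussion: the biometrics corpus has the fewest texts but the longest ones (roughly 5,800 words each), while the steganography and cookies corpora average roughly 530 words per text.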
  12. 12. Corpus 1: FAQ on passwords <ul><li>High degree of duplication </li></ul><ul><ul><li>All students used web (not library) </li></ul></ul><ul><ul><li>Most used Alta Vista search engine </li></ul></ul><ul><ul><li>Most simply took first 3 hits </li></ul></ul><ul><ul><li>Students were told that different search engines index different pages and were introduced to meta-search engines </li></ul></ul><ul><ul><li>Students agreed to consult more resources and look beyond the first 3 hits </li></ul></ul>
  13. 13. Corpora 2 - 4 <ul><li>Popular subjects </li></ul><ul><ul><li>Viruses, encryption, firewalls </li></ul></ul><ul><li>Relatively common text types </li></ul><ul><ul><li>Instructional text, buyer’s guide, popularized articles </li></ul></ul><ul><ul><ul><li>lots of info available </li></ul></ul></ul><ul><ul><li>Some students submitted more than 3 texts </li></ul></ul><ul><ul><li>Less duplication than with corpus 1 </li></ul></ul>
  14. 14. Corpus 5: steganography <ul><li>Less common subject </li></ul><ul><ul><li>Not popular with “average” users </li></ul></ul><ul><li>Text type: product description </li></ul><ul><ul><li>Relatively few commercial packages available </li></ul></ul><ul><ul><li>Fewer texts to choose from </li></ul></ul><ul><ul><li>more judged “not relevant” (wrong text type) </li></ul></ul><ul><ul><ul><li>Students couldn’t find texts meeting all the criteria but wanted to submit something so they chose anything at all on the subject of steganography </li></ul></ul></ul>
  15. 15. Corpus 6: biometrics <ul><li>Recent research article </li></ul><ul><ul><li>Many links looked promising, but required paid subscription </li></ul></ul><ul><ul><li>Free texts were “older” (not state of the art) </li></ul></ul><ul><ul><li>Relatively few texts submitted </li></ul></ul><ul><ul><ul><li>But texts were long so word count relatively high </li></ul></ul></ul>
  16. 16. Corpus 7: cookies <ul><li>Online technical encyclopedia entry </li></ul><ul><ul><li>Limited number of comparable texts </li></ul></ul><ul><ul><li>Texts were quite short </li></ul></ul><ul><ul><ul><li>Low word count </li></ul></ul></ul>
  17. 17. Observations about using Web as a resource for corpora <ul><li>Great resource on the whole, but does have some limitations </li></ul><ul><ul><li>Sometimes overwhelmed by information </li></ul></ul><ul><ul><ul><li>Must formulate queries carefully to reduce noise </li></ul></ul></ul><ul><ul><ul><li>Think about criteria beyond subject (e.g. text type) </li></ul></ul></ul><ul><ul><ul><ul><li>“cookies” vs +cookies +encyclopedia </li></ul></ul></ul></ul><ul><ul><li>Sometimes underwhelmed by information </li></ul></ul><ul><ul><ul><li>Try same query using different search engines </li></ul></ul></ul>
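The query refinement on this slide (adding a text-type term to the subject term) can be sketched as a small helper. The `+term` notation follows the slide's own example; wrapping multi-word terms in double quotes is an assumption about the search engine's phrase syntax, and `build_query` is a hypothetical name.

```python
def build_query(*terms: str) -> str:
    """Join terms into a query in which every term is required.

    Multi-word terms are quoted so they are treated as a phrase
    (assumed syntax; check the search engine's own documentation).
    """
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore empty criteria
        if " " in term:
            term = f'"{term}"'
        parts.append(f"+{term}")
    return " ".join(parts)
```

Calling `build_query("cookies", "encyclopedia")` reproduces the slide's `+cookies +encyclopedia`, which narrows results to the target text type instead of every page that mentions cookies.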
  18. 18. <ul><li>Quality control </li></ul><ul><ul><li>Anyone can post material – be selective </li></ul></ul><ul><ul><li>Seen as an ephemeral resource </li></ul></ul><ul><li>Limited range of text types available </li></ul><ul><ul><li>General interest: widely available, free </li></ul></ul><ul><ul><li>Specialized: more limited selection, subscription </li></ul></ul><ul><li>Nature of the web </li></ul><ul><ul><li>Good web design not conducive to easy corpus building: hyperlinked documents time-consuming to download </li></ul></ul><ul><ul><li>Multimedia texts not always suitable for text-based corpora </li></ul></ul>
  19. 19. Concluding remarks <ul><li>An overall success </li></ul><ul><ul><li>Corpora were more useful than either the “multipurpose” corpus or the corpora built by individual students </li></ul></ul><ul><ul><ul><li>General improvement in quality of translations </li></ul></ul></ul><ul><ul><li>Shift in pedagogical strategy gave students opportunity to become independent learners </li></ul></ul><ul><ul><ul><li>Reflect on suitability of resources </li></ul></ul></ul><ul><ul><ul><li>Reflect on issues of text type </li></ul></ul></ul><ul><ul><li>Students were positive about the experience </li></ul></ul>