Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Transcript

  • 1. Estimating Dyslexia in the Web. Ricardo Baeza-Yates (Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain) and Luz Rello (Web Research and NLP Groups, Pompeu Fabra University, Barcelona, Spain). W4A 2011, Hyderabad
  • 2. Outline
    - What
    - Why to distinguish dyslexic errors
    - How to build a sample to measure dyslexia
    - Results
    Ricardo Baeza-Yates and Luz Rello, W4A 2011, Hyderabad, Estimating Dyslexia in the Web
  • 3. What
    - Dyslexia: Dyslexia is a neurologically-based disorder which interferes with the acquisition and processing of language. It manifests itself with difficulties in receptive and expressive language, including phonological processing, in reading, writing, spelling and handwriting and sometimes in arithmetic. (Committee of Members, Orton Dyslexia Society. Definition of Dyslexia, 1994.)
    - Dysphonetic dyslexia (The Boder's Test of Reading-Spelling Patterns): The largest of the three subtypes of dyslexia that the author presents. Dysphonetic dyslexia is viewed as a disability in associating symbols with sounds. The misspellings typical of this disorder are due to phonetic inaccuracy. (Boder & Jarrico, 1982)
  • 4. Why: All languages
    - There is a universal neuro-cognitive basis for dyslexia. (Paulesu et al. 2001)
    - Its manifestations are culture-specific due to different orthographies. (Alegria, 2006)
    - English is a language with deep orthography: the mapping between letters, speech sounds, and whole-word sounds is often highly ambiguous, and therefore dyslexic examples are more widespread than in other languages with transparent or shallow orthography. (Paulesu et al. 2001)
  • 5. Why: Frequent
    - Researchers estimate that 10-17% of the population in the U.S.A. has dyslexia and only 30% of dyslexics have trouble with reversing letters and numbers. On the other hand, the level of dyslexia in other regions such as Europe or China is lower. (H. Meng et al., 2005)
    - There are around 38 million dyslexics in Europe. (Ruiz del Árbol, 2008)
  • 6. Why: Useful
    - Detecting the presence of dyslexic texts in the Web helps us to know the real impact of dyslexia in the Web as well as to value dyslexic-accessible practices.
    - There is common agreement in these studies that applying dyslexic-accessible practices also improves readability for non-dyslexic users, as well as for users with other disabilities such as low vision. (McCarthy & Swierenga, 2010) (Evett & Brown, 2005)
    - Spelling error rates have proven to be a useful index for website content quality. (Gelman & Barletta, 2008)
  • 7. Why: Novel
    - Estimating dyslexia in a group of web pages depending on their domain. (Ringlstetter et al. 2006)
    - This is the first attempt to estimate the amount of texts containing English dyslexic errors in the Web.
  • 8. How: Two examples of dyslexic texts (Pedler, 2007)
    - Example 1: There seams to be some confusetion. Althrow he rembers the situartion, he is not clear on z detailes. With regard to deleteing parts, could you advice me of the excat nature of the promblem and I will investgate it imeaditly.
    - Example 2: I halve a spelling chequer / It cam with my pea see / Eye now I've gut the spilling rite / Its plane fore al too sea ... / I ts latter prefect awl the weigh / My chequer tolled mi sew.
  • 9. How: How many kinds of errors can be produced by a dyslexic? (Pedler, 2007)
    - By number of errors per word: simple errors 53%, multi-errors 39%, word boundary errors 8% (100% of dyslexic errors)
    - By result: real-word errors 17%, non-word errors 83% (100%)
    - First letter errors: 5%
  • 10. How: How many kinds of errors in the Web?
    1. Dyslexic errors: among the different kinds of errors commonly made by dyslexics (e.g. unfinished words or letters, omitted words, inconsistent spaces between words and letters) (Vellutino, 1979). For example, *reiecve instead of receive.
    2. Regular spelling errors produced by non-impaired native English individuals, such as the transposition error, e.g. *recieve.
    3. Regular typos caused by the adjacency of letters on the keyboard, e.g. *teceive.
    4. OCR errors, due to letters of similar shape, such as *ieceive.
    5. Errors made by non-native speakers who use English as a foreign language. For example, *receibe is a typical error made by Spanish learners of English, since the graphemes 'b' and 'v' are both pronounced as /b/, and the phoneme /v/ does not exist in the standard Spanish phonemic system.
  • 11. How: Selection criteria
    To avoid the overlap of dyslexic errors and other errors:
    - We consider only words written by dyslexics containing multi-errors, that is, the dyslexic word differs from the intended correct word by more than one letter. For example, the dyslexic word *konwlegde from knowledge.
    To avoid the overlap of dyslexic errors and real words:
    - Errors which coincide with other existing words in English are omitted, e.g. *trust when the intended word is truth.
    - Errors which result in a proper name are also filtered. For instance, the typo *wirries from worries is also a proper name.
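The selection criteria above can be sketched as a small filter. This is only an illustration under stated assumptions: the slides do not say how "differs by more than one letter" is measured, so the sketch assumes an edit distance in which one adjacent transposition counts as a single error (optimal string alignment). The function names and the two tiny word lists are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent transposition counts as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def is_sample_candidate(error, intended, english_words, proper_names):
    """Keep only multi-errors that are neither real words nor proper names."""
    if osa_distance(error, intended) <= 1:
        return False  # a single error overlaps with ordinary typos and misspellings
    if error in english_words or error in proper_names:
        return False  # collides with an existing word or name
    return True


# Hypothetical stand-ins for a real dictionary and a real name list.
english_words = {"trust", "truth", "knowledge", "receive", "worries"}
proper_names = {"wirries"}

print(is_sample_candidate("konwlegde", "knowledge", english_words, proper_names))
print(is_sample_candidate("trust", "truth", english_words, proper_names))
```

Under this assumed distance, *recieve is a single error and is rejected, while *konwlegde contains two transpositions and is kept.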
  • 12. How: Selection criteria
    - All the dyslexic spelling errors are extracted from samples of text written by adults with diagnosed dyslexia (taken from a corpus compiled for this purpose) and from the literature (Pedler, 2007).
    - Among the dyslexic errors, we take into account the ones which include the letters that produce the most confusion among dyslexic individuals, such as 'b', 'd', 'p', 'm', 'n', 'u' and 'w', together with other similar-looking letters. For instance, reversals of similar letters, such as 'b' and 'd', are especially frequent (Deloche et al. 1982), e.g. *impossidle when the intended word is impossible.
    - Errors due to homophone confusion, that is, between words which have a similar pronunciation such as witch and which (Pedler, 2007), are not selected, even though 15% of the dyslexic errors in a corpus of dyslexic texts presented homophone confusion.
  • 13. How: Sample D, an example for the word comparison
    1. Dyslexic error: *comaprsion.
    2. Spelling errors: *comparision, *conparison and *coparison.
    3. Typos: *vomparison, *xomparison, *cimparison, *cpmparison, *conparison, *co,parison, *comoarison, *com[arison, *comprison, *compsrison, *compaeison, *compatison, *comparuson, *comparoson, *compariaon, *comparidon, *comparisin, *comparispn, *comparisob and *comparisom.
    4. OCR errors: *compaiison and *comparisom.
    5. Non-native speakers' errors: *comparition and *comparizon.
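Most of the typos in item 3 look like substitutions of a letter by its left or right neighbour on the keyboard. A minimal sketch of how such a list could be generated follows. It assumes a US QWERTY layout, which the slides do not name, and the function name is hypothetical. It is not claimed to be the authors' procedure.

```python
# Rows of a US QWERTY keyboard (assumed layout, not stated in the slides).
QWERTY_ROWS = ["qwertyuiop[", "asdfghjkl;", "zxcvbnm,"]


def adjacent_key_typos(word):
    """Replace each letter in turn by its left and right neighbour on the same row."""
    typos = []
    for pos, ch in enumerate(word):
        for row in QWERTY_ROWS:
            idx = row.find(ch)
            if idx == -1:
                continue  # this letter is not on this row
            for neighbour in (idx - 1, idx + 1):
                if 0 <= neighbour < len(row):
                    typos.append(word[:pos] + row[neighbour] + word[pos + 1:])
    return typos


typos = adjacent_key_typos("comparison")
print(len(typos), typos)
```

For comparison this produces 19 candidates, including *vomparison, *co,parison and *com[arison. The slide's *comprison is an omission rather than a substitution, so this sketch does not generate it.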
  • 14. How: Sample D, dyslexic errors
    - comparison: *comaprsion
    - understanding: *understangind
    - knowledge: *knwolegde
    - impossible: *inpossbile
    - tomorrow: *torromow
    - worries: *worires
    - explain: *exaplin
    - interesting: *intersenting
    - situation: *situartion
    - confusion: *confusetion
  • 15. How: Estimating Dyslexia in the Web
    - Let us define:
      f : fraction of Web pages with lexical errors.
      d : fraction of dyslexic errors among all lexical errors.
    - Then, the fraction of Web pages with dyslexia is f × d.
    - We find a lower bound for f and d, to obtain a lower bound for the fraction of dyslexic pages in the Web.
  • 16. How: Estimating Dyslexia in the Web
    - We use the main search engines (Bing, Google and Yahoo!) to estimate the document frequency of a word.
    - Each of the words in our list is searched only in English web pages, to avoid cases of wrong words that may have a meaning in another language.
  • 17. How: Estimating Dyslexia in the Web
    - We bound the relative fraction of documents with lexical errors, f, by using misspellings (*becuase, *trhough, etc.) of a sample of frequent words that appear in most documents, usually called stopwords in information retrieval.
    - We use the largest relative fraction of misspellings over all these words to estimate f, as we cannot assume that all of them appear in different pages.
    - To bound d we do the same frequency search with a sample of non-frequent words (Sample D), where we can distinguish the different types of errors without ambiguity.
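The bound on f described above can be sketched as taking the maximum misspelling ratio over the stopword sample. The sketch assumes that "relative fraction" means the document frequency of the misspelling divided by the document frequency of the correct word, which the slides do not spell out. The function name is hypothetical and the counts are invented placeholders, not search-engine results.

```python
def lower_bound_f(doc_freqs):
    """Largest ratio of misspelling document frequency to correct-word document frequency.

    doc_freqs maps a misspelling to (misspelling_count, correct_word_count).
    The maximum is used instead of the sum because different misspellings
    may occur on the same pages.
    """
    return max(wrong / right for wrong, right in doc_freqs.values())


# Invented placeholder counts, for illustration only.
placeholder_counts = {
    "becuase": (3, 1000),
    "trhough": (1, 1000),
}

f_bound = lower_bound_f(placeholder_counts)
print(f"lower bound for f: {f_bound:.2%}")
```

Taking the maximum rather than the sum keeps the estimate a valid lower bound even if every misspelling occurs on the same set of pages.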
  • 18. Results: Estimating Dyslexia in the Web
    [Table: range of percentages and average for the different error classes.]
    We use the real document frequencies of the terms from one of the search engines to validate the results obtained, finding very similar results.
  • 19. Results: Estimating Dyslexia in the Web
    - From Sample D, the percentage of dyslexic errors among all lexical errors is very low, with an average of 0.67%.
    - From Pedler (2007), only 39% of dyslexic errors are multi-errors.
    - This implies that the lower bound is at least d/0.39, but we can safely use a factor of 3 to correct this fact.
    - We have that f is at least 0.27%, from the word *becuase.
    - Then, we can estimate d as 2.01% (0.67% × 3).
    - The lower bound for dyslexia in the Web is 0.005% (f × d = 0.27% × 2.01%).
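The arithmetic behind these figures can be checked with a few lines. The numbers are the ones given on the slide and the variable names are only illustrative.

```python
# Figures taken from the slide; variable names are illustrative.
multi_error_share = 0.0067  # average share of Sample D dyslexic errors (0.67%)
correction_factor = 3       # factor chosen on the slide to account for non-multi-errors
f = 0.0027                  # lower bound for pages with lexical errors (0.27%)

d = multi_error_share * correction_factor  # corrected share of dyslexic errors
dyslexia_bound = f * d                     # lower bound for dyslexic pages

print(f"d = {d:.2%}")
print(f"f x d = {dyslexia_bound:.4%}")
```

The product is about 0.0054%, which the slide rounds to 0.005%.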
  • 20. Conclusions
    - The amount of dyslexic texts in the Web is not as large as it could be. This suggests that the widespread use of spell checkers ameliorates dyslexia in the Web.
    - Particular words can be used to detect dyslexic texts, and hence dyslexic users. This can be used to improve Web accessibility as well as future spell checkers or other tools targeted to dyslexic users.
  • 21. Conclusions
    - Since this is the first attempt to estimate text written by dyslexic individuals in the Web, a comparison with previous work is not possible.
    - Previous research on dyslexia reveals that error frequency is related to word length (Pedler, 2007). Short words such as there, where, form, etc. are misspelled much more frequently in dyslexic texts than long words like the ones used in our experiments. Hence, we can obtain a better estimate by using a larger sample of stopwords as well as long dyslexic words.
    - As a byproduct, we have found that other types of errors are much more frequent in the Web, and this can be used to assess the quality of Web text.
  • 22. On-going Work
    - New methodology.
    - Sample enlarged to 50 words.
    - Real data extracted from a leading search engine.
    - Up-down/Left-right typos.
    - New lower bound: 0.8 % (16 times better).
    [Table: range of percentages and average for the different error classes.]
  • 23. Future Work
    1. Identification of dyslexic errors. Dyslexia diagnosis.
    2. NLP techniques for making text more accessible for dyslexic users.
    3. Web quality estimation (Gelman & Barletta, 2008), across countries, domains and social media.
  • 24. Zank u beri mach