On Measuring the Lexical Quality of the Web

     Ricardo Baeza-Yates
     Yahoo! Research & Web Research Group, Universitat Pompeu Fabra
     Barcelona, Spain

     Luz Rello
     Web Research & NLP Groups, Universitat Pompeu Fabra
     Barcelona, Spain




      WICOW/AIRWeb Workshop on Web Quality -- April 16, 2012, Lyon
Outline

           • Motivation
           • Related Work
           • Measuring Lexical Quality
                - English
                - Spanish
           • Lexical Quality
                - the Web
                - major Internet domains
                - social media
                - geographical distribution
           • Conclusions


Baeza-Yates, R. and Rello, L.   Web Quality 2011, Lyon   On Measuring the Lexical Quality of the Web
Motivation


      Some Facts...

         Measuring the quality of a web page is one of the key problems for web search
         engines

         Intrinsic quality depends on semantic quality, which is very hard to measure

         Many proxies for real quality have been proposed, first in information retrieval,
         based on the use of words, and later on the Web, using link analysis and
         click-through data (R. Baeza-Yates & B. Ribeiro-Neto, 2011)

         Previous work has shown that there is a correlation between spelling errors and
         the content quality of web data (Gelman & Barletta, WICOW 2008)




Related Work


   Web Quality

      Content:
        • accuracy, source reputation, objectivity, highly current, ...
        • e.g. spam detection or source credibility
        • community feedback, user interactions, click counts, ...

      Representation (Lexical Quality):
        • legibility, spelling errors, grammatical errors, ...
        • Spelling error rate as a metric to indicate the degree of quality of websites
        • They use a set of ten frequently misspelled words and hit counts of a search engine
          (Gelman & Barletta, WICOW 2008)

Lexical Quality


     Lexical quality mainly refers to the degree of excellence of words in a text.

     It impacts the reader’s understanding and is also related to textual
     accessibility, as text with errors is read more slowly by all people
     (Rello & Baeza-Yates, WWW 2012 poster).

     A lexical representation has high quality to the extent that it has a
     fully specified orthographic representation (a spelling) and redundant
     phonological representations (one from spoken language and one
     recoverable from orthographic-to-phonological mapping)
     (Perfetti & Hart, 2002).

Measuring Lexical Quality




         A measure of lexical quality for the Web should be
         independent of the size of the text or the number of pages in
         a website, to be able to compare this measure across
         documents, websites or different web segments




Measuring Lexical Quality

     Possible Alternatives

     (1) Compute the rate of spelling errors (number of misspellings / total number of words)
           -> Hard to compute in the context of the Web

     (2) Use a sample of words and use the rate of spelling errors of those individual words
         to maintain independence of the text size
           -> Not trivial to compute in the Web:
              (a) the number of variations increases exponentially with the number of errors
              (b) there might be more than one correct word at the same distance of errors
                  for a given misspelled word

     (3) Find words that are frequent and that also have a frequent misspelling,
         using that occurrence ratio as a proxy of the exact misspelling rate
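
Point (a) can be made concrete with a small sketch. The helper names (`edits1`, `variants`) and the restriction to the 26 lowercase ASCII letters are assumptions made for illustration and do not come from the slides; the construction is the usual enumeration of single edits (deletions, transpositions, substitutions, insertions).

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: plain a-z, no accented letters


def edits1(word: str) -> set[str]:
    """All strings exactly one edit away from `word` (the word itself excluded)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | transposes | replaces | inserts) - {word}


def variants(word: str, max_errors: int) -> set[str]:
    """All strings reachable from `word` with at most `max_errors` edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_errors):
        # Expand only the strings found in the previous round.
        frontier = {v for w in frontier for v in edits1(w)} - seen
        seen |= frontier
    return seen - {word}


if __name__ == "__main__":
    for k in (1, 2):
        print(k, "error(s):", len(variants("album", k)), "candidate strings")
```

Even for a five-letter word, the candidate set for two errors is far larger than for one, which is why enumerating every possible misspelling of every sampled word is impractical with a search engine.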

Measuring Lexical Quality

      We approximate each word's rate of spelling errors by simply dividing the number of
      misspelled occurrences by the number of correct occurrences

       Hence, we define our measure of lexical quality as the average rate of the most
       common misspelling for a set of words. Given a set of words W, we compute the
       ratio of the most common misspelling to the correct spelling, averaged over this
       word sample and scaled by 100 to obtain values around 1.

       That is:
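
Written out from the definition above, the measure reads as follows. This is a reconstruction from the slide text, so the notation is an assumption: $\mathrm{df}(x)$ is the number of pages containing $x$, and $m(w)$ is the most common misspelling of the word $w$.

```latex
\mathrm{LQ} \;=\; 100 \times \frac{1}{|W|} \sum_{w \in W} \frac{\mathrm{df}\big(m(w)\big)}{\mathrm{df}(w)}
```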




    • A lower value of LQ implies better lexical quality, with 0 being perfect quality
    • We estimate df (document frequency) by searching each word only in the English pages of a search engine
    • The relative order of the measure will hardly change as the size of the word set grows
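
As a minimal sketch of the measure, assuming the page counts have already been obtained from some search engine: the function name `lexical_quality`, its `scale` parameter, and the counts in the usage example are hypothetical and are not figures from the study.

```python
from typing import Mapping, Tuple


def lexical_quality(counts: Mapping[str, Tuple[int, int]], scale: float = 100.0) -> float:
    """Average ratio df(misspelling) / df(correct spelling) over a word sample.

    `counts` maps each correct word to (df_correct, df_misspelling).
    Lower is better; 0.0 means no misspelling was found for any word.
    """
    if not counts:
        raise ValueError("the word sample must not be empty")
    ratios = []
    for word, (df_correct, df_misspelling) in counts.items():
        if df_correct <= 0:
            raise ValueError(f"no correct occurrences for {word!r}")
        ratios.append(df_misspelling / df_correct)
    return scale * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical page counts, for illustration only.
    toy_counts = {
        "because": (1_000_000, 500),
        "always": (2_000_000, 400),
    }
    print(lexical_quality(toy_counts))
```

Because each word contributes a ratio rather than a raw count, the result does not depend on how many pages the site or segment contains, which is the independence property asked for earlier.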



Measuring Lexical Quality


       Word Selection Criteria

         Two sets of ten words:
            • English: W_E
            • Spanish: W_S

         Conditions:
            (1) Frequent
            (2) High misspelling ratio
            (3) Non-ambiguous: the misspelling should not also be a valid token,
                e.g. a proper name, an acronym or a foreign word




Measuring Lexical Quality


                    W_E (English)                              W_S (Spanish)

           album                *albun                    entonces            *entocnes
           always               *alwasy                   haciendo            *haceindo
           around               *arround                  hombre              *honbre
           because              *becuase                  momento             *momemto
           enough               *enoguh                   perfecto            *pefecto
           everything           *everyhting               porque              *porqeu
           having               *haveing                  pueden              *peuden
           problem              *problen                  siempre             *siemrpe
           remember             *remenber                 tengo               *tenog
           working              *workig                   vamos               *vamso
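
One property of the table can be checked mechanically: each of the twenty misspellings is a single edit away from its correct form. The sketch below classifies that edit; the function name `edit_type` and the four category labels are illustrative assumptions, not terminology from the slides.

```python
from collections import Counter

W_E = [("album", "albun"), ("always", "alwasy"), ("around", "arround"),
       ("because", "becuase"), ("enough", "enoguh"), ("everything", "everyhting"),
       ("having", "haveing"), ("problem", "problen"), ("remember", "remenber"),
       ("working", "workig")]
W_S = [("entonces", "entocnes"), ("haciendo", "haceindo"), ("hombre", "honbre"),
       ("momento", "momemto"), ("perfecto", "pefecto"), ("porque", "porqeu"),
       ("pueden", "peuden"), ("siempre", "siemrpe"), ("tengo", "tenog"),
       ("vamos", "vamso")]


def edit_type(correct: str, misspelled: str) -> str:
    """Name the single edit turning `correct` into `misspelled`, or 'other'."""
    n = len(correct)
    if len(misspelled) == n:
        diffs = [i for i in range(n) if correct[i] != misspelled[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == misspelled[diffs[1]]
                and correct[diffs[1]] == misspelled[diffs[0]]):
            return "transposition"
    elif len(misspelled) == n + 1:
        # One extra letter: removing it must give back the correct word.
        if any(misspelled[:i] + misspelled[i + 1:] == correct for i in range(n + 1)):
            return "insertion"
    elif len(misspelled) == n - 1:
        # One missing letter: removing it from the correct word gives the misspelling.
        if any(correct[:i] + correct[i + 1:] == misspelled for i in range(n)):
            return "deletion"
    return "other"


if __name__ == "__main__":
    print(Counter(edit_type(c, m) for c, m in W_E + W_S))
```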



Measuring Lexical Quality

     Ten words seem to be enough




 Note that for both languages the curves are quite similar and although LQ is not comparable
 across languages, in our case the results will be of the same order of magnitude

Measuring Lexical Quality

     LQ gives Independent Information

    We computed the Pearson correlation between LQ, Alexa reach, number of pages per
    website (by Google), number of in-links (by Alexa), and ComScore unique visitors,
    over the 13 websites common to the top sites by ComScore unique visitors in the
    USA (December 2011) and by Alexa.com reach (February 2012).
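
For reference, the Pearson coefficient can be computed as in the sketch below. The function name `pearson` and the sample vectors are illustrative assumptions; they are not the measurements used in the study.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-site values, for illustration only.
    lq_values = [0.02, 0.05, 0.03, 0.08]
    page_counts = [1.0e6, 4.0e6, 2.5e6, 9.0e6]
    print(pearson(lq_values, page_counts))
```

A coefficient near zero against traffic features would indicate that LQ carries information those features do not.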




    This shows that more content implies a higher misspelling rate and that web traffic
    does not imply better lexical quality.

    Therefore, we believe that LQ is a good estimator of the lexical quality of a website.
Lexical Quality of the Web

   But the Search Engine Matters


     Search engine counts are never exact, but within one engine they all come from the
     same estimation algorithm, so the results are still valid for comparing different
     web segments.

     Using Google, we obtained an LQ of 0.047 for the English Web in March 2011

     Using exact counts for the English pages of Yahoo! in March 2011, LQ was 0.099

     Using the sampling technique of Bar-Yossef & Gurevich to obtain a set of 28,000
     web pages (68% in English), we got 0.037

     All of these values have the same order of magnitude
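
The order-of-magnitude statement can be checked directly from the three figures above. In this sketch the order of magnitude is taken to be the floor of the base-10 logarithm, which is an assumption about how the phrase is meant; the dictionary keys are only labels.

```python
from math import floor, log10

# LQ values for the English Web as reported on this slide.
LQ_ENGLISH_WEB = {
    "Google (March 2011)": 0.047,
    "Yahoo! exact counts (March 2011)": 0.099,
    "Bar-Yossef & Gurevich sample": 0.037,
}


def order_of_magnitude(value: float) -> int:
    """Exponent of the largest power of ten not exceeding `value`."""
    return floor(log10(value))


orders = {name: order_of_magnitude(v) for name, v in LQ_ENGLISH_WEB.items()}
spread = max(LQ_ENGLISH_WEB.values()) / min(LQ_ENGLISH_WEB.values())

if __name__ == "__main__":
    for name, exponent in orders.items():
        print(f"{name}: 10^{exponent}")
    print(f"largest / smallest = {spread:.2f}")
```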




Lexical Quality of the Web

   Time also Matters




     Correlation among years for the same search engine is not high due to the intense
     dynamics of web content

     Notice that the lexical quality is getting worse due to many factors (Web 2.0, new
     users, etc.)




Lexical Quality of the Major Internet Domains

     In English




Lexical Quality of the Major Internet Domains

     ... and in Spanish




Lexical Quality of the Social Media

    In English and Spanish




        On Flickr the lexical quality is better than on the Web as a whole. One possible
        explanation is that texts on Flickr are short (e.g. tags) and our words are long.

        The order is a bit different in Spanish, probably because some of those sites
        are more popular in English than in Spanish.


Geographical Distribution of the Lexical Quality


    We have taken into account the countries which have the highest populations of native
    English and Spanish speakers.

    In English in descending order they are: United States (215 M), United Kingdom (58.1
    M), Canada (17.7 M), Australia (15.6 M), Nigeria (4 M), Ireland (3.8 M), South Africa
    (3.7 M) and New Zealand (3.6 M).

    We have also added India (86.1 M) and the Philippines (44 M) to our group of
    countries, as English is widespread there as a second language.

    In Spanish the countries are: Mexico (104.1 M), Colombia (45.9 M), Spain (42.0 M),
    Argentina (36.3 M), Venezuela (28.4 M), Peru (25.3 M), Chile (17.1 M), Ecuador (11.9
    M), Cuba (11.2 M) and Dominican Republic (10.0 M).

    The United States is not included, even though Spanish is spoken there by a large population.


Geographical Distribution of the Lexical Quality




Geographical Distribution of the Lexical Quality

    USA, Nigeria and India have the highest lexical quality. This can be explained by the
    high education level of users, as in India and Nigeria only 6.9% and 28.9% of their
    respective populations have Internet access. In addition, websites written in English in
    these countries tend to be official websites since English is an official language used in
    education, government and business, but is not the most common language. In the
    USA, the domain .us is less frequent than .com or .net, but USA has the highest
    number of Internet users.

    South Africa and the Philippines also have considerably high lexical quality, given
    the co-existing varieties or dialects of English in those countries.

    We observe a common trend between lower lexical quality and higher Internet access
    rate in Canada, Australia, United Kingdom, and New Zealand. An explanation of this
    could be the impact of social media in countries where Internet penetration is higher.

    In Spanish we can notice that the lexical quality in all countries is better than the Web
    average.

Lexical Quality in English and Spanish

     In English and Spanish




Conclusions

   • Our results show that the correlation between lexical quality and domain quality
   is high, and that the geographical distribution of lexical quality shows the impact of
   business web pages and number of users among English-speaking countries.

   • We speculate that the low lexical quality in countries where social media has a
   greater impact is related to a greater amount of UGC in their websites. However, as
   we use a small number of misspellings, we may not capture the real lexical quality of
   those websites. Hence, a tailored set of words might be needed for some social media sites.

   • LQ can be used as a feature to assess web content quality, or it could help to
   estimate the understandability of a text in accessibility practices.

   • Our results show that it is important to analyze LQ periodically on the Web

   • Future work will include further validation of our results regarding our lexical
   quality measure, as well as improving the measure itself.




                  Thank you for your attention

                                     Questions?





Ricardo Baeza-Yates, Luz Rello-On Measuring the Lexical Quality of the Web-WICOW/AIRWeb 2012

  • 1. On Measuring the Lexical Quality of the Web Ricardo Baeza-Yates Luz Rello Yahoo! Research & Web Research & NLP Groups Web Research Group, Universitat Pompeu Fabra Universitat Pompeu Fabra Barcelona, Spain Barcelona, Spain WICOW/AIRWeb Workshop on Web Quality -- April 16, 2012, Lyon
  • 2. Outline Outline — Motivation — Related Work — English — Measuring Lexical Quality — Spanish — the Web — major Internet domains — Lexical Quality — social media — geographical distribution — Conclusions Baeza-Yates, R. and Rello, L. Web Quality 2011, Lyon On Measuring the Lexical Quality of the Web
  • 3. Motivation Outline Some Facts... Measuring the quality of a web page is one of the key problems for web search engines Intrinsic quality depends on semantic quality, which is very hard to measure Many proxies for the real quality were proposed first in information retrieval based on the use of words and later in the Web, using link analysis and click-through data (R. Baeza-Yates & B. Ribeiro-Neto, 2011) Previous work had shown that there is a correlation between spelling errors and web data content quality (Gelman & Barletta, WICOW 2008) Baeza-Yates, R. and Rello, L. Web Quality 2011, Lyon On Measuring the Lexical Quality of the Web
  • 4. Related Work Outline Content: — accuracy, source reputation, objectivity, highly current, ... e.g. spam detection or community feedback, user source credibility. interactions, click counts, ... Web Quality Lexical Quality Representation: — legibility, spelling errors, grammatical errors, ... Spelling error rate as a They use a set of ten metric to indicate the frequently misspelled degree of quality of words and hit counts of a websites search engine (Gelman & Barletta, WICOW 2008) Baeza-Yates, R. and Rello, L. Web Quality 2011, Lyon On Measuring the Lexical Quality of the Web
  • 5. Lexical Quality. Lexical quality mainly refers to the degree of excellence of words in a text. It impacts the reader's understanding and it is also related to textual accessibility, as text with errors is read more slowly by all people (Rello & Baeza-Yates, WWW 2012 poster). A lexical representation has high quality to the extent that it has a fully specified orthographic representation (a spelling) and redundant phonological representations (one from spoken language and one recoverable from orthographic-to-phonological mapping) (Perfetti & Hart, 2002).
  • 6. Measuring Lexical Quality. A measure of lexical quality for the Web should be independent of the size of the text and of the number of pages in a website, so that it can be compared across documents, websites or different web segments.
  • 7. Measuring Lexical Quality. Possible alternatives: (1) Compute the rate of spelling errors (number of misspellings / total number of words): hard to compute in the context of the Web. (2) Use a sample of words and use the rate of spelling errors of those individual words, to maintain independence of the text size: not trivial to compute on the Web, because (a) the number of variations increases exponentially with the number of errors, and (b) there might be more than one correct word at the same error distance from a given misspelled word. (3) Find words that are frequent and that also have a frequent misspelling, using that occurrence ratio as a proxy for the exact misspelling rate.
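Point (a) on slide 7 can be illustrated with a small sketch. It enumerates every string within one and two single-character edits of a word (deletion, transposition, replacement, insertion), a common spelling-correction enumeration that is used here only as an illustration and is not described in the slides. The function names and the lowercase ASCII alphabet are assumptions; Spanish words would need an extended alphabet.

```python
import string

# Assumption: a plain lowercase ASCII alphabet (no accented letters).
ALPHABET = string.ascii_lowercase


def edits1(word):
    """Return all strings exactly one single-character edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | transposes | replaces | inserts) - {word}


def within_two_edits(word):
    """Return all strings reachable from word with one or two edits."""
    one = edits1(word)
    two = set(one)
    for variant in one:
        two |= edits1(variant)
    two.discard(word)
    return two


if __name__ == "__main__":
    one = edits1("album")
    two = within_two_edits("album")
    # The candidate set grows sharply from one error to two errors.
    print(len(one), len(two))
```

Counting hits for every such variant on a search engine quickly becomes impractical, which is one way to read the motivation for alternative (3).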
  • 8. Measuring Lexical Quality. We approximate the spelling error rate of a word by simply dividing the number of occurrences of its misspelling by the number of correct occurrences. Hence, we define our measure of lexical quality as the average rate of the most common misspelling for a set of words. That is, given a set of words W, we compute the ratio of the most common misspelling to the correct spelling, averaged over this word sample and scaled by 100 to obtain values around 1: LQ = 100 × mean over w in W of df(most common misspelling of w) / df(w), where df is the document frequency. A lower value of LQ implies better lexical quality, with 0 being perfect quality. We estimate df by searching for each word only in the English pages of a search engine. The relative order of the measure will hardly change as the size of the set grows.
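The measure described on slide 8 can be sketched as follows, reading the slide text literally: average the misspelling-to-correct ratio over the word sample and scale by 100. The function name, the input layout, and the counts in the usage example are hypothetical; the slides obtain df from search engine hit counts, which this sketch does not attempt to query.

```python
def lexical_quality(pairs, df):
    """Return LQ for a word sample.

    pairs: list of (correct_word, most_common_misspelling) tuples.
    df:    dict mapping each string to its document frequency
           (in the slides, a search engine hit count).
    """
    if not pairs:
        raise ValueError("the word sample must not be empty")
    ratios = []
    for correct, misspelling in pairs:
        if df[correct] <= 0:
            raise ValueError(f"no occurrences of correct word: {correct!r}")
        ratios.append(df[misspelling] / df[correct])
    # Average over the sample, scaled by 100 as stated on the slide.
    return 100 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not measured data).
    sample = [("because", "becuase"), ("always", "alwasy")]
    counts = {"because": 1000, "becuase": 1, "always": 2000, "alwasy": 1}
    print(lexical_quality(sample, counts))
```

Because each term is a ratio of counts, the value does not depend on the size of the collection, which is the independence property asked for on slide 6.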
  • 9. Measuring Lexical Quality. Word selection criteria. Two sets of ten words: English (WE) and Spanish (WS). Conditions: (1) the word is frequent; (2) it has a high misspelling ratio; (3) the misspelling is unambiguous, i.e. it is not, for example, a proper name, an acronym or a foreign word.
  • 10. Measuring Lexical Quality. Word sets (correct word, *misspelling). WE: album *albun; always *alwasy; around *arround; because *becuase; enough *enoguh; everything *everyhting; having *haveing; problem *problen; remember *remenber; working *workig. WS: entonces *entocnes; haciendo *haceindo; hombre *honbre; momento *momemto; perfecto *pefecto; porque *porqeu; pueden *peuden; siempre *siemrpe; tengo *tenog; vamos *vamso.
  • 11. Measuring Lexical Quality. Ten words seem to be enough. Note that the curves for both languages are quite similar. Although LQ is not comparable across languages, in our case the results are of the same order of magnitude.
  • 12. Measuring Lexical Quality. LQ gives independent information. We computed the Pearson correlation, over the 13 websites common to the top sites by ComScore unique visitors in the USA (December 2011) and by Alexa.com reach (February 2012), between LQ, Alexa reach, number of pages in the website (by Google), number of in-links (by Alexa), and ComScore unique visitors. This shows that more content implies a higher misspelling rate and that web traffic does not imply better lexical quality. Therefore, we believe that LQ is a good estimator of the lexical quality of a website.
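For readers unfamiliar with the statistic used on slide 12, here is a minimal sketch of the Pearson correlation coefficient between two equally long lists of per-website values. The function name is an assumption, and no values from the slides are reproduced, since the correlation table is not part of this transcript.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (norm_x * norm_y)
```

Applied to, say, a list of LQ values and a list of page counts for the same websites, a coefficient near 1 or -1 would indicate a strong linear relation and a coefficient near 0 would indicate none.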
  • 13. Lexical Quality of the Web. But the search engine matters. Search engine counts are never exact, but all of them are given by the same estimation algorithm, so the results are still valid for comparing different web segments. Using Google, the LQ of the English Web in March 2011 was 0.047. Using exact counts for the English pages of Yahoo! in March 2011, it was 0.099. Using the sampling technique of Bar-Yossef & Gurevich to obtain a set of 28,000 web pages (68% in English), we got 0.037. All of these values have the same order of magnitude.
  • 14. Lexical Quality of the Web. Time also matters. The correlation among years for the same search engine is not high, due to the intense dynamics of web content. Notice that lexical quality is getting worse due to many factors (Web 2.0, new users, etc.).
  • 15. Lexical Quality of the Major Internet Domains: in English
  • 16. Lexical Quality of the Major Internet Domains: ... and in Spanish
  • 17. Lexical Quality of the Social Media: in English and Spanish. In Flickr the lexical quality is better than in the Web. A possible explanation is that texts in Flickr are short (e.g. tags) and our words are long. The order is a bit different in Spanish, probably because some of those sites are more popular in English than in Spanish.
  • 18. Geographical Distribution of the Lexical Quality. We have taken into account the countries with the largest populations of native English and Spanish speakers. For English, in descending order, they are: United States (215 M), United Kingdom (58.1 M), Canada (17.7 M), Australia (15.6 M), Nigeria (4 M), Ireland (3.8 M), South Africa (3.7 M) and New Zealand (3.6 M). We have also added India (86.1 M) and the Philippines (44 M), where English as a second language is widespread. For Spanish the countries are: Mexico (104.1 M), Colombia (45.9 M), Spain (42.0 M), Argentina (36.3 M), Venezuela (28.4 M), Peru (25.3 M), Chile (17.1 M), Ecuador (11.9 M), Cuba (11.2 M) and Dominican Republic (10.0 M). The United States is not included, even though Spanish is spoken there by a large population.
  • 19. Geographical Distribution of the Lexical Quality
  • 20. Geographical Distribution of the Lexical Quality. USA, Nigeria and India have the highest lexical quality. This can be explained by the high education level of users, as in India and Nigeria only 6.9% and 28.9% of their respective populations have Internet access. In addition, websites written in English in these countries tend to be official websites, since English is an official language used in education, government and business, but is not the most common language. In the USA, the domain .us is less frequent than .com or .net, but the USA has the highest number of Internet users. South Africa and the Philippines also have a considerably high lexical quality, considering the co-existing varieties or dialects of English in those countries. We observe a common trend between lower lexical quality and higher Internet access rate in Canada, Australia, United Kingdom and New Zealand. A possible explanation is the impact of social media in countries where Internet penetration is higher. In Spanish we can notice that the lexical quality in all countries is better than the Web average.
  • 21. Lexical Quality in English and Spanish
  • 22. Conclusions.
      • Our results show that the correlation between lexical quality and domain quality is high, and that the geographical distribution of lexical quality shows the impact of business web pages and of the number of users among English-speaking countries.
      • We speculate that the low lexical quality in countries where social media has a greater impact is related to a greater amount of UGC in their websites. But, as we use a small number of misspellings, we may not capture the real LQ of those websites. Hence, a tailored set of words might be needed for some social media sites.
      • LQ can be used as a feature to assess web content quality, or it could help to estimate the understandability of a text in accessibility practices.
      • Our results show that it is important to analyze the LQ of the Web periodically.
      • Future work includes further validating our results regarding our lexical quality measure, as well as improving the measure itself.
  • 23. Thank you for your attention. Questions?