Mike Thelwall: Introduction to Webometrics



Presentation to the second LIS DREaM workshop held at the British Library on Monday 30th January 2012.

More information available at:


  1. 1. Introduction to Webometrics
       Mike Thelwall (@mikethelwall)
       Professor of Information Science, Statistical Cybermetrics Research Group, University of Wolverhampton
  2. 2. Reminder of pre-workshop task
       Delegates were asked to join YouTube and to leave comments, and replies to earlier comments, on the video “Department of Library and Information Science, Delhi”. These contributions will form part of the discussion at the end of the session, which will also refer to the self-declared age and gender information from YouTube.
  3. 3. Overview of “Webometrics”
       • What is Webometrics?
           - Gathering, processing and analysing large-scale data from the web (web pages, hyperlinks, blogs, Web 2.0) for many purposes that include online communication
       • What can Webometrics offer other researchers?
           - Software to gather data from web sites, search engines, social network sites and blogs; methods to extract useful patterns
       • Common data sources
           - Webometric Analyst software: Twitter, YouTube, the Web, Technorati, Bing
           - Bespoke: any resource with an API, page scraping of other sites, SocSciBot web crawler
  4. 4. 1. Background: Webometrics
       • Webometrics is about gathering data on the Web and measuring aspects of the Web:
           - web sites
           - web pages
           - hyperlinks
           - YouTube video commenter networks
           - web search engine results
           - MySpace Friend networks
           - Twitter or blog trends
       • … for varied social science purposes
  5. 5. New problems: Web-based phenomena
       • Webometrics can analyse online academic communication
           - Why do academic web sites interlink?
           - Which academic web sites interlink?
           - What academic interlinking patterns exist?
           - Which web sites/groups/documents have the most online impact, and why?
  6. 6. Old problems: Offline phenomena reflected online
       • Some offline phenomena have measurable online reflections
           - International communication
           - Inter-university collaboration
           - University-business collaboration
           - The impact or spread of ideas
           - Public opinion about science
  7. 7. Example: The online impact of research groups (NetReAct)
  8. 8. Example: Links between EU universities
       [Figure: network diagram of normalised linking between countries, smallest countries removed. Annotation: “Geopolitical connected”. Node labels: Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France, NL.]
  9. 9. International biofuels research network
  10. 10. Data Gathering/Processing Tools
       • Webometric Analyst: web citations, web text, YouTube, Flickr, Technorati
           - Submits thousands of queries to Bing and summarises the results in standard ways
       • SocSciBot: links, web text
           - Web crawler and analyser
  11. 11. 2. Altmetrics in traditional research evaluation
       • Altmetrics can supplement traditional citation impact with non-traditional online impact
           - E.g., educational, discussion-based
       • Often weaker than citation data but useful for research groups that have non-standard types of impact
  12. 12. The Integrated Online Impact Indicator (IOI)
       • Combines a range of online sources into one indicator
           - Google Scholar +
           - Google Books +
           - Course reading lists +
           - Google Blogs +
           - PowerPoint presentations = IOI
       • Or use the individual components separately
       Invented by Kayvan Kousha
  13. 13. New source 1: Google Scholar
       • Wider evidence of academic impact
       • Wider types of academic publications, some non-academic publications
       • Not reliable
       • Coverage variable
       • Can’t be automatically queried
       • Free
  14. 14. New source 2: Google Books
       • Books typically not indexed in WoS or Scopus
       • Relevant in book-based disciplines (arts, humanities, some social sciences)
       • Reliability unknown but probably not good
       • Coverage variable
       • Can be automatically queried
       • Free [Clifford Lynch]
  15. 15. New source 3: Course reading lists
       • Evidence of educational impact
       • Can automatically construct queries to detect individual articles in online syllabuses
       • Get results via advanced Google/Yahoo/Live Search queries
       • Works for most articles
           - Fails for short, common article titles
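The query construction described on this slide might look roughly like the sketch below. The query template, the `min_title_words` threshold and the function name are assumptions made for illustration; the slides do not give the actual query format used.

```python
def syllabus_query(title, first_author_surname, min_title_words=3):
    """Build a search-engine query for an article cited in online syllabuses.

    Returns None for very short titles, which (as the slide notes) tend to
    match too many unrelated pages. The threshold is an assumed value.
    """
    if len(title.split()) < min_title_words:
        return None
    # Hypothetical template: exact-phrase title, author surname, syllabus terms.
    return f'"{title}" {first_author_surname} (syllabus OR "reading list")'


print(syllabus_query("Sentiment in Twitter events", "Thelwall"))
print(syllabus_query("Networks", "Smith"))
```

The resulting string would then be submitted to a search engine; that step is omitted because it depends on whichever search API is available.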
  16. 16. New source 4: Blogs
       • Evidence of impact on discussions
       • Educational impact, public dissemination evidence, academic impact in discursive subjects?
       • Not possible to automate in the largest database (Google Blogs)?
       • Not a well-researched area
  17. 17. New source 5: PowerPoint Presentations
       • Evidence of educational/scholarly impact
       • Especially relevant for discursive subjects?
       • Automated Live Search/Yahoo advanced queries
       • IOI = a*Scholar + b*PowerPoint + c*Blogs + d*Syllabus + e*Books
       Or use qualitative analyses of the different sources
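The IOI formula is a weighted sum, which can be sketched as below. The slides do not give values for the weights a to e, so the weights here are placeholders, and the counts are made up.

```python
# Placeholder weights: the slides do not specify values for a..e.
WEIGHTS = {
    "scholar": 1.0,
    "powerpoint": 1.0,
    "blogs": 1.0,
    "syllabus": 1.0,
    "books": 1.0,
}


def ioi(counts, weights=WEIGHTS):
    """Weighted sum of per-source counts; missing sources count as zero."""
    return sum(weight * counts.get(source, 0) for source, weight in weights.items())


# Made-up counts for one article.
example_counts = {"scholar": 12, "powerpoint": 3, "blogs": 1, "syllabus": 4, "books": 2}
print(ioi(example_counts))
```

Choosing the weights is the substantive decision; with equal weights the indicator is dominated by whichever source produces the largest raw counts.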
  18. 18. 2. Sentiment Strength Detection in the Social Web with SentiStrength
       • Detect positive and negative sentiment strength in short informal text
           - Develop workarounds for the lack of standard grammar and spelling
           - Harness emotion expression forms unique to MySpace or CMC (e.g., :-) or haaappppyyy!!!)
           - Classify each text simultaneously for positive (1 to 5) AND negative (1 to 5) sentiment strength
       Thelwall, M., Buckley, K., & Paltoglou, G. (in press). Sentiment strength detection for the social Web. Journal of the American Society for Information Science and Technology.
       Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544-2558.
  19. 19. SentiStrength Algorithm - Core
       • List of 2,489 positive and negative sentiment term stems and strengths (1 to 5), e.g.
           - ache = -2, dislike = -3, hate = -4, excruciating = -5
           - encourage = 2, coolest = 3, lover = 4
       • The sentiment strength of a sentence is that of its strongest term; for a text with several sentences, it is that of the strongest sentence
  20. 20. Examples, scored as (positive, negative)
       • My legs ache. (ache = -2): 1, -2
       • You are the coolest. (coolest = 3): 3, -1
       • I hate Paul but encourage him. (hate = -4, encourage = 2): 2, -4
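The core rule and the three example sentences can be reproduced with a minimal sketch like the one below. It is not the SentiStrength implementation: it uses only the seven terms quoted on the slides, matches whole words rather than stems, and assumes that the positive and negative scales are each taken from the strongest sentence independently.

```python
import re

# The seven term strengths quoted on the slides; the real list has 2,489 stems.
LEXICON = {
    "ache": -2, "dislike": -3, "hate": -4, "excruciating": -5,
    "encourage": 2, "coolest": 3, "lover": 4,
}


def sentence_scores(sentence):
    """Return (positive, negative) from the strongest term of each polarity."""
    positive, negative = 1, -1  # 1 and -1 mean "no sentiment detected"
    for word in re.findall(r"[a-z']+", sentence.lower()):
        strength = LEXICON.get(word, 0)
        positive = max(positive, strength)
        negative = min(negative, strength)
    return positive, negative


def text_scores(text):
    """Score a text by its strongest sentence on each scale (assumed reading)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [sentence_scores(s) for s in sentences] or [(1, -1)]
    return max(p for p, _ in scores), min(n for _, n in scores)


for example in ["My legs ache.", "You are the coolest.", "I hate Paul but encourage him."]:
    print(example, text_scores(example))
```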
  21. 21. Extra sentiment methods
       • Spelling correction: nicce -> nice
       • Booster words alter strength: very happy
       • Negating words flip emotions: not nice
       • Repeated letters boost sentiment/+ve: niiiice
       • Emoticon list: :) = +2
       • Exclamation marks count as +2 unless -ve: hi!
       • Repeated punctuation boosts sentiment: good!!!
       • Negative emotion ignored in questions: u h8 me?
       • Sentiment idiom list: shock horror = -2
       Online as
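Three of these rules (booster words, negating words and repeated letters) can be illustrated with the rough sketch below. The term strengths, the word lists, the size of each boost and the way the rules combine are all assumptions for illustration; the slides do not specify them.

```python
import re

# Hypothetical strengths and word lists, chosen only for illustration.
LEXICON = {"nice": 2, "happy": 2, "good": 2}
BOOSTERS = {"very": 1}
NEGATORS = {"not"}


def clamp(value):
    """Keep the magnitude of a signed strength within 1..5."""
    sign = 1 if value > 0 else -1
    return sign * max(1, min(5, abs(value)))


def term_strengths(text):
    """Yield (word, strength) for each sentiment term after applying the rules."""
    boost, negate = 0, False
    for raw in re.findall(r"[a-z]+", text.lower()):
        # Repeated letters: collapse runs of 3+ identical letters, then boost by 1.
        # (Simplification: this breaks words whose correct spelling has a double letter.)
        word = re.sub(r"(.)\1{2,}", r"\1", raw)
        emphasis = 1 if word != raw else 0
        if word in BOOSTERS:
            boost += BOOSTERS[word]
        elif word in NEGATORS:
            negate = True
        elif word in LEXICON:
            base = LEXICON[word]
            magnitude = abs(base) + boost + emphasis
            strength = magnitude if base > 0 else -magnitude
            if negate:
                strength = -strength  # negating words flip the polarity
            yield word, clamp(strength)
            boost, negate = 0, False
        else:
            boost, negate = 0, False  # modifiers only affect the next word


for example in ["very happy", "not nice", "niiiice"]:
    print(example, list(term_strengths(example)))
```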
  22. 22. Tests against human coders
       SentiStrength agrees with humans as much as they agree with each other.
       Correlation with human coders (1 is perfect agreement, 0 is random agreement):

       Data set          Positive scores   Negative scores
       YouTube           0.589             0.521
       MySpace           0.647             0.599
       Twitter           0.541             0.499
       Sports forum      0.567             0.541
       news              0.352             0.552
       BBC forums        0.296             0.591
       All 6 data sets   0.556             0.565
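A correlation of this kind can be computed as in the sketch below. The slide does not say which correlation coefficient was used; Pearson's r is assumed here, and both score lists are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


human = [1, 2, 3, 4, 5, 2]    # made-up human-coded positive strengths
machine = [1, 2, 2, 4, 4, 3]  # made-up program scores for the same texts
print(round(pearson(human, machine), 3))
```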
  23. 23. Why the bad results for BBC? (and Digg)
       • Irony, sarcasm and expressive language, e.g.,
           - David Cameron must be very happy that I have lost my job.
           - It is really interesting that David Cameron and most of his ministers are millionaires.
           - Your argument is a joke.
  24. 24. 2. Twitter – sentiment in major media events
       • Analysis of a corpus of 1 month of English Twitter posts (35 million, from 2.7M accounts)
       • Automatic detection of spikes (events)
       • Assessment of whether sentiment changes during major media events
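The slide does not describe how spikes were detected. As one possible illustration only, the sketch below flags an hour as a spike when the proportion of tweets mentioning a keyword exceeds a multiple of the average over the preceding hours. The window size, the threshold factor and the data are all assumed values, and this should not be read as the method of the cited paper.

```python
def find_spikes(proportions, window=5, factor=3.0):
    """Return indices whose value exceeds `factor` times the mean of the
    preceding `window` values (a simple, assumed spike rule)."""
    spikes = []
    for i in range(window, len(proportions)):
        baseline = sum(proportions[i - window:i]) / window
        if baseline > 0 and proportions[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up hourly proportions of tweets mentioning one keyword.
hourly = [0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.09, 0.04, 0.01, 0.01]
print(find_spikes(hourly))
```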
  25. 25. Automatically-identified Twitter spikes
       [Figure: proportion of tweets mentioning each keyword over time, 9 Feb 2010 to 9 Mar 2010.]
       Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.
  26. 26. Chile
       [Figure: two time series, 9 Feb 2010 to 9 Mar 2010, plotted against date and time. One panel: proportion of tweets mentioning Chile (% matching posts). Other panel: sentiment strength (average +ve and average -ve sentiment, each also for subjective tweets only). Annotation: increase in -ve sentiment strength.]
  27. 27. #oscars
       [Figure: two time series, 9 Feb 2010 to 9 Mar 2010, plotted against date and time. One panel: proportion of tweets mentioning the Oscars (% matching posts). Other panel: sentiment strength (average +ve and average -ve sentiment, each also for subjective tweets only). Annotation: increase in -ve sentiment strength.]
  28. 28. Sentiment and spikes
       • Statistical analysis of top 30 events:
           - Strong evidence that higher-volume hours have stronger negative sentiment than lower-volume hours
           - No evidence that higher-volume hours have different positive sentiment strength from lower-volume hours
       • => Spikes are typified by small increases in negativity
  29. 29. 3. YouTube Video comments
       • Up to 1,000 comments per video can be gathered via Webometric Analyst (or the YouTube API)
       • Good source of social web text data
       • Analysis of all comments on a pseudo-random sample of 35,347 videos with fewer than 1,000 comments
  30. 31. Using Webometric Analyst
       • Download free from
       • Start the program, select the classic interface, then the YouTube tab
  31. 32. Reply networks
       • Illustrate the replies to a YouTube video in network form
       • Reveal the age and gender of posters
       • Reveal patterns of discussion in the replies (if any)
       • Take up to 25 minutes per video to make with Webometric Analyst
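As a rough illustration of what a reply network is, the sketch below builds one from a list of comments. The input format (pairs of commenter and the user replied to) and the names are made up; this is not how Webometric Analyst stores or processes its data.

```python
# Made-up comments: (commenter, user replied to, or None for a top-level comment).
comments = [
    ("ann", None),
    ("bob", "ann"),
    ("cat", "ann"),
    ("ann", "bob"),
    ("dan", None),
]


def reply_network(comments):
    """Map each commenter (node) to the set of users they replied to (arrows)."""
    network = {}
    for author, target in comments:
        replies = network.setdefault(author, set())
        if target is not None:
            replies.add(target)
            network.setdefault(target, set())  # make sure the target is a node
    return network


def replies_received(network):
    """Count incoming arrows per node."""
    counts = {user: 0 for user in network}
    for targets in network.values():
        for target in targets:
            counts[target] += 1
    return counts


network = reply_network(comments)
print(network)
print(replies_received(network))
```

Node attributes such as self-declared age and gender, and arrow attributes such as reply sentiment, could be stored in separate dictionaries keyed by user or by (author, target) pair.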
  32. 33. Reply network
       [Figure: reply network for the “2x2=5” video, showing extended core interactions.]
       • Nodes (people): blue = male, pink = female
       • Arrows (replies): red = happy replies, black = angry replies
  33. 34. The 10 most ridiculous Black Metal videos
       • A very sparse reply network
       • Nodes are mostly connected in 2s and 3s
  34. 35. Black Metal vs. Deathcore
       • Denser reply network
       • Ongoing debates and contentious issues
  35. 36. Other networks
       • Connections can be:
           - Friendship in YouTube (must be reciprocal)
           - Subscription in YouTube (non-reciprocal, based upon interest in video content)
           - Friends in common in YouTube: suggests factors in common (e.g., bands) rather than people in common
           - Subscriptions in common in YouTube: again suggests factors in common (e.g., bands) rather than people in common
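A "friends in common" network connects two commenters when their Friend lists overlap. A minimal sketch, using made-up Friend lists and an assumed edge weight (the number of shared Friends), could look like this:

```python
from itertools import combinations

# Made-up Friend lists for three commenters on one video.
friends = {
    "ann": {"band1", "zoe"},
    "bob": {"band1"},
    "cat": {"yan"},
}


def common_friend_edges(friends):
    """Return {(user_a, user_b): number of shared Friends} for overlapping pairs."""
    edges = {}
    for a, b in combinations(sorted(friends), 2):
        shared = friends[a] & friends[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges


print(common_friend_edges(friends))
```

The same function applies unchanged to subscriptions in common if the sets hold subscribed channels instead of Friends.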
  36. 37. Very sparse Friend network (a common pattern)
  37. 38. Common Friends network – with a densely connected core
  38. 39. Large-scale analysis of YouTube
       • Purpose: to discover patterns, norms and unusual behaviour in YouTube
       • Method:
           - Generate a large sample of YouTube videos by
               - running searches for many terms from a large word list
               - selecting a video at random from each set of results
           - Extract properties of the videos and commenters
           - Calculate averages and distributions
           - Examine extreme videos
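The sampling step of this method can be sketched as below. The `search` argument stands in for whatever video search call is available; `fake_search` and its index are invented here so that the example runs without network access.

```python
import random


def sample_videos(word_list, search, n_terms, seed=0):
    """Pick one random result per search term, skipping terms with no results.

    `search` is any callable mapping a term to a list of video IDs.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    sample = set()
    for term in rng.sample(word_list, min(n_terms, len(word_list))):
        results = search(term)
        if results:
            sample.add(rng.choice(results))
    return sample


# Invented stand-in for a real search call.
FAKE_INDEX = {"cat": ["v1", "v2"], "dog": ["v3"], "xyzzy": []}


def fake_search(term):
    return FAKE_INDEX.get(term, [])


print(sample_videos(list(FAKE_INDEX), fake_search, n_terms=3))
```

Because search engines rank popular videos first, a sample drawn this way is pseudo-random rather than uniformly random, as the earlier slide on video comments notes.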
  39. 40. Comments are mainly positive
  40. 42. Controversial and non-controversial?
  41. 43. Conclusions
       • Sentiment analysis allows large-scale social web analysis
       • YouTube videos can be analysed easily with a network perspective on their comments
           - Webometric Analyst
       • Can make a reply network of the discussions of video _-OAsF9uRfc by following the instructions at