Mike Thelwall: Introduction to Webometrics

Presentation to the second LIS DREaM workshop held at the British Library on Monday 30th January 2012.

More information available at: http://lisresearch.org/dream-project/dream-event-3-workshop-monday-30-january-2012/

Published in: Technology
Notes
  • Example of highly popular amateur video triggering lots of discussions
  • Webometric Analyst uses the YouTube API to download information on one or more videos. (Webometric Analyst is Windows only.)
  • Friend network
  • Friends in common
  • Intelligent design debate
  • Arjona link
  • Mike Thelwall: Introduction to Webometrics

    1. 1. Introduction to Webometrics. Mike Thelwall (@mikethelwall), Professor of Information Science, Statistical Cybermetrics Research Group, University of Wolverhampton
    2. 2. Reminder of pre-workshop task Delegates were asked to join YouTube and leave comments and replies to earlier comments on the video: Department of Library and Information Science, Delhi http://www.youtube.com/watch?v=_-OAsF9uRfc These contributions will form part of the discussion at the end of the session, and include reference to the self-declared age and gender information from YouTube.
    3. 3. Overview of “Webometrics”
         • What is Webometrics?
           • Gathering, processing and analysing large-scale data from the web (web pages, hyperlinks, blogs, Web 2.0) for many purposes that include online communication
         • What can Webometrics offer other researchers?
           • Software to gather data from web sites, search engines, social network sites and blogs; methods to extract useful patterns
         • Common data sources
           • Webometric Analyst software: Twitter, YouTube, the Web, Technorati, Bing
           • Bespoke: any resource with an API, page scraping of other sites, SocSciBot web crawler
         http://lexiurl.wlv.ac.uk
    4. 4. 1. Background: Webometrics
         • Webometrics is about gathering data on the Web, and measuring aspects of the Web:
           • web sites
           • web pages
           • hyperlinks
           • YouTube video commenter networks
           • web search engine results
           • MySpace Friend networks
           • Twitter or blog trends
         • … for varied social science purposes
    5. 5. New problems: Web-based phenomena
         • Webometrics can analyse online academic communication
           • Why do academic web sites interlink?
           • Which academic web sites interlink?
           • What academic interlinking patterns exist?
           • Which web sites/groups/documents have the most online impact, and why?
    6. 6. Old problems: Offline phenomena reflected online
         • Some offline phenomena have measurable online reflections
           • International communication
           • Inter-university collaboration
           • University-business collaboration
           • The impact or spread of ideas
           • Public opinion about science
    7. 7. Example: The online impact of research groups (NetReAct)
    8. 8. Example: Links between EU universities
         [Figure: network diagram; normalised linking, smallest countries removed; annotation: "Geopolitical connected". Node labels: Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France, NL]
    9. 9. International biofuels research network
    10. 10. Data Gathering/Processing Tools
         • Webometric Analyst: web citations, web text, YouTube, Flickr, Technorati
           • Submits thousands of queries to Bing and summarises the results in standard ways
         • SocSciBot: links, web text
           • Web crawler and analyser
    11. 11. 2. Altmetrics in traditional research evaluation
         • Altmetrics can supplement traditional citation impact with non-traditional online impact
           • E.g., educational, discussion-based
         • Often weaker than citation data but useful for research groups that have non-standard types of impact
    12. 12. The Integrated Online Impact Indicator (IOI)
         • Combines a range of online sources into one indicator
           • Google Scholar + Google Books + Course reading lists + Google Blogs + PowerPoint presentations = IOI
         • OR select individual separate components
         Invented by Kayvan Kousha
    13. 13. New source 1: Google Scholar
         • Wider evidence of academic impact
         • Wider types of academic publications, some non-academic publications
         • Not reliable
         • Coverage variable
         • Can’t be automatically queried
         • Free
    14. 14. New source 2: Google Books
         • Books typically not indexed in WoS or Scopus
         • Relevant in book-based disciplines (arts, humanities, some social sciences)
         • Reliability unknown but probably not good
         • Coverage variable
         • Can be automatically queried
         • Free [Clifford Lynch]
    15. 15. New source 3: Course reading lists
         • Evidence of educational impact
         • Can automatically construct queries to detect individual articles in online syllabuses
         • Get results via advanced Google/Yahoo/Live Search queries
         • Works for most articles
           • Fails for short common article titles
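
The query construction mentioned on this slide could look roughly like the sketch below. The slides do not give the actual query syntax, so the phrase search, the `syllabus OR "reading list"` keywords, the function name `syllabus_query` and the `min_words` threshold are all assumptions made for illustration.

```python
def syllabus_query(title, min_words=4):
    """Build a phrase-search query for one article title.

    Returns None for short titles, which tend to match many unrelated
    pages. The threshold and the query keywords are assumptions.
    """
    words = title.replace('"', "").split()
    if len(words) < min_words:
        return None
    phrase = " ".join(words)
    return f'"{phrase}" (syllabus OR "reading list")'


print(syllabus_query("Sentiment strength detection in short informal text"))
print(syllabus_query("Web impact"))
```

Returning `None` for short titles mirrors the slide's caveat that the approach fails for short, common article titles.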
    16. 16. New source 4: Blogs
         • Evidence of impact on discussions
         • Educational impact, public dissemination evidence, academic impact in discursive subjects?
         • Not possible to automate in the largest database (Google Blogs)?
         • Not a well-researched area
    17. 17. New source 5: PowerPoint Presentations
         • Evidence of educational/scholarly impact
         • Especially relevant for discursive subjects?
         • Automated Live Search/Yahoo advanced queries
         • IOI = a*Scholar + b*PowerPoint + c*Blogs + d*Syllabus + e*Books
         Or use qualitative analyses of the different sources
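
As a minimal sketch, the IOI formula on this slide is a weighted sum of per-source counts. The slides do not state values for the weights a to e, so the weights and the counts below are placeholders, and the function name `ioi` is hypothetical.

```python
def ioi(counts, weights):
    """Weighted sum of online impact counts: sum of weight * count per source."""
    return sum(weights[source] * counts.get(source, 0) for source in weights)


# Placeholder counts for one article (invented for illustration only).
counts = {"scholar": 12, "powerpoint": 3, "blogs": 5, "syllabus": 2, "books": 4}

# Equal weights (a = b = c = d = e = 1) are an assumption, not the author's choice.
weights = {"scholar": 1.0, "powerpoint": 1.0, "blogs": 1.0, "syllabus": 1.0, "books": 1.0}

print(ioi(counts, weights))  # 26.0
```

Keeping the weights in a separate mapping makes it easy to drop a source (weight 0) when only individual components are wanted.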
    18. 18. 2. Sentiment Strength Detection in the Social Web with SentiStrength
         • Detect positive and negative sentiment strength in short informal text
           • Develop workarounds for lack of standard grammar and spelling
           • Harness emotion expression forms unique to MySpace or CMC (e.g., :-) or haaappppyyy!!!)
           • Classify simultaneously as positive 1-5 AND negative 1-5 sentiment
         Thelwall, M., Buckley, K., & Paltoglou, G. (in press). Sentiment strength detection for the social Web. Journal of the American Society for Information Science and Technology.
         Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544-2558.
    19. 19. SentiStrength Algorithm - Core
         • List of 2,489 positive and negative sentiment term stems and strengths (1 to 5), e.g.
           • ache = -2, dislike = -3, hate = -4, excruciating = -5
           • encourage = 2, coolest = 3, lover = 4
         • The sentiment strength of a sentence is that of its strongest term; for a text with multiple sentences, it is that of the strongest sentence
    20. 20. Examples (term strengths, then positive and negative scores)
         • My legs ache. (ache = -2): positive 1, negative -2
         • You are the coolest. (coolest = 3): positive 3, negative -1
         • I hate Paul but encourage him. (hate = -4, encourage = 2): positive 2, negative -4
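
The core rule and the three examples can be reproduced with a toy lexicon lookup. This is only a sketch of the idea on the slides, not the SentiStrength implementation: it matches whole words rather than term stems, uses only the seven terms quoted above, and the function names are hypothetical.

```python
import re

# The seven example terms from the slides; the real list has 2,489 stems.
LEXICON = {"ache": -2, "dislike": -3, "hate": -4, "excruciating": -5,
           "encourage": 2, "coolest": 3, "lover": 4}


def score_sentence(sentence):
    """Return (positive, negative): the strongest term of each polarity."""
    positive, negative = 1, -1  # 1 and -1 mean "no sentiment"
    for word in re.findall(r"[a-z']+", sentence.lower()):
        strength = LEXICON.get(word, 0)
        if strength > 0:
            positive = max(positive, strength)
        elif strength < 0:
            negative = min(negative, strength)
    return positive, negative


def score_text(text):
    """Score each sentence and keep the strongest positive and negative."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [score_sentence(s) for s in sentences] or [(1, -1)]
    return max(p for p, _ in scores), min(n for _, n in scores)


for example in ("My legs ache.", "You are the coolest.",
                "I hate Paul but encourage him."):
    print(example, score_text(example))
```

Scoring positive and negative strength separately is what lets the third example come out as both mildly positive and strongly negative.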
    21. 21. Extra sentiment methods
         • Spelling correction: nicce -> nice
         • Booster words alter strength: very happy
         • Negating words flip emotions: not nice
         • Repeated letters boost sentiment/+ve: niiiice
         • Emoticon list: :) = +2
         • Exclamation marks count as +2 unless -ve: hi!
         • Repeated punctuation boosts sentiment: good!!!
         • Negative emotion ignored in questions: u h8 me?
         • Sentiment idiom list: shock horror = -2
         Online at http://sentistrength.wlv.ac.uk/
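
A few of these extra rules (booster words, negating words, repeated letters, the emoticon list) can be sketched as adjustments to a term's lexicon strength. The strengths for "nice" and "happy", the size of each boost and the cap at 5 are assumptions made for illustration; the slides give only `:) = +2`, and the actual SentiStrength rules may differ.

```python
import re

LEXICON = {"nice": 2, "happy": 2}  # assumed strengths, not from the slides
BOOSTERS = {"very": 1}             # assumed boost of one point
NEGATORS = {"not"}
EMOTICONS = {":)": 2}              # from the slide


def lookup(word):
    """Return (strength, emphasised), undoing runs of 3+ repeated letters."""
    if word in LEXICON:
        return LEXICON[word], False
    for replacement in (r"\1", r"\1\1"):  # try a single, then a double letter
        collapsed = re.sub(r"(.)\1{2,}", replacement, word)
        if collapsed != word and collapsed in LEXICON:
            return LEXICON[collapsed], True
    return 0, False


def term_strengths(text):
    """List the adjusted strength of each sentiment term found in text."""
    strengths = []
    boost, negate = 0, False
    for token in text.lower().split():
        if token in EMOTICONS:
            strengths.append(EMOTICONS[token])
            continue
        word = token.strip(".,!?")
        if word in BOOSTERS:
            boost += BOOSTERS[word]
            continue
        if word in NEGATORS:
            negate = True
            continue
        strength, emphasised = lookup(word)
        if strength:
            magnitude = min(5, abs(strength) + boost + (1 if emphasised else 0))
            sign = 1 if strength > 0 else -1
            strengths.append(-sign * magnitude if negate else sign * magnitude)
        boost, negate = 0, False  # modifiers apply only to the next word
    return strengths


for example in ("very happy", "not nice", "niiiice", "hi :)"):
    print(example, term_strengths(example))
```

Spelling correction, punctuation rules, question handling and idioms are left out to keep the sketch short.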
    22. 22. Tests against human coders
         • SentiStrength agrees with humans as much as they agree with each other
         • 1 is perfect agreement, 0 is random agreement
         • Correlation with humans by data set (positive scores, negative scores):
           • YouTube: 0.589, 0.521
           • MySpace: 0.647, 0.599
           • Twitter: 0.541, 0.499
           • Sports forum: 0.567, 0.541
           • Digg.com news: 0.352, 0.552
           • BBC forums: 0.296, 0.591
           • All 6 data sets: 0.556, 0.565
    23. 23. Why the bad results for BBC? (and Digg)
         • Irony, sarcasm and expressive language, e.g.,
           • David Cameron must be very happy that I have lost my job.
           • It is really interesting that David Cameron and most of his ministers are millionaires.
           • Your argument is a joke.
    24. 24. 2. Twitter: sentiment in major media events
         • Analysis of a corpus of 1 month of English Twitter posts (35 million, from 2.7M accounts)
         • Automatic detection of spikes (events)
         • Assessment of whether sentiment changes during major media events
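
The slides do not say how spikes are detected. One simple possibility, shown here purely as an assumption, is to compute the hourly proportion of tweets mentioning a keyword and flag hours where it jumps well above the average of the preceding hours. The function name, the `window` and `factor` parameters and the hourly counts are all invented for illustration.

```python
def find_spikes(mentions, totals, window=3, factor=5.0):
    """Return indexes of hours whose keyword proportion is at least
    `factor` times the mean proportion of the previous `window` hours."""
    proportions = [m / t if t else 0.0 for m, t in zip(mentions, totals)]
    spikes = []
    for i in range(window, len(proportions)):
        baseline = sum(proportions[i - window:i]) / window
        if baseline > 0 and proportions[i] >= factor * baseline:
            spikes.append(i)
    return spikes


# Invented hourly counts: tweets mentioning a keyword, and all tweets.
mentions = [2, 2, 2, 2, 30, 4]
totals = [1000, 1000, 1000, 1000, 1000, 1000]
print(find_spikes(mentions, totals))  # [4]
```

Working with proportions rather than raw counts keeps the check from firing just because overall Twitter volume rises at certain times of day.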
    25. 25. Automatically-identified Twitter spikes
         [Figure: proportion of tweets mentioning keyword, 9 Feb 2010 to 9 Mar 2010]
         Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.
    26. 26. Chile
         [Figure, two panels against date and time, 9 Feb 2010 to 9 Mar 2010: (1) % matching posts, the proportion of tweets mentioning Chile; (2) sentiment strength, with series Av. +ve sentiment, Av. -ve sentiment and "Just subj." variants. Annotation: increase in -ve sentiment strength]
    27. 27. #oscars
         [Figure, two panels against date and time, 9 Feb 2010 to 9 Mar 2010: (1) % matching posts, the proportion of tweets mentioning the Oscars; (2) sentiment strength, with series Av. +ve sentiment, Av. -ve sentiment and "Just subj." variants. Annotation: increase in -ve sentiment strength]
    28. 28. Sentiment and spikes
         • Statistical analysis of top 30 events:
           • Strong evidence that higher volume hours have stronger negative sentiment than lower volume hours
           • No evidence that higher volume hours have different positive sentiment strength than lower volume hours
         • => Spikes are typified by small increases in negativity
    29. 29. 3. YouTube Video comments
         • 1000 comments per video via Webometric Analyst (or the YouTube API)
         • Good source of social web text data
         • Analysis of all comments on a pseudo-random sample of 35,347 videos with fewer than 1000 comments
    30. 31. Using Webometric Analyst
         • Download free from http://lexiurl.wlv.ac.uk
         • Start, select classic interface, YouTube tab
    31. 32. Reply networks
         • Illustrate the replies to a YouTube video in network form
         • Reveal age and gender of posters
         • Reveal patterns of discussion in the replies (if any)
         • Take up to 25 minutes to make per video with Webometric Analyst
    32. 33. Reply network
         [Figure: reply network for the "2x2=5" video; annotation: extended core interactions]
         • Nodes (people): blue = male, pink = female
         • Arrows (replies): red = happy replies, black = angry replies
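
A reply network of this kind is a directed graph in which each commenter is a node and each reply is an arrow from the replier to the person replied to. The sketch below builds one from (replier, original poster) pairs and reports the sizes of its connected groups. The input format, the function names and the usernames are assumptions for illustration; Webometric Analyst's own data format is not shown in the slides.

```python
from collections import Counter


def build_reply_network(replies):
    """Build nodes and weighted directed edges from (replier, target) pairs."""
    nodes, edges = set(), Counter()
    for replier, target in replies:
        nodes.update((replier, target))
        if replier != target:  # ignore people replying to themselves
            edges[(replier, target)] += 1
    return nodes, edges


def component_sizes(nodes, edges):
    """Sizes of connected groups, ignoring arrow direction, largest first."""
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Invented usernames, for illustration only.
replies = [("ann", "bob"), ("bob", "ann"), ("cat", "dan"), ("eve", "dan")]
nodes, edges = build_reply_network(replies)
print(component_sizes(nodes, edges))  # [3, 2]
```

Many small groups of two or three suggest a sparse network, while one large group suggests an ongoing debate.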
    33. 34. The 10 most ridiculous Black Metal videos
         • A very sparse reply network
         • Nodes are mostly connected in 2s and 3s
    34. 35. Black Metal vs. Deathcore
         • Denser reply network
         • Ongoing debates and contentious issues
    35. 36. Other networks
         Connections can be:
         • Friendship in YouTube (must be reciprocal)
         • Subscription in YouTube (non-reciprocal, based upon interest in video content)
         • Friends in common in YouTube
           • suggests factors in common (e.g., bands) rather than people in common
         • Subscriptions in common in YouTube
           • again suggests factors in common (e.g., bands) rather than people in common
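
A "friends in common" (or "subscriptions in common") network links two commenters whenever their friend lists overlap. A minimal sketch, assuming the lists are already available as sets; the function name, the `min_shared` parameter and the sample data are hypothetical.

```python
from itertools import combinations


def common_friend_edges(friends, min_shared=1):
    """Link each pair of users who share at least `min_shared` friends.

    `friends` maps a user to the set of their friends (or subscriptions).
    Returns {(user_a, user_b): number_shared}.
    """
    edges = {}
    for a, b in combinations(sorted(friends), 2):
        shared = friends[a] & friends[b]
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)
    return edges


# Invented data: two commenters are both friends with the same band account.
friends = {
    "user1": {"bandA", "bandB"},
    "user2": {"bandA"},
    "user3": {"bandC"},
}
print(common_friend_edges(friends))  # {('user1', 'user2'): 1}
```

In this toy data the link between user1 and user2 comes from a shared band account, which matches the slide's point that such links suggest factors in common rather than people in common.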
    36. 37. Very sparse Friend network = common
    37. 38. Common Friends network – with a densely connected core
    38. 39. Large-scale analysis of YouTube
         • Purpose: to discover patterns, norms and unusual behaviour in YouTube
         • Method:
           • Generate a large sample of YouTube videos by
             • running searches for many terms from a large word list
             • selecting a video at random from each set of results
           • Extract properties of the videos and commenters
           • Calculate averages and distributions
           • Examine extreme videos
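
The sampling step in this method can be sketched as follows. The `search` argument is a hypothetical callable standing in for whatever performs the YouTube search (no real API call is made here), and the fake search results are invented for illustration.

```python
import random


def sample_videos(terms, search, rng=None):
    """Pick one video at random from the results of each search term.

    `search` is a hypothetical callable: term -> list of video IDs.
    Terms with no results are skipped.
    """
    rng = rng or random.Random()
    sample = []
    for term in terms:
        results = list(search(term))
        if results:
            sample.append(rng.choice(results))
    return sample


# A stand-in search function with invented results.
fake_results = {"cat": ["v1", "v2"], "dog": ["v3"], "zzz": []}
sample = sample_videos(["cat", "dog", "zzz"], fake_results.get, random.Random(0))
print(sample)
```

Passing in a seeded `random.Random` makes a sample reproducible, which helps when the averages and distributions need to be recomputed later.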
    39. 40. Comments are mainly positive
    40. 42. Controversial and non-controversial ?
    41. 43. Conclusions
         • Sentiment analysis allows large-scale social web analysis
         • YouTube videos can be analysed easily with a network perspective on their comments
         • Webometric Analyst: http://lexiurl.wlv.ac.uk/
         • Can make a reply network of the discussions of video _-OAsF9uRfc by following the instructions at http://lexiurl.wlv.ac.uk/searcher/youtubereplies.html
