
Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift

NORSLIS PhD course in informetrics, Umeå University, Sweden, 18 June 2008

  • Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift

    1. Webometrics 1.0: from AltaVista to Small Worlds and Genre Drift. Lennart Björneborn, Royal School of Library and Information Science. NORSLIS PhD course in informetrics, Umeå, 18.6.2008
    2. outline
       • webometrics 1.0
           • birth of webometrics
           • early webometric research
       • two webometric studies
           • small-world link analysis
               • based on graph theory and social network analysis
           • genre connectivity analysis
       [Image: M.C. Escher, House of Stairs, 1951]
    3. WWW = largest network with available connectivity data. Wood et al. (1995)
    4. WWW = collaborative weaving = macro-level aggregations of micro-level interactions = reflect social, cultural formations. Wood et al. (1995)
    5. WWW = keep track of “the complex web of relationships between people, programs, machines and ideas” (Tim Berners-Lee, 1997). Wood et al. (1995)
    6. birth of webometrics
       • citation analogy
           • link = implicit recommendation of webpage
           • though also negative references
       • ‘Webometrics’ 1997 + ‘Web Impact Factor’ 1998
           • Almind & Ingwersen (1997). Informetric analyses on the World Wide Web: methodological approaches to ‘webometrics’.
           • Ingwersen (1998). The calculation of Web impact factors.
       • Google ‘PageRank’ 1998
           • exploit link structures: who receives many links from someone who also receives many links from someone who also … ?
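The recursive idea in the last bullet (a link counts for more when it comes from a page that itself receives many links) can be sketched as a simple iteration. This is a simplified illustration and not Google's implementation; the function name, the damping value of 0.85, and the toy link data are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively score pages: a link from a high-scoring page counts for more."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outlinks: spread its score evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: C is linked from A and B, D only from C, A only from D.
toy_links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A"]}
scores = pagerank(toy_links)
```

In this toy web, C ends up with the highest score because it receives two inlinks, while B, which receives none, keeps only the baseline share.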
    7. birth of webometrics: access to link data*
       • linkdomain:norslis.net -site:norslis.net
       • link:www.norslis.net -site:norslis.net
       (* cf. breakthrough of bibliometrics: access to citation data)
    8. linkdomain:norslis.net -site:norslis.net
    9. basic link terminology
       • B has an inlink from A: ~ citation
       • B has an outlink to C: ~ reference
       • B has a selflink: ~ self-citation
       • C and D have co-inlinks from B: ~ co-citation
       • B and E have co-outlinks to D: ~ bibliographic coupling
       [Figure: link diagram with nodes A to H illustrating co-links (Björneborn 2004)]
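The link terminology above can be made concrete with a small sketch. The edge list contains only the links stated in the bullet points (the full diagram with nodes A to H is not reproduced here), and the function names are hypothetical helpers for this illustration.

```python
from itertools import combinations

# Links read off the bullet points: A->B, B->C, B->B (selflink), B->D, E->D.
links = [("A", "B"), ("B", "C"), ("B", "B"), ("B", "D"), ("E", "D")]


def inlinks(node):
    """Sources linking to node (~ citations), selflinks excluded."""
    return sorted(s for s, t in links if t == node and s != node)


def outlinks(node):
    """Targets node links to (~ references), selflinks excluded."""
    return sorted(t for s, t in links if s == node and t != node)


def has_selflink(node):
    """True if node links to itself (~ self-citation)."""
    return (node, node) in links


def co_inlinked_pairs():
    """Pairs of nodes receiving links from the same source (~ co-citation)."""
    pairs = set()
    for source in {s for s, _ in links}:
        pairs.update(combinations(outlinks(source), 2))
    return pairs


def co_outlinking_pairs():
    """Pairs of nodes linking to the same target (~ bibliographic coupling)."""
    pairs = set()
    for target in {t for _, t in links}:
        pairs.update(combinations(inlinks(target), 2))
    return pairs
```

With these five links, `co_inlinked_pairs()` yields the pair (C, D) and `co_outlinking_pairs()` yields (B, E), matching the last two bullets.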
    10. some proposed web metrics
        • Netometrics (Bossy, 1995)
            • supplement bibliometrics and scientometrics in observing “science in action” on the Internet
        • Webometry (Abraham, 1996)
        • Internetometrics (Almind & Ingwersen, 1996)
        • Webometrics (Almind & Ingwersen, 1997)
        • Cybermetrics (journal started 1997 by Isidro Aguillo)
        • Web bibliometry (Chakrabarti et al., 2002)
    11. some related web science
        • Web Mining (e.g., Etzioni, 1996; Kosala & Blockeel, 2000)
        • Web Ecology (e.g., Pitkow, 1997; Chi et al., 1998; Huberman, 2001)
        • Cyber Geography (e.g., Girardin, 1995)
        • Cyber Cartography (e.g., Dodge, 1999)
        • Web Graph Analysis (e.g., Kleinberg et al., 1999; Broder et al., 2000)
        • Web Dynamics (e.g., Levene & Poulovassilis, 2001)
        • Webology (journal started 2004 by Alireza Noruzi)
        • Web Science (Berners-Lee et al., 2006)
    12. webometrics
        • the study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric approaches (Björneborn 2004)
        [Figure labels: informetrics, bibliometrics, scientometrics, webometrics, cybermetrics]
    13. webometrics
        • four main research areas of webometric concern:
            • web page content analysis;
            • web link structure analysis;
            • web usage analysis (e.g., log files);
            • web technology analysis (e.g., search engine performance)
        [Figure labels: informetrics, bibliometrics, scientometrics, webometrics, cybermetrics] (Björneborn 2004)
    14. web data collection
        • non-standardized, messy data
            • due to diversified, distributed, dynamic web
            • lack of metadata
        • primary data
            • own web crawler (beware: robot exclusion)
            • direct access to web servers incl. log files
            • Internet Archive (www.archive.org)
            • manual collection with browser
        • secondary data
            • search engines (beware: deficiencies)
        • necessary data cleansing
            • mirror sites, variant names, typo domains + links
            • many file formats, including misspellings
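Two of the points above, robot exclusion and the cleansing of variant names, can be sketched with Python's standard library. The robots.txt content, the bot name, the example URLs, and the rule that a leading `www.` is a mere name variant are all assumptions for illustration; a real crawler would fetch each site's own robots.txt and apply cleansing rules suited to its data.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="example-webometrics-bot"):
    """Respect robot exclusion: only fetch URLs the rules allow."""
    return parser.can_fetch(user_agent, url)


def normalize_host(url):
    """Merge simple name variants: lower-case the host and drop a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

Mirror sites and typo domains cannot be caught by such a simple rule and usually call for manual inspection or a lookup table.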
    15. examples of webometric analysis
        • power-law distributions
            • e.g. pages, outlinks, inlinks, visits per web site (Adamic & Huberman 2001)
        • correlation between research indicators and inlinks
            • e.g. UK, Taiwan, Australia (several studies by Thelwall et al.)
            • EU projects EICSTES + WISER
        • co-inlink cluster analysis
            • analogous to co-citation analysis
            • e.g. EU universities (Polanco et al. 2001)
            • e.g. Chinese IT companies (Vaughan & You 2005)
        • longitudinal studies
            • web page change and permanence (e.g. Koehler 2004)
    16. http://www.scit.wlv.ac.uk/~cm1993/mtpublications.html
    17. small-world link analysis. Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD Thesis. www.db.dk/LB. Based on graph theory and social network analysis
    18. graph theory: Leonhard Euler (1707-1783), Königsberg (Wilson & Watkins 1990)
    19. graph theory
        • graph = mathematical modeling of network
            • directed graph: e.g. WWW
        • nodes (or vertices): A, B, C, D, E
        • edges (if directed: arcs, links): AC, EB, ...
        • degree: d(A) = 3; outdegree: d_O(A) = 2; indegree: d_I(A) = 1
        • directed walk: ACB: path length = 2
        • geodesic distance: length of the shortest path between 2 nodes
        • centrality
            • global centrality: least sum of geodesic distances
            • betweenness centrality: most shortest paths pass through the node
        Gross & Yellen (1999). Graph theory and its applications.
        [Figure: directed graph with nodes A, B, C, D, E]
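The degree and distance notions above can be checked on a small directed graph. The slide's figure is not reproduced, so the arc set below is an assumption: it contains the two arcs named on the slide (AC and EB) and the walk A-C-B, and the remaining arcs are chosen so that the outdegree and indegree of A match the values given.

```python
from collections import deque

# Assumed arcs: AC and EB are named on the slide, CB follows from the walk ACB;
# AD and BA are added so that outdegree(A) = 2 and indegree(A) = 1.
arcs = [("A", "C"), ("A", "D"), ("B", "A"), ("C", "B"), ("E", "B")]


def outdegree(node):
    """Number of arcs leaving node."""
    return sum(1 for s, _ in arcs if s == node)


def indegree(node):
    """Number of arcs arriving at node."""
    return sum(1 for _, t in arcs if t == node)


def geodesic_distance(start, goal):
    """Length of the shortest directed path, or None if goal is unreachable."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return distance[current]
        for s, t in arcs:
            if s == current and t not in distance:
                distance[t] = distance[current] + 1
                queue.append(t)
    return None
```

Here the total degree of A is `outdegree("A") + indegree("A")`, and the breadth-first search finds the walk A-C-B as the shortest route from A to B. Because the graph is directed, E cannot be reached from A at all.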
    20. graph theory applications
        • graph theory used for mathematical modeling of networks
            • e.g., biology, chemistry, physics, sociology, psychology, technology
        • also applied in information sciences incl. bibliometrics
            • citation networks (e.g., Garner, 1967; Doreian & Fararo, 1985; Hummon & Doreian, 1989; Shepherd, Watters & Cai, 1990; Egghe & Rousseau, 1990; Fang & Rousseau, 2001; Egghe & Rousseau, 2002; 2003a; 2003b)
            • information systems (e.g., Korfhage, Bhat & Nance, 1972)
            • hypertextual networks (e.g., Botafogo & Shneiderman, 1991; Smeaton, 1995; Furner, Ellis & Willett, 1996)
    21. social network analysis
        • relations between actors in social network
        • sociometry: 1930s (Moreno), sociograms
        • social networks: 1950s, social network analysis
        • makes use of mathematical graph theory
            • Wasserman & Faust (1994). Social network analysis: methods and applications. Cambridge University Press.
            • Otte & Rousseau (2002). Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6): 441-454
    22. small-world networks (Watts & Strogatz 1998)
        • small-world = highly clustered + short paths
            • short distances through shortcuts between clusters in network
            • small-world = short local + short global distances
            • efficient diffusion of signals, contacts, ideas, viruses, etc. in networks
        • social network analysis in 1960s: ‘six degrees of separation’
            • today: ‘small worlds’ in biological, chemical, technical, social networks
            • brains, epidemics, scientific collaboration, semantic networks etc.
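The combination of high clustering and short paths can be illustrated with a toy network. The sketch below builds a regular ring lattice and a copy with two added shortcuts, then compares clustering and mean path length. The network size, the shortcut positions, and the function names are arbitrary assumptions; this is not the rewiring procedure of Watts & Strogatz, only an illustration of the effect of shortcuts.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Undirected ring where each node links to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering(adj):
    """Mean share of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        pairs = list(combinations(nbrs, 2))
        if pairs:
            total += sum(1 for a, b in pairs if b in adj[a]) / len(pairs)
    return total / len(adj)


def mean_path_length(adj):
    """Mean shortest-path length over all ordered pairs (graph assumed connected)."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    n = len(adj)
    return total / (n * (n - 1))


lattice = ring_lattice(40, 4)
shortcut = ring_lattice(40, 4)
for a, b in [(0, 20), (10, 30)]:  # two hypothetical shortcuts across the ring
    shortcut[a].add(b)
    shortcut[b].add(a)
```

Comparing `mean_path_length(lattice)` with `mean_path_length(shortcut)` shows the distances shrinking, while `clustering(shortcut)` stays close to the lattice value: a few shortcuts shorten global distances without dissolving the local clusters.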
    23. small-world web
        • most links connect similar topics → topical clusters
        • small-world web → cross-topic shortcuts
    24. small-world link analysis. Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD Thesis. www.db.dk/LB
        • main research question
        • what types of web links, web pages and web sites function as cross-topic connectors in small-world link structures across an academic web space?
            • objective: identify micro-level aspects of how small-world phenomena emerge
    25. UK link data 2001
        • 109 UK universities
            • web crawler, Thelwall
        • 7669 subsites
            • www.hum.port.ac.uk
            • www.atm.ox.ac.uk
            • ...
            • departments, centres, research groups, etc.
        • connections between 7669 subsites
            • 207 865 links
            • 105 817 web pages
    26. ‘corona’ graph model: reachability structures (Björneborn 2004)
        • 1893 SCC: Strongly Connected Component
        • 626 IN: traversable to SCC
        • 2660 OUT: reachable from SCC
        • 96 IN-Tendrils: connected from IN
        • 55 OUT-Tendrils: connected to OUT
        • 7 Tube: connecting IN to OUT
        • 2332 Disconnected
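The seven component sizes on this slide add up to the 7669 subsites of the data set. How the core components of such a reachability model can be derived from link data is sketched below: the strongly connected component is the set of nodes that can both reach and be reached from a chosen core node, IN can reach it, and OUT is reachable from it. The function names and the toy graph are hypothetical, and tendrils, tubes and disconnected nodes are not handled in this minimal version.

```python
def reachable(start, adjacency):
    """All nodes reachable from start by following adjacency (start included)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def core_components(arcs, core_node):
    """Split nodes into SCC, IN and OUT relative to the SCC containing core_node."""
    forward, backward = {}, {}
    for s, t in arcs:
        forward.setdefault(s, set()).add(t)
        backward.setdefault(t, set()).add(s)
    downstream = reachable(core_node, forward)   # reachable from the core node
    upstream = reachable(core_node, backward)    # can reach the core node
    scc = downstream & upstream
    return {"SCC": scc, "IN": upstream - scc, "OUT": downstream - scc}


# Hypothetical toy graph: p -> (a <-> b) -> q
toy_arcs = [("p", "a"), ("a", "b"), ("b", "a"), ("b", "q")]
parts = core_components(toy_arcs, "a")

# Component sizes as listed on the slide.
slide_sizes = [1893, 626, 2660, 96, 55, 7, 2332]
```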
    27. 10 seed nodes (stratified sampling in SCC); 10 path nets with all shortest link paths between five pairs of topically dissimilar subsites
        • eye.ox.ac.uk: Ophthalmology Dept [eye research], Oxford
        • geog.plym.ac.uk: Geography Dept, Plymouth
        • palaeo.gly.bris.ac.uk: Palaeontology Research Group, Earth Sciences Dept, Bristol
        • speech.essex.ac.uk: Speech Research Group, Linguistics Dept, Essex
        • maths.gcal.ac.uk: Mathematics Dept, Glasgow Caledonian
        • psy.man.ac.uk: Psychology Dept, Manchester
        • chem.gla.ac.uk: Chemistry Dept, Glasgow
        • economics.soton.ac.uk: Economics Dept, Southampton
        • atm.ox.ac.uk: Atmospheric, Oceanic and Planetary Physics, Oxford
        • hum.port.ac.uk: Faculty of Humanities and Social Sciences, Portsmouth
    28. [Figure: shortest link path within .ac.uk / .uk; subsites shown: cfd.me.umist.ac.uk, ercoftac.mech.surrey.ac.uk, cajun.cs.nott.ac.uk, ukoln.bath.ac.uk, cs.man.ac.uk, ashmol.ox.ac.uk, collections.ucl.ac.uk, vlmp.museophile.sbu.ac.uk]
    29. path net = ‘mini’ small world (Björneborn 2006)
        • path net = all shortest link paths between two given nodes (subsites)
        • network analysis tool = Pajek; adjacency matrix
        [Figure label: transversal link]
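A path net as defined above, all shortest link paths between two given nodes, can be extracted with a breadth-first search that remembers every predecessor lying on some shortest route. The study itself used Pajek; the function below and its toy subsite graph are hypothetical and only illustrate the definition.

```python
from collections import deque


def path_net(adjacency, source, target):
    """Return every shortest directed path from source to target (sorted)."""
    distance = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                # First visit: record distance and the first predecessor.
                distance[nxt] = distance[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif distance[nxt] == distance[node] + 1:
                # Another predecessor on an equally short route.
                parents[nxt].append(node)
    if target not in distance:
        return []

    def unwind(node):
        """Rebuild all shortest paths ending at node by walking back to source."""
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in unwind(p)]

    return sorted(unwind(target))


# Hypothetical subsite graph with two equally short routes from s to t.
toy = {"s": ["x", "y"], "x": ["t"], "y": ["t"], "t": []}
```

For the toy graph, `path_net(toy, "s", "t")` returns both two-step routes, while the reverse direction yields an empty list because no links lead back from t to s.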
    30. some indicative findings
        • findings not generalizable: small, stratified sample
        • however: indicative findings may suggest
            • computer-science sites = academic cross-topic connectors
            • personal link creators = web cohesion ‘glue’, especially link lists
                • researchers, PhD students, etc. are important providers of site outlinks and important receivers of site inlinks
            • over 80% of cross-topic links academic (research, teaching)
    31. small-world web implications
        • small local threads in the shape of users’ links affect how the global web is cohesive and may be traversed
            • like ‘the strength of weak ties’ (Granovetter 1973)
            • knowledge diffusion and social cohesion across social groups
        • counteract ‘balkanization’
            • disconnected / unreachable subpopulations
        • reachability structures
            • essential for web crawler harvests
    32. webometric study: genre connectivity
        • what role do web page genres play for cohesion and reachability on the Web? [one of the first studies]
        • what types of web page genres function as link providers and link receivers between university web sites?
    33. genre connectivity analysis
        • 352 links
        • 281 source pages
        • 249 target pages
        • source pages and target pages in 10 path nets
    34. meta genres
    35. genre pairs
    36. web of genres: genre network graph extracted with Pajek software. © Björneborn
    37. genre connectivity
        • academic web spaces = rich diversity of interlinked genres = diversified link motivations
        • personal link creators are important web cohesion builders
            • personal link lists provide site outlinks
            • personal homepages receive site inlinks
        • genre connectivity affects web cohesion and reachability by genre drift and topic drift
    38. genre drift + topic drift
        • topic clusters with genre diversity + genres with topical diversity
        • changes in page genres and page topics along link paths
        • genre drift within clusters + topic drift between clusters → short link distances (small world)
    39. questions?
    40. read more:
        • Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD dissertation. www.db.dk/LB
        • Björneborn (2006). ‘Mini small worlds’ of shortest link paths crossing domain boundaries in an academic Web space. Scientometrics, 68(3): 395-414.
        • Björneborn (forthcoming). Genre connectivity and genre drift in a web of genres. In: Mehler et al. Genres on the Web: Corpus Studies and Computational Models.
