Webometrics 1.0 - from AltaVista to Small Worlds and Genre Drift
 

NORSLIS PhD course in informetrics, Umeå University, Sweden, 18 June 2008

Presentation Transcript

  • Webometrics 1.0: from AltaVista to Small Worlds and Genre Drift. Lennart Björneborn, Royal School of Library and Information Science. NORSLIS PhD course in informetrics, Umeå, 18.6.2008
  • outline
    • webometrics 1.0
      • birth of webometrics
      • early webometric research
    • two webometric studies
      • small-world link analysis
        • based on graph theory and social network analysis
      • genre connectivity analysis
    M.C. Escher: House of Stairs, 1951
  • WWW = largest network with available connectivity data Wood et al. (1995)
  • WWW = collaborative weaving = macro-level aggregations of micro-level interactions = reflect social, cultural formations Wood et al. (1995)
  • WWW = keep track of “the complex web of relationships between people, programs, machines and ideas” (Tim Berners-Lee, 1997) Wood et al. (1995)
  • birth of webometrics
    • citation analogy
      • link = implicit recommendation of webpage
      • though also negative references
    • ‘Webometrics’ 1997 + ‘Web Impact Factor’ 1998
      • Almind & Ingwersen (1997). Informetric analyses on the World Wide Web: methodological approaches to ‘webometrics’.
      • Ingwersen (1998). The calculation of Web impact factors.
    • Google ‘PageRank’ 1998
      • exploit link structures: who receives many links from someone who also receives many links from someone who also … ?
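The recursive idea in the last bullet (links from well-linked pages count for more) can be sketched as a power iteration. This is a minimal illustration, not Google's implementation: the damping factor 0.85, the iteration count and the toy link graph are all assumptions made here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of a PageRank-style score.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score, then shares from its inlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: B and E receive no links at all.
toy_links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A"], "E": ["D"]}
ranks = pagerank(toy_links)
```

In this toy graph the pages without inlinks (B, E) end up with only the base score, while pages fed by other linked pages accumulate more.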
  • birth of webometrics: access to link data*
    • linkdomain:norslis.net -site:norslis.net
    • link:www.norslis.net -site:norslis.net
    (* cf. breakthrough of bibliometrics: access to citation data)
  • linkdomain:norslis.net -site:norslis.net
  • basic link terminology
    • B has an inlink from A : ~ citation
    • B has an outlink to C : ~ reference
    • B has a selflink : ~ self-citation
    • C and D have co-inlinks from B : ~ co-citation
    • B and E have co-outlinks to D : ~ bibliographic coupling
    [figure: co-links among nodes A to H] (Björneborn 2004)
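The terminology above can be made concrete with a small link list. The sketch below uses a hypothetical set of links chosen only to match the five bullet points; it is not the full diagram from Björneborn (2004), and the function names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (source, target) links matching the bullets above.
links = [("A", "B"), ("B", "B"), ("B", "C"), ("B", "D"), ("E", "D")]

inlinks = defaultdict(set)   # page -> pages that link to it (~ citations)
outlinks = defaultdict(set)  # page -> pages it links to (~ references)
selflinks = set()            # pages linking to themselves (~ self-citation)
for source, target in links:
    if source == target:
        selflinks.add(source)
    else:
        outlinks[source].add(target)
        inlinks[target].add(source)


def co_inlinked(inlinks):
    """Pairs of pages sharing at least one inlinking page (~ co-citation)."""
    return {(p, q) for p, q in combinations(sorted(inlinks), 2)
            if inlinks[p] & inlinks[q]}


def co_outlinking(outlinks):
    """Pairs of pages linking to a common target (~ bibliographic coupling)."""
    return {(p, q) for p, q in combinations(sorted(outlinks), 2)
            if outlinks[p] & outlinks[q]}
```

With these links, C and D are co-inlinked (both linked from B), and B and E are co-outlinking (both link to D).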
  • some proposed web metrics
    • Netometrics (Bossy, 1995)
      • supplement bibliometrics and scientometrics in observing “science in action” on the Internet
    • Webometry (Abraham, 1996)
    • Internetometrics (Almind & Ingwersen, 1996)
    • Webometrics (Almind & Ingwersen, 1997)
    • Cybermetrics (journal started 1997 by Isidro Aguillo)
    • Web bibliometry (Chakrabarti et al., 2002)
  • some related web science
    • Web Mining (e.g., Etzioni, 1996; Kosala & Blockeel, 2000)
    • Web Ecology (e.g., Pitkow, 1997; Chi et al., 1998; Huberman, 2001)
    • Cyber Geography (e.g., Girardin, 1995)
    • Cyber Cartography (e.g., Dodge, 1999)
    • Web Graph Analysis (e.g., Kleinberg et al., 1999; Broder et al., 2000)
    • Web Dynamics (e.g., Levene & Poulovassilis, 2001)
    • Webology (journal started 2004 by Alireza Noruzi)
    • Web Science (Berners-Lee et al., 2006)
  • webometrics
    • the study of quantitative aspects of the construction and use of information resources, structures and technologies on the Web, drawing on bibliometric and informetric approaches (Björneborn 2004)
    [figure: informetrics, bibliometrics, scientometrics, webometrics, cybermetrics]
  • webometrics
    • four main research areas of webometric concern:
      • web page content analysis;
      • web link structure analysis;
      • web usage analysis (e.g., log files);
      • web technology analysis (e.g., search engine performance)
    [figure: informetrics, bibliometrics, scientometrics, webometrics, cybermetrics] (Björneborn 2004)
  • web data collection
    • non-standardized, messy data
      • due to diversified, distributed, dynamic web
      • lack of metadata
    • primary data
      • own web crawler (beware: robot exclusion)
      • direct access to web servers incl. log files
      • Internet Archive (www.archive.org)
      • manual collection with browser
    • secondary data
      • search engines (beware: deficiencies)
    • necessary data cleansing
      • mirror sites, variant names, typo domains + links
      • many file formats, including misspellings
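The "beware: robot exclusion" caveat for an own web crawler can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt content, the host name and the user-agent string are hypothetical, and a real crawler would fetch the file from the server rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load it from the
# site itself, e.g. with RobotFileParser.set_url() followed by .read().
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url, user_agent="example-webometrics-bot"):
    """True if the parsed exclusion rules allow this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

A crawler loop would call `may_crawl(url)` before each request and skip excluded pages, which is one reason crawled link counts can differ from what a browser user sees.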
  • examples of webometric analysis
    • power-law distributions
      • e.g. pages, outlinks, inlinks, visits per web site (Adamic & Huberman 2001)
    • correlation between research indicators and inlinks
      • e.g. UK, Taiwan, Australia (several studies by Thelwall et al.)
      • EU projects EICSTES + WISER
    • co-inlink cluster analysis
      • analogous to cocitation analysis
      • e.g. EU universities (Polanco et al. 2001)
      • e.g. Chinese IT companies (Vaughan & You 2005)
    • longitudinal studies
      • web page change and permanence (e.g. Koehler 2004)
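As a rough illustration of the power-law bullet, the exponent of a distribution p(x) ~ x^(-alpha) can be estimated from data. The sketch below uses the standard continuous maximum-likelihood estimator, which is a technique swapped in here and not necessarily the method of the cited studies; the data are synthetic, and real counts (pages, inlinks) are discrete, so this is an approximation.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha),
    using only observations >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic data (an assumption, not real web data): inverse-transform
# sampling from a power law with a known exponent.
rng = random.Random(42)
true_alpha, x_min = 2.5, 1.0
samples = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
           for _ in range(20000)]
alpha_hat = powerlaw_alpha_mle(samples, x_min)
```

On the synthetic sample the estimate should land close to the exponent used to generate it; on real web data the choice of `x_min` matters a great deal.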
  • http://www.scit.wlv.ac.uk/~cm1993/mtpublications.html
  • small-world link analysis
    • based on graph theory and social network analysis
    Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD thesis. www.db.dk/LB
  • graph theory - Leonhard Euler (1707-1783), Königsberg (Wilson & Watkins 1990)
  • graph theory
    • graph = mathematical modeling of network
      • directed graph: e.g. www
    • nodes (or vertices): A, B, C, D, E
    • edges (if directed: arcs, links): AC, EB, ...
    • degree: d(A) = 3 (outdegree: d_O(A) = 2; indegree: d_I(A) = 1)
    • directed walk: ACB: path length = 2
    • geodesic distance: length of the shortest path between two nodes
    • centrality
      • global centrality: least sum of geodesic distances to all other nodes
      • betweenness centrality: most shortest paths pass through the node
    Gross & Yellen (1999). Graph theory and its applications.
    [figure: example graph with nodes A, B, C, D, E]
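The definitions above can be checked on a small directed graph. In this sketch the arcs AC, CB and EB come from the slide text, while AD and DA are hypothetical additions chosen so that the degrees of A match the slide; the original figure is not recoverable from the transcript.

```python
from collections import deque

# AC, CB, EB are mentioned on the slide; AD and DA are assumed here so that
# the outdegree of A is 2 and its indegree is 1.
arcs = [("A", "C"), ("C", "B"), ("E", "B"), ("A", "D"), ("D", "A")]
nodes = ["A", "B", "C", "D", "E"]

successors = {n: [] for n in nodes}
for tail, head in arcs:
    successors[tail].append(head)

outdegree = {n: len(successors[n]) for n in nodes}
indegree = {n: sum(1 for _, head in arcs if head == n) for n in nodes}


def geodesic_distance(start, goal):
    """Length of the shortest directed path, or None if goal is unreachable."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for nxt in successors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None
```

Here the walk A, C, B has length 2 and is also the shortest path from A to B; in the other direction B has no outgoing arcs, so A is unreachable from B.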
  • graph theory applications
    • graph theory used for mathematical modeling of networks
      • e.g., biology, chemistry, physics, sociology, psychology, technology
    • also applied in information sciences incl. bibliometrics
      • citation networks (e.g., Garner, 1967; Doreian & Fararo, 1985; Hummon & Doreian, 1989; Shepherd, Watters & Cai, 1990; Egghe & Rousseau, 1990; Fang & Rousseau, 2001; Egghe & Rousseau, 2002; 2003a; 2003b)
      • information systems (e.g., Korfhage, Bhat & Nance, 1972)
      • hypertextual networks (e.g., Botafogo & Shneiderman, 1991; Smeaton, 1995; Furner, Ellis & Willett, 1996)
  • social network analysis
    • relations between actors in social network
    • sociometry - 1930s (Moreno) - sociograms
    • social networks - 1950s - social network analysis
    • makes use of mathematical graph theory
      • Wasserman & Faust (1994). Social network analysis: methods and applications. Cambridge University Press.
      • Otte & Rousseau (2002). Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6): 441-454
  • small-world networks
    • small-world = highly clustered + short paths
      • short distances through shortcuts between clusters in network
      • small-world = short local + short global distances
      • efficient diffusion of signals, contacts, ideas, viruses, etc. in networks
    • social network analysis in 1960s: ’six degrees of separation’
      • today: ‘small worlds’ in biological, chemical, technical, social networks
      • brains, epidemics, scientific collaboration, semantic networks etc.
    (Watts & Strogatz 1998)
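The "highly clustered + short paths" combination can be sketched numerically. The example below builds a ring lattice and adds two hand-picked shortcuts; this is a simplification of the random rewiring in Watts & Strogatz (1998), and the graph size and shortcut positions are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Undirected ring of n nodes, each joined to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Mean share of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        pairs = list(combinations(nbrs, 2))
        if pairs:
            total += sum(1 for a, b in pairs if b in adj[a]) / len(pairs)
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs (graph assumed connected)."""
    total, count = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count


ring = ring_lattice(20, 4)
shortcut = ring_lattice(20, 4)
for a, b in [(0, 10), (5, 15)]:  # two hand-picked shortcuts across the ring
    shortcut[a].add(b)
    shortcut[b].add(a)
```

Comparing `average_path_length(ring)` with `average_path_length(shortcut)` shows how a few shortcuts shorten global distances, while `clustering_coefficient` shows how much local clustering remains.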
  • small-world link analysis
    • most links connect similar topics → topical clusters
    • small-world web → cross-topic shortcuts
    • main research question
      • what types of web links, web pages and web sites function as cross-topic connectors in small-world link structures across an academic web space?
      • objective: identify micro-level aspects of how small-world phenomena emerge
    Björneborn (2004). Small-world link structures across an academic Web space: A library and information science approach. PhD thesis. www.db.dk/LB
  • UK link data 2001
    • 109 UK universities
      • web crawler, Thelwall
    • 7669 subsites
      • www.hum.port.ac.uk
      • www.atm.ox.ac.uk
      • ...
      • departments, centres, research groups, etc.
    • connections between 7669 subsites
      • 207 865 links
      • 105 817 web pages
  • ‘corona’ graph model: reachability structures (Björneborn 2004)
    • SCC (Strongly Connected Component): 1893
    • IN (traversable to SCC): 626
    • OUT (reachable from SCC): 2660
    • IN-Tendrils (connected from IN): 96
    • OUT-Tendrils (connected to OUT): 55
    • Tube (connecting IN to OUT): 7
    • Disconnected: 2332
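The reachability components named on this slide can be derived from a link list with plain graph searches. The following is a simplified sketch under assumptions made here: it takes the largest strongly connected component as the core, classifies the remaining nodes by what they can reach or be reached from, and runs on a hypothetical toy graph. It is not the procedure used in the thesis.

```python
from collections import deque


def reach(adj, starts):
    """All nodes reachable from `starts` (inclusive) by following `adj`."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def corona_components(links):
    nodes = {n for link in links for n in link}
    succ, pred = {}, {}
    for source, target in links:
        succ.setdefault(source, set()).add(target)
        pred.setdefault(target, set()).add(source)

    # Largest strongly connected component: nodes that can all reach each other.
    scc, assigned = set(), set()
    for node in sorted(nodes):
        if node not in assigned:
            component = reach(succ, [node]) & reach(pred, [node])
            assigned |= component
            if len(component) > len(scc):
                scc = component

    in_set = reach(pred, scc) - scc    # can reach the SCC
    out_set = reach(succ, scc) - scc   # reachable from the SCC
    rest = nodes - scc - in_set - out_set
    from_in = reach(succ, in_set) & rest
    to_out = reach(pred, out_set) & rest
    tube = from_in & to_out            # IN to OUT without passing the SCC
    return {
        "SCC": scc, "IN": in_set, "OUT": out_set,
        "IN-Tendrils": from_in - tube, "OUT-Tendrils": to_out - tube,
        "Tube": tube, "Disconnected": rest - from_in - to_out,
    }


# Hypothetical toy link list with one node or pair per component type.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"), ("c", "o"),
       ("i", "t1"), ("t2", "o"), ("i", "u"), ("u", "o"), ("x", "y")]
parts = corona_components(toy)
```

The repeated searches in the SCC step are quadratic in the worst case; for thousands of subsites a linear-time SCC algorithm (e.g. Tarjan's) would be the usual choice.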
  • 10 seed nodes (stratified sampling in SCC component)
    • 10 path nets with all shortest link paths between five pairs of topically dissimilar subsites
      • Ophthalmology Dept [eye research], Oxford (eye.ox.ac.uk) ↔ Geography Dept, Plymouth (geog.plym.ac.uk)
      • Palaeontology Research Group, Earth Sciences Dept, Bristol (palaeo.gly.bris.ac.uk) ↔ Speech Research Group, Linguistics Dept, Essex (speech.essex.ac.uk)
      • Mathematics Dept, Glasgow Caledonian (maths.gcal.ac.uk) ↔ Psychology Dept, Manchester (psy.man.ac.uk)
      • Chemistry Dept, Glasgow (chem.gla.ac.uk) ↔ Economics Dept, Southampton (economics.soton.ac.uk)
      • Atmospheric, Oceanic and Planetary Physics, Oxford (atm.ox.ac.uk) ↔ Faculty of Humanities and Social Sciences, Portsmouth (hum.port.ac.uk)
  • shortest link path
    [figure: example link path within .ac.uk / .uk; subsites shown: cfd.me.umist.ac.uk, ercoftac.mech.surrey.ac.uk, cajun.cs.nott.ac.uk, ukoln.bath.ac.uk, cs.man.ac.uk, ashmol.ox.ac.uk, collections.ucl.ac.uk, vlmp.museophile.sbu.ac.uk]
  • path net = ‘mini’ small world
    • transversal link
    • path net = all shortest link paths between two given nodes (subsites)
    • network analysis tool = Pajek (adjacency matrix)
    (Björneborn 2006)
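A path net as defined here (all shortest link paths between two given nodes) can be sketched with a breadth-first search that remembers every predecessor lying on a shortest path. The function name and the toy link list are hypothetical; the study itself used Pajek.

```python
from collections import deque


def shortest_path_net(links, source, target):
    """Return every shortest directed link path from source to target."""
    succ = {}
    for a, b in links:
        succ.setdefault(a, []).append(b)

    # Breadth-first search, keeping all predecessors on shortest paths.
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)

    if target not in dist:
        return []

    def unwind(node):
        # Rebuild paths backwards from the target to the source.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in unwind(p)]

    return sorted(unwind(target))


# Hypothetical subsite links: two paths of length 2 and one longer detour.
toy_links = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"),
             ("s", "c"), ("c", "d"), ("d", "t")]
paths = shortest_path_net(toy_links, "s", "t")
path_net_nodes = {node for path in paths for node in path}
```

The longer route through c and d is excluded, so the path net contains only the nodes and links that lie on some shortest path.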
  • some indicative findings
    • findings not generalizable: small, stratified sample
    • however: indicative findings may suggest
      • computer-science sites = academic cross-topic connectors
      • personal link creators = web cohesion ‘glue’, especially link lists
        • researchers, PhD students, etc. are important providers of site outlinks and important receivers of site inlinks
      • over 80% of cross-topic links were academic (research, teaching)
  • small-world web implications
    • small local threads in the shape of users’ links affect how the global web is cohesive and may be traversed
      • like ‘the strength of weak ties’ (Granovetter 1973)
      • knowledge diffusion and social cohesion across social groups
    • counteract ‘balkanization’
      • disconnected / unreachable subpopulations
    • reachability structures
      • essential for web crawler harvests
  • webometric study: genre connectivity
    • what role do web page genres play for cohesion and reachability on the Web? [one of the first studies]
    • what types of web page genres function as link providers and link receivers between university web sites?
    • genre connectivity analysis
      • 352 links
      • 281 source pages
      • 249 target pages
      • source pages and target pages in 10 path nets
  • meta genres
  • genre pairs
  • web of genres: genre network graph extracted with Pajek software © Björneborn
  • genre connectivity
    • academic web spaces = rich diversity of interlinked genres = diversified link motivations
    • personal link creators are important web cohesion builders
      • personal link lists provide site outlinks
      • personal homepages receive site inlinks
    • genre connectivity affects web cohesion and reachability by genre drift and topic drift
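Which genres act as link providers and link receivers can be tallied once each link's source and target page carry a genre label. The sketch below uses hypothetical labels and counts purely for illustration; they are not data from the study.

```python
from collections import Counter

# Hypothetical genre-labelled links: (source page genre, target page genre).
# Labels and counts are illustrative only, not results from the study.
labelled_links = [
    ("personal link list", "personal homepage"),
    ("personal link list", "institutional homepage"),
    ("personal link list", "personal homepage"),
    ("course page", "personal homepage"),
]

genre_pairs = Counter(labelled_links)                           # pair -> link count
link_providers = Counter(src for src, _ in labelled_links)      # outlinks per genre
link_receivers = Counter(tgt for _, tgt in labelled_links)      # inlinks per genre
```

The `genre_pairs` counts are the kind of weighted pairs that could feed a genre network graph, with genres as nodes and pair counts as link weights.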
  • genre drift + topic drift
    • topic clusters with genre diversity + genres with topical diversity
    • changes in page genres and page topics along link paths
    • genre drift within clusters + topic drift between clusters → short link distances (small world)
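Drift along a link path can be counted as the number of steps where the page genre or page topic changes. This is a minimal sketch with a hypothetical labelled path; the genre and topic labels are assumptions, not categories taken from the study.

```python
def count_drift(path):
    """Count genre and topic changes between consecutive pages on a link path.

    path is a list of (genre, topic) labels, one per page.
    """
    steps = list(zip(path, path[1:]))
    genre_drift = sum(1 for (g1, _), (g2, _) in steps if g1 != g2)
    topic_drift = sum(1 for (_, t1), (_, t2) in steps if t1 != t2)
    return genre_drift, topic_drift


# Hypothetical path of four pages, labelled by hand for illustration.
example_path = [
    ("personal homepage", "atmospheric physics"),
    ("personal link list", "atmospheric physics"),
    ("institutional homepage", "computer science"),
    ("course page", "humanities"),
]
drift = count_drift(example_path)
```

In this example the genre changes at every step, while the topic stays the same on the first step and changes on the next two.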
  • questions?
  • read more:
    • Björneborn (2004). Small-world link structures across an academic web space: A library and information science approach. PhD dissertation. www.db.dk/LB
    • Björneborn (2006). ‘Mini small worlds’ of shortest link paths crossing domain boundaries in an academic Web space. Scientometrics, 68(3): 395-414.
    • Björneborn (forthcoming). Genre connectivity and genre drift in a web of genres. In: Mehler et al. Genres on the Web: Corpus Studies and Computational Models.