Measurement and modeling of the web and related data sets

IMA Tutorial (part II): Measurement and modeling of the web and related data sets. Andrew Tomkins, IBM Almaden Research Center. May 5, 2003.

Setup

Context

Focus Areas

One view of the Internet: Inter-Domain Connectivity. Core; Shells: 1, 2, 3 [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]

Another view of the web: the hyperlink graph

Getting started – structure at the hyperlink level [Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]

Terminology

Data

Breadth-first search from random starts
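A minimal sketch of the experiment named in this slide title, assuming the graph is held as a dictionary from each page to its list of out-links. The function names (`bfs_reach`, `reverse_graph`, `sample_reach_sizes`) and the choice to search both along and against link direction are illustrative assumptions, not details taken from the slides.

```python
import random
from collections import deque

def bfs_reach(adj, start):
    """Return {node: distance} for every node reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def reverse_graph(adj):
    """Build the graph with every edge flipped, so BFS follows in-links."""
    rev = {u: [] for u in adj}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev

def sample_reach_sizes(adj, num_starts, seed=0):
    """Run a forward and a backward BFS from randomly chosen start nodes.

    Returns a list of (start, forward_reach_size, backward_reach_size).
    """
    rng = random.Random(seed)
    nodes = sorted(set(adj) | {v for outs in adj.values() for v in outs})
    rev = reverse_graph(adj)
    results = []
    for _ in range(num_starts):
        s = rng.choice(nodes)
        results.append((s, len(bfs_reach(adj, s)), len(bfs_reach(rev, s))))
    return results
```

On a toy graph such as `{"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}`, a start inside the cycle reaches the whole cycle forward and additionally reaches `d` backward.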

Some distance measurements

Facts (about the crawl). The distribution of indegrees on the web is given by a power law: a heavy-tailed distribution, with many high-indegree pages (e.g., Yahoo).

Analysis of power law:
$\Pr[\text{page has } k \text{ inlinks}] \approx k^{-2.1}$
$\Pr[\text{page has } k \text{ outlinks}] \approx k^{-2.7}$
Corollary: $\Pr[\text{page has} > k \text{ inlinks}] \approx 1/k$
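The tail estimate for inlinks can be recovered from the pointwise law by summing over all larger degrees. A rough sketch follows. It approximates the sum by an integral, drops the normalizing constant, and rounds the exponent 1.1 to 1 in the last step:

```latex
\Pr[\text{page has} > k \text{ inlinks}]
  \;\approx\; \sum_{j > k} j^{-2.1}
  \;\approx\; \int_{k}^{\infty} x^{-2.1}\,dx
  \;=\; \frac{k^{-1.1}}{1.1}
  \;\approx\; \frac{1}{k}
```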

Component sizes.

Other observed power laws in the web [Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]

More Characterization: Self-Similarity

Ways to Slice the Web. We call these slices “Thematically Unified Communities”, or TUCs.

Self-Similarity on the Web

In particular…

Is this surprising?

A structural explanation

The Navigational Backbone Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs

Information Extraction from Large Graphs

Overview. Goal: create higher-level "knowledge bases" of web information for further processing. [Figure: WWW, Distill, KB1, KB2, KB3] [Kumar, Raghavan, Rajagopalan, Tomkins 1999]

Many approaches to this problem

General approach

Web Communities. [Figure: two example communities. Fishing: Outdoor Magazine, Bill's Fishing Resources. Linux: Linux Links, LDP.] Different communities appear to have very different structure.

Web Communities. [Figure: the same two communities.] But both contain a common “footprint”: two pages that both point to three other pages in common.

Communities and cores. Example: $K_{2,3}$. Definition: a "core" $K_{i,j}$ consists of i left nodes, j right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected]. 2. Almost all cores betoken a community [unexpected].
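The core definition can be checked directly on a small link graph. The sketch below assumes the graph is a dictionary from each page to its out-links. `is_core` and `find_cores` are hypothetical helper names, and the brute-force enumeration is only meant for toy inputs, not for web-scale data.

```python
from itertools import combinations

def is_core(out_links, left, right):
    """True if every left node links to every right node (a K_{i,j} core)."""
    right = set(right)
    return all(right <= set(out_links.get(u, ())) for u in left)

def find_cores(out_links, i, j):
    """Brute-force enumeration of K_{i,j} cores; only sensible for tiny graphs."""
    cores = []
    for left in combinations(sorted(out_links), i):
        # Pages that every chosen left node points to.
        common = set.intersection(*(set(out_links[u]) for u in left))
        common -= set(left)  # keep the two sides disjoint
        for right in combinations(sorted(common), j):
            cores.append((left, right))
    return cores

# Toy example: two "fan" pages sharing three targets form a K_{2,3}.
toy = {
    "fan1": ["p1", "p2", "p3"],
    "fan2": ["p1", "p2", "p3", "p4"],
    "fan3": ["p4"],
}
toy_cores = find_cores(toy, 2, 3)
```

Here `toy_cores` holds the single core with left side `("fan1", "fan2")` and right side `("p1", "p2", "p3")`.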

Other footprint structures: newsgroup thread, web ring, corporate partnership, intranet fragment.

Subgraph enumeration

Enumerating cores.
Preprocessing: clean data by removing mirrors (true and approximate), empty pages, too-popular pages, and nepotistic pages.
Inclusion/Exclusion Pruning: [Figure: node a pointing to b1, b2, b3] a belongs to a $K_{2,3}$ if and only if some other node points to b1, b2, b3.
Postprocessing: when no more pruning is possible, finish using database techniques.
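The inclusion/exclusion test can be sketched for a page whose out-degree is exactly j. Either enough other pages point to all of its targets, in which case a core is reported, or the page can be pruned. This is a simplified illustration under that reading of the slide. The names `resolve_fan` and `inclusion_exclusion_pass` are hypothetical, and the sketch omits the degree-based filtering and the database-style postprocessing.

```python
def resolve_fan(out_links, a, i=2):
    """Decide whether page `a` lies in a K_{i,j} core, where j = out-degree of a.

    Returns (left_nodes, right_nodes) if a core exists, otherwise None.
    """
    targets = set(out_links[a])
    # Other pages that point to every one of a's targets.
    partners = sorted(
        u for u, outs in out_links.items()
        if u != a and targets <= set(outs)
    )
    if len(partners) >= i - 1:
        left = tuple(sorted([a] + partners[: i - 1]))
        return left, tuple(sorted(targets))
    return None

def inclusion_exclusion_pass(out_links, i=2, j=3):
    """One pass over pages with exactly j out-links.

    Each such page is either reported as part of a core or excluded;
    in both cases it is removed from the remaining graph.
    """
    cores = []
    remaining = dict(out_links)
    for a in sorted(out_links):
        if len(set(out_links[a])) != j:
            continue
        core = resolve_fan(remaining, a, i)
        if core is not None:
            cores.append(core)
        del remaining[a]
    return cores, remaining

example = {
    "a": ["b1", "b2", "b3"],
    "c": ["b1", "b2", "b3", "x"],
    "d": ["b1", "b2", "y"],
}
found, left_over = inclusion_exclusion_pass(example, i=2, j=3)
```

In this example `a` is included in a core together with `c`, while `d` is excluded because no other page points to all of its targets.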

Results for cores. [Figure: number of cores found by Elimination/Generation, in thousands (axis 0 to 100), with series i=3, i=4, i=5, i=6 over axis values 3, 5, 7, 9.] [Figure: number of cores found during postprocessing, in thousands (axis 0 to 80), with series i=3, i=4 over axis values 3, 5, 7, 9.]

The cores are interesting. (1) Implicit communities are defined by cores. (2) There are an order of magnitude more of these ($10^{5}$+). (3) Can grow the core to the community using further processing. Explicit communities. Implicit communities.

Elementary Schools in Japan

So…

A word on evolution [Kleinberg02]

Example. [Figure: a conversation laid out along a time axis and annotated with a two-state model. State 1: output rate very low. State 2: output rate very high. The state transitions are labeled 0.005 and 0.01. Utterances: “I’ve been thinking about your idea with the asparagus…”, “Uh huh”, “I think I see…”, “Uh huh”, “Yeah, that’s what I’m saying…”, “So then I said ‘Hey, let’s give it a try’”, “And anyway she said maybe, okay?”. Per-step labels for state 2: Pr[2] ~ 10, 10, 7, 2, 5, 2, 5. Per-step labels for state 1: Pr[1] ~ 2, 1, 2, 10, 5, 10, 1. Most likely “hidden” sequence: 2 2 2 1 1 1 1.]

More bursts

Integrating bursts and graph analysis [KNRT03]. [Figure series: number of blog communities; number of blog pages that belong to a community; number of communities identified automatically as exhibiting “bursty” behavior, a measure of cohesiveness of the blogspace. Annotations: Wired magazine publishes an article on weblogs that impacts the tech community; Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption.]

IMA Tutorial (part III): Generative and probabilistic models of data. May 5, 2003.

Probabilistic generative models

Models for Power Laws

An Introduction to the Power Law. [Figure: exponentially-decaying distribution versus power law distribution]

Early Observations: Pareto on Income

Early Observations: Yule/Zipf

Early Observations: Lotka on Citations

Ranks versus Values

Equivalence of rank versus value formulation [Bookstein90, Adamic99]

Early modeling work

A model of Simon

Constructing a book: snapshot at time t. Text so far: “When in the course of human events, it becomes necessary…” Let f(i,t) be the number of words of count i at time t. Current word frequencies:

Rank      Word           Count
1         “the”          1000
2         “of”           600
3         “from”         300
…         “...”          …
4,791     “necessary”    5
…         “...”          …
11,325    “neccesary”    1

The Generative Model

Constructing a book: snapshot at time t. Current word frequencies are as in the previous snapshot. Let f(i,t) be the number of words of count i at time t.
$\Pr[\text{“the”}] = (1-\alpha)\cdot 1000 / K$
$\Pr[\text{“of”}] = (1-\alpha)\cdot 600 / K$
$\Pr[\text{some count-1 word}] = (1-\alpha)\cdot 1 \cdot f(1,t) / K$
$K = \sum_i i\, f(i,t)$
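The word-choice probabilities on this slide can be simulated in a few lines. The sketch below writes `alpha` for the probability of introducing a new word and reuses an existing word otherwise. Picking a uniformly random earlier token selects each existing word with probability count / K. The function names and the use of integer word ids are illustrative assumptions.

```python
import random
from collections import Counter

def simulate_simon(num_words, alpha, seed=0):
    """Generate a text of num_words tokens under the reuse-or-invent rule.

    Returns a Counter mapping word id -> number of occurrences.
    """
    rng = random.Random(seed)
    text = [0]      # the first token is necessarily a new word
    next_id = 1
    while len(text) < num_words:
        if rng.random() < alpha:
            text.append(next_id)          # introduce a brand-new word
            next_id += 1
        else:
            # A uniformly random earlier token is word w with
            # probability count(w) / K, where K = len(text).
            text.append(rng.choice(text))
    return Counter(text)

def bucket_sizes(counts):
    """f(i): how many distinct words occur exactly i times."""
    return Counter(counts.values())

counts = simulate_simon(5000, alpha=0.1, seed=1)
f = bucket_sizes(counts)
```

Plotting `f[i]` against `i` on log-log axes for a long enough text gives a quick visual check of whether the bucket sizes fall off polynomially.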

What’s going on? [Figure: buckets numbered 1 through 6, and so on.] Each ball is one unique word (which occurs 1 or more times). Each word in bucket i occurs i times in the current document.

What’s going on? With probability $\alpha$ a new word is introduced into the text. [Figure: buckets 1 through 6.]

What’s going on? With probability $1-\alpha$ an existing word is reused. How many times do words in this bucket occur? [Figure: buckets 1 through 6.]

What’s going on? Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t. Must show: fraction of balls in 3rd bucket approaches some limiting value. [Figure: buckets 2, 3, 4.]
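The dependence of bucket 3 on buckets 2 and 3 can be written as a one-step expectation. In this sketch, $\alpha$ denotes the probability of introducing a new word and $K = \sum_i i\, f(i,t)$ is the total number of word occurrences so far. A reused word drawn from bucket 2 moves up into bucket 3, and a reused word drawn from bucket 3 moves out into bucket 4:

```latex
\mathbb{E}\bigl[f(3,t+1) \mid f(\cdot,t)\bigr]
  = f(3,t)
  + (1-\alpha)\,\frac{2\,f(2,t)}{K}
  - (1-\alpha)\,\frac{3\,f(3,t)}{K}
```

The same reasoning gives, for any bucket $i \ge 2$, a gain term $(1-\alpha)(i-1)\,f(i-1,t)/K$ and a loss term $(1-\alpha)\,i\,f(i,t)/K$.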

Models for power laws in the web graph

Why create such a model?

Random graph models. Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?

                G(n,p)    Web
indeg > 1000    0         100000
K2,3's          0         125000
4-cliques       0         many

Desiderata for a graph model

Page creation on the web. Model idea: new pages add links by "copying" them from existing pages.

Generally, would require…

A specific model

Example: a new node arrives. With probability $\alpha$, it links to a uniformly chosen page.

Example: with probability $1-\alpha$, it decides to copy a link. To copy, it first chooses a page uniformly, then chooses a uniform out-edge from that page, then links to the destination of that edge ("copies" the edge). Under copying, your rate of getting new inlinks is proportional to your in-degree.
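The arrival rule in this example can be simulated directly. The sketch below makes two simplifying assumptions that are not in the slides: every new page adds exactly one out-link, and the graph starts from a single page that links to itself, so there is always an out-edge to copy. The parameter `alpha` is the probability of linking to a uniformly chosen page, and the function names are illustrative.

```python
import random
from collections import Counter

def simulate_copying(num_nodes, alpha, seed=0):
    """Grow a graph in which each new page adds exactly one out-link.

    Returns a list `target` where target[v] is the page that v links to.
    """
    rng = random.Random(seed)
    target = [0]  # assumed seed graph: page 0 links to itself
    for new in range(1, num_nodes):
        page = rng.randrange(new)        # uniformly chosen existing page
        if rng.random() < alpha:
            target.append(page)          # link directly to that page
        else:
            # Copy: with one out-edge per page, the uniformly chosen
            # out-edge of `page` is simply its only link.
            target.append(target[page])
    return target

def indegrees(target):
    """Map each page to the number of links pointing at it."""
    return Counter(target)

links = simulate_copying(2000, alpha=0.5, seed=3)
deg = indegrees(links)
```

Because a copied link lands on a page in proportion to how many links already point to it, early pages accumulate inlinks much faster than a uniform choice alone would allow.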

Degree sequences in this model: $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-\frac{2-\alpha}{1-\alpha}}$ ($\alpha = 1/11$ matches web). Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs.
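As a quick arithmetic check, writing $\alpha$ for the probability of linking to a uniformly chosen page, the value $\alpha = 1/11$ reproduces the inlink exponent of 2.1 quoted for the crawl:

```latex
\frac{2-\alpha}{1-\alpha}\bigg|_{\alpha = 1/11}
  = \frac{2 - \tfrac{1}{11}}{1 - \tfrac{1}{11}}
  = \frac{21/11}{10/11}
  = 2.1
```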

Model extensions

A model of Mandelbrot

Discussion of Mandelbrot’s model

Heuristically Optimized Trade-offs [Fabrikant, Koutsoupias, Papadimitriou 2002]

Monkeys on Typewriters

Other Distributions

Quick characterization of lognormal distributions

One final direction…
