IMA Tutorial (part II): Measurement and modeling of the web and related data sets. Andrew Tomkins, IBM Almaden Research Center. May 5, 2003.
Setup
Context
Focus Areas
One view of the Internet: Inter-Domain Connectivity. [Figure labels: Core; Shells 1, 2, 3] [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
Another view of the web: the hyperlink graph
Getting started – structure at the hyperlink level [Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
Terminology
Data
Breadth-first search from random starts
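A breadth-first search from a start page can be sketched as follows. This is a minimal illustration, assuming the crawl is held in memory as an adjacency dict; the names `bfs_reach`, `reverse_graph`, and the toy graph are hypothetical and not taken from the tutorial.

```python
from collections import deque

def bfs_reach(out_links, start):
    """Return {page: distance} for every page reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in out_links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def reverse_graph(out_links):
    """Build in-link adjacency so the same BFS can follow links backward."""
    in_links = {}
    for src, dsts in out_links.items():
        in_links.setdefault(src, [])
        for dst in dsts:
            in_links.setdefault(dst, []).append(src)
    return in_links

# Toy graph (hypothetical): a -> b -> c -> a, c -> d, e -> a
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["a"]}
forward = bfs_reach(toy, "a")                   # pages reachable from "a"
backward = bfs_reach(reverse_graph(toy), "a")   # pages that can reach "a"
```

Running the search both forward and backward from the same start separates pages a start can reach from pages that can reach it, which is one way to probe large-scale link structure.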
A Picture of (~200M) pages.
Some distance measurements
Facts (about the crawl). The distribution of indegrees on the web follows a power law: a heavy-tailed distribution, with many high-indegree pages (e.g., Yahoo).
Analysis of power law. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-2.1}$. $\Pr[\text{page has } k \text{ outlinks}] \approx k^{-2.7}$. Corollary: $\Pr[\text{page has} > k \text{ inlinks}] \approx 1/k$.
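The roughly $1/k$ tail can be read off the indegree law by summing it. A short worked step, approximating the sum by an integral and writing $C$ for an unspecified normalizing constant (notation introduced for this sketch only):

```latex
\Pr[\text{page has} > k \text{ inlinks}]
  \approx \sum_{j > k} C\, j^{-2.1}
  \approx C \int_{k}^{\infty} x^{-2.1}\, dx
  = \frac{C}{1.1}\, k^{-1.1}
```

An exponent of $-1.1$ is close to $-1$, which is the sense in which the tail behaves like $1/k$.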
Component sizes.
Other observed power laws in the web [Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]
More Characterization: Self-Similarity
Ways to Slice the Web. We call these slices “Thematically Unified Communities”, or TUCs.
Self-Similarity on the Web
In particular…
Is this surprising?
A structural explanation
The Navigational Backbone Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs
Information Extraction from Large Graphs
Overview. [Figure labels: WWW, Distill, KB1, KB2, KB3] Goal: create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
Many approaches to this problem
General approach
Web Communities. [Figure: two example communities, Fishing (Outdoor Magazine, Bill's Fishing Resources) and Linux (Linux Links, LDP)] Different communities appear to have very different structure.
Web Communities. [Figure: the same Fishing and Linux communities] But both contain a common “footprint”: two pages that both point to three other pages in common.
Communities and cores. Example: $K_{2,3}$. Definition: a "core" $K_{i,j}$ consists of $i$ left nodes, $j$ right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected]. 2. Almost all cores betoken a community [unexpected].
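The core definition can be made concrete with a brute-force search. The sketch below is only an illustration for small graphs, assuming an adjacency dict in which every page appears as a key; `find_cores` and the toy pages are hypothetical names, and this exhaustive loop is not the enumeration procedure used in the tutorial.

```python
from itertools import combinations

def find_cores(out_links, i, j):
    """List (left, right) pairs where each of the i left nodes links to all j right nodes."""
    cores = []
    nodes = sorted(out_links)
    for left in combinations(nodes, i):
        # Pages that every left node points to, excluding the left nodes themselves.
        common = set.intersection(*(set(out_links[u]) for u in left)) - set(left)
        for right in combinations(sorted(common), j):
            cores.append((left, right))
    return cores

# Toy graph (hypothetical): two fan pages share three targets.
toy = {
    "fan1": ["p", "q", "r"],
    "fan2": ["p", "q", "r", "s"],
    "fan3": ["p"],
    "p": [], "q": [], "r": [], "s": [],
}
cores = find_cores(toy, 2, 3)
```

The cost grows combinatorially with the number of pages, which is why web-scale enumeration needs aggressive pruning first.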
Other footprint structures: newsgroup thread, web ring, corporate partnership, intranet fragment.
Subgraph enumeration
Enumerating cores. Preprocessing: clean data by removing mirrors (true and approximate), empty pages, too-popular pages, and nepotistic pages. Inclusion/exclusion pruning: [Figure: node a pointing to b1, b2, b3] a belongs to a $K_{2,3}$ if and only if some node points to b1, b2, b3. Postprocessing: when no more pruning is possible, finish using database techniques.
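One simple way to shrink the graph before enumeration is to discard edges whose endpoints have too few links to take part in any $K_{i,j}$. The sketch below shows this degree-based pass; it is a simpler relative of the inclusion/exclusion pruning named on the slide, offered as an assumption-laden illustration rather than the tutorial's procedure. The function name `prune_for_cores` and the toy edges are hypothetical.

```python
def prune_for_cores(edges, i, j):
    """Repeatedly drop edges that cannot be part of any K_{i,j}."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for u, v in edges:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        # A left node needs at least j out-links; a right node needs at least i in-links.
        kept = {(u, v) for u, v in edges if out_deg[u] >= j and in_deg[v] >= i}
        if kept == edges:
            return kept
        edges = kept

# Toy edge list (hypothetical).
toy_edges = [
    ("fan1", "p"), ("fan1", "q"), ("fan1", "r"),
    ("fan2", "p"), ("fan2", "q"), ("fan2", "r"), ("fan2", "s"),
    ("fan3", "p"),
]
survivors = prune_for_cores(toy_edges, 2, 3)
```

Each pass can lower other nodes' degrees, so the loop repeats until nothing changes; whatever survives can then be handed to an exact enumeration step.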
Results for cores. [Figure: two plots with counts in thousands. First plot: number of cores found by Elimination/Generation, series i=3, i=4, i=5, i=6. Second plot: number of cores found during postprocessing, series i=3, i=4. Horizontal axis ticks: 3, 5, 7, 9.]
The cores are interesting. [Panel headings: Explicit communities; Implicit communities] (1) Implicit communities are defined by cores. (2) There is an order of magnitude more of these ($10^{5+}$). (3) Can grow the core to the community using further processing.
Elementary Schools in Japan
So…
A word on evolution
A word on evolution [Kleinberg02]
Example. [Figure: a timeline of messages (“I’ve been thinking about your idea with the asparagus…”, “Uh huh”, “I think I see…”, “Uh huh”, “Yeah, that’s what I’m saying…”, “So then I said ‘Hey, let’s give it a try’”, “And anyway she said maybe, okay?”) scored against a two-state automaton. State 1: output rate very low. State 2: output rate very high. Transition labels: 0.005 and 0.01. Per-message labels as extracted: Pr[2] ~ 10, 10, 7, 2, 5, 2, 5; Pr[1] ~ 2, 1, 2, 10, 5, 10, 1. Most likely “hidden” sequence: 2 2 2 1 1 1 1.]
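Recovering the most likely hidden sequence for a two-state automaton is a standard Viterbi computation. The sketch below is one way to set it up, under assumptions that are not in the slide: gaps between messages are modeled as exponentially distributed, the 0.005 label is taken to be the switch from state 1 to state 2 and 0.01 the switch back, and both start states are equally likely. The function name and all parameter names are hypothetical.

```python
import math

def most_likely_states(gaps, low_rate, high_rate, p_up=0.005, p_down=0.01):
    """Viterbi decoding over inter-message gaps; returns a list of 1s and 2s.

    State 1 emits gaps from Exp(low_rate), state 2 from Exp(high_rate).
    p_up is the assumed 1 -> 2 switch probability, p_down the 2 -> 1 one.
    """
    rates = (low_rate, high_rate)
    log_trans = (
        (math.log(1 - p_up), math.log(p_up)),
        (math.log(p_down), math.log(1 - p_down)),
    )

    def log_emit(state, gap):
        # Log density of an exponential distribution with the state's rate.
        return math.log(rates[state]) - rates[state] * gap

    # Assumption of this sketch: a uniform prior over the starting state.
    score = [log_emit(s, gaps[0]) for s in (0, 1)]
    back = []
    for gap in gaps[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_emit(s, gap))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return [s + 1 for s in reversed(path)]  # report states as 1 and 2

# Hypothetical data: long gaps, then a run of short gaps, then long gaps again.
gaps = [100.0] * 5 + [0.1] * 10 + [100.0] * 5
states = most_likely_states(gaps, low_rate=0.01, high_rate=10.0)
```

Because switching states is penalized, a single short gap is not enough to enter the high-rate state; only a sustained run of short gaps is labeled as a burst.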
More bursts
Integrating bursts and graph analysis. [Figure: plot of the number of blog communities, the number of blog pages that belong to a community, and the number of communities identified automatically as exhibiting “bursty” behavior (a measure of the cohesiveness of the blogspace). Annotations: Wired magazine publishes an article on weblogs that impacts the tech community; Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption.] [KNRT03]
IMA Tutorial (part III): Generative and probabilistic models of data. May 5, 2003.
Probabilistic generative models
Models for Power Laws
An Introduction to the Power Law. [Figure labels: exponentially-decaying distribution; power law distribution]
Early Observations: Pareto on Income
Early Observations: Yule/Zipf
Early Observations: Lotka on Citations
Ranks versus Values
Equivalence of rank versus value formulation [Bookstein90, Adamic99]
Early modeling work
A model of Simon
Constructing a book: snapshot at time $t$. Text so far: “When in the course of human events, it becomes necessary…” Current word frequencies (rank, word, count): 1, “the”, 1000; 2, “of”, 600; 3, “from”, 300; …; 4,791, “necessary”, 5; …; 11,325, “neccesary”, 1. Let $f(i,t)$ be the number of words of count $i$ at time $t$.
The Generative Model
Constructing a book: snapshot at time $t$. Current word frequencies (rank, word, count): 1, “the”, 1000; 2, “of”, 600; 3, “from”, 300; …; 4,791, “necessary”, 5; …; 11,325, “neccesary”, 1. Let $f(i,t)$ be the number of words of count $i$ at time $t$. Pr[“the”] = $(1-\alpha) \cdot 1000 / K$; Pr[“of”] = $(1-\alpha) \cdot 600 / K$; Pr[some count-1 word] = $(1-\alpha) \cdot 1 \cdot f(1,t) / K$, where $K = \sum_i i \, f(i,t)$.
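The word-reuse rule above can be simulated in a few lines. This is a minimal sketch, assuming the new-word probability is called `alpha` and that the text starts with a single word; the function names, the integer word ids, and the seed are hypothetical choices for illustration.

```python
import random
from collections import Counter

def simulate_simon(steps, alpha, seed=0):
    """Grow a text word by word and return {word_id: count}."""
    rng = random.Random(seed)
    text = [0]       # the text so far, as a list of word ids
    next_id = 1
    for _ in range(steps):
        if rng.random() < alpha:
            text.append(next_id)   # introduce a brand-new word
            next_id += 1
        else:
            # Copying a uniformly chosen earlier position reuses each word
            # with probability proportional to its current count.
            text.append(rng.choice(text))
    return Counter(text)

def bucket_sizes(counts):
    """f(i): number of distinct words that occur exactly i times."""
    return Counter(counts.values())

counts = simulate_simon(5000, alpha=0.1)
f = bucket_sizes(counts)
total_words = sum(i * n for i, n in f.items())   # the quantity K
```

Sampling a uniform earlier position is a convenient way to get count-proportional reuse without maintaining explicit weights, and `total_words` equals the length of the text, matching the definition of $K$.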
What’s going on? [Figure: buckets 1 through 6, …; legend: one ball is one unique word (which occurs 1 or more times)] Each word in bucket $i$ occurs $i$ times in the current document.
What’s going on? [Figure: buckets 1 through 6] With probability $\alpha$ a new word is introduced into the text.
What’s going on? [Figure: buckets 1 through 6] With probability $1-\alpha$ an existing word is reused. How many times do words in this bucket occur?
What’s going on? [Figure: buckets 2, 3, 4] Size of bucket 3 at time $t+1$ depends only on sizes of buckets 2 and 3 at time $t$. Must show: fraction of balls in 3rd bucket approaches some limiting value.
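The dependence on neighboring buckets can be written as an expected-value recurrence. This is a sketch under stated assumptions: $\alpha$ denotes the new-word probability, one word is added per time step so that $K = \sum_i i\, f(i,t) \approx t$, and $f(i,t) \approx c_i\, t$ for large $t$. The constants $c_i$ are notation introduced here.

```latex
\mathbb{E}\big[f(i,t+1) \mid \text{state at } t\big]
  = f(i,t) + \frac{1-\alpha}{K}\Big[(i-1)\,f(i-1,t) - i\,f(i,t)\Big],
  \qquad i \ge 2

c_i = (1-\alpha)\Big[(i-1)\,c_{i-1} - i\,c_i\Big]
\;\Longrightarrow\;
\frac{c_i}{c_{i-1}} = \frac{(1-\alpha)(i-1)}{1 + (1-\alpha)\,i}
```

For large $i$ this ratio behaves like $1 - \big(1 + \tfrac{1}{1-\alpha}\big)/i$, which corresponds to $c_i$ decaying roughly as $i^{-(1 + 1/(1-\alpha))}$, a power law in the bucket index.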
Models for power laws in the web graph
Why create such a model?
Random graph models. Counts in $G(n,p)$ versus the web: pages with indegree > 1000: 0 vs. 100000; $K_{2,3}$'s: 0 vs. 125000; 4-cliques: 0 vs. many. Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
Desiderata for a graph model
Page creation on the web. Model idea: new pages add links by "copying" them from existing pages.
Generally, would require…
A specific model
Example. New node arrives. With probability $\alpha$, it links to a uniformly-chosen page.
Example. With probability $(1-\alpha)$, it decides to copy a link. To copy, it first chooses a page uniformly, then chooses a uniform out-edge from that page, then links to the destination of that edge ("copies" the edge). Under copying, your rate of getting new inlinks is proportional to your in-degree.
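The copying step can be simulated directly. The sketch below makes simplifying assumptions that are not stated on the slides: every new page adds exactly one out-link, the graph starts from a single page linking to itself, and the uniform-link probability is called `alpha`. The function names and seed are hypothetical.

```python
import random
from collections import Counter

def simulate_copying(n_pages, alpha, seed=0):
    """Grow a graph where each new page adds exactly one out-link."""
    rng = random.Random(seed)
    target = [0]  # target[p] is the page that p links to; page 0 links to itself
    for new_page in range(1, n_pages):
        chosen = rng.randrange(new_page)     # a uniformly chosen existing page
        if rng.random() < alpha:
            target.append(chosen)            # link to that page directly
        else:
            target.append(target[chosen])    # copy that page's out-link instead
    return target

def indegrees(target):
    """Map each linked-to page to its number of in-links."""
    return Counter(target)

links = simulate_copying(2000, alpha=1 / 11)
degree = indegrees(links)
```

With one out-link per page, choosing a uniform page and following its link lands on a destination with probability proportional to that destination's in-degree, which is the rich-get-richer effect described above.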
Degree sequences in this model. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-(2-\alpha)/(1-\alpha)}$ ($\alpha = 1/11$ matches web). Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs.
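A quick check of the quoted parameter, taking the exponent to be $(2-\alpha)/(1-\alpha)$ with $\alpha$ the uniform-link probability:

```latex
\left.\frac{2-\alpha}{1-\alpha}\right|_{\alpha = 1/11}
  = \frac{21/11}{10/11}
  = 2.1
```

This agrees with the measured indegree exponent of about 2.1 reported for the crawl.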
Model extensions
A model of Mandelbrot
Discussion of Mandelbrot’s model
Heuristically Optimized Trade-offs [Fabrikant, Koutsoupias, Papadimitriou 2002]
Monkeys on Typewriters
Other Distributions
Quick characterization of lognormal distributions
One final direction…

RinaMondal9
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 

Measurement and modeling of the web and related data sets

  • 1. IMA Tutorial (part II): Measurement and modeling of the web and related data sets. Andrew Tomkins, IBM Almaden Research Center. May 5, 2003.
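  • 2. Setup
  • 3. Context
  • 4. Focus Areas
  • 5. One view of the Internet: Inter-Domain Connectivity. Core; Shells: 1, 2, 3 [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
  • 6. Another view of the web: the hyperlink graph
  • 7. Getting started – structure at the hyperlink level [Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
  • 8. Terminology
  • 9. Data
  • 10. Breadth-first search from random starts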
  • 11. A Picture of (~200M) pages.
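  • 12. Some distance measurements
  • 13. Facts (about the crawl). The distribution of indegrees on the web is given by a power law: a heavy-tailed distribution, with many high-indegree pages (e.g., Yahoo).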
  • 14. Analysis of power law. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-2.1}$; $\Pr[\text{page has } k \text{ outlinks}] \approx k^{-2.7}$. Corollary: $\Pr[\text{page has} > k \text{ inlinks}] \approx 1/k$.
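The corollary on slide 14 follows from summing the inlink law over all degrees above k: integrating k^-2.1 gives a tail proportional to k^-1.1, which the slide rounds to 1/k. The sketch below checks this numerically. The function names and the cutoff `n_max` are illustrative assumptions, not part of the tutorial.

```python
import math

def tail_weight(k, exponent=2.1, n_max=200_000):
    """Unnormalized Pr[more than k inlinks]: sum of j**-exponent for k < j <= n_max.

    n_max is an arbitrary truncation point chosen for this sketch.
    """
    return sum(j ** -exponent for j in range(k + 1, n_max + 1))

def loglog_slope(k_lo, k_hi, exponent=2.1):
    """Slope of the tail between k_lo and k_hi on a log-log plot."""
    ratio = tail_weight(k_hi, exponent) / tail_weight(k_lo, exponent)
    return math.log(ratio) / math.log(k_hi / k_lo)

if __name__ == "__main__":
    # The normalizing constant cancels in the ratio, so it is not computed.
    print(loglog_slope(10, 100))
```

The printed slope should land close to -1.1, that is, roughly the 1/k decay stated in the corollary.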
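  • 15. Component sizes
  • 16. Other observed power laws in the web [Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]
  • 17. More Characterization: Self-Similarity
  • 18. Ways to Slice the Web. We call these slices “Thematically Unified Communities”, or TUCs.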
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. The Navigational Backbone. Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs.
  • 25. Overview. [Diagram: WWW, Distill, KB1, KB2, KB3.] Goal: Create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]
  • 26.
  • 27.
  • 28. Web Communities. [Figure: two example communities, Fishing (Outdoor Magazine, Bill's Fishing Resources) and Linux (Linux Links, LDP).] Different communities appear to have very different structure.
  • 29. Web Communities. [Figure: the same Fishing and Linux communities.] But both contain a common “footprint”: two pages that both point to three other pages in common.
  • 30. Communities and cores. Example: $K_{2,3}$. Definition: a "core" $K_{i,j}$ consists of $i$ left nodes, $j$ right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected]. 2. Almost all cores betoken a community [unexpected].
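The core definition on slide 30 can be made concrete with a brute-force search. This is a minimal sketch for toy graphs only: it tries every set of i left nodes, so it would not scale to a web crawl. The function `find_cores`, the adjacency-dict representation, and the toy page names are all hypothetical.

```python
from itertools import combinations

def find_cores(out_links, i, j):
    """Yield (left, right) tuples forming a K_{i,j}: every left node links to every right node."""
    nodes = sorted(out_links)
    for left in combinations(nodes, i):
        # Pages that all i left nodes point to in common.
        common = set.intersection(*(set(out_links[u]) for u in left)) - set(left)
        for right in combinations(sorted(common), j):
            yield left, right

# Hypothetical toy graph: two hub pages that share three destinations.
toy = {
    "outdoor_magazine": ["p1", "p2", "p3"],
    "bills_fishing": ["p1", "p2", "p3", "p4"],
    "p1": [], "p2": [], "p3": [], "p4": [],
}

if __name__ == "__main__":
    for left, right in find_cores(toy, 2, 3):
        print(left, "->", right)
```

On the toy graph the only K_{2,3} is the pair of hub pages together with p1, p2, p3, which is the "two pages pointing to three pages in common" footprint.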
  • 31. Other footprint structures: newsgroup thread, web ring, corporate partnership, intranet fragment.
  • 32.
  • 33. Enumerating cores. Preprocessing: clean data by removing mirrors (true and approximate), empty pages, too-popular pages, and nepotistic pages. Inclusion/Exclusion Pruning: [Figure: nodes a, b1, b2, b3.] a belongs to a $K_{2,3}$ if and only if some node points to b1, b2, b3. Postprocessing: when no more pruning is possible, finish using database techniques.
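Slide 33 names a pruning stage without spelling out its rules. One simple filter that fits the $K_{i,j}$ definition is degree-based: a left node needs at least j out-links and a right node needs at least i in-links, and removing one edge can disqualify others, so the filter repeats until nothing changes. The sketch below shows that idea. It is an assumption about what such pruning could look like, not the tutorial's exact procedure, and `prune_edges` is a hypothetical name.

```python
from collections import Counter

def prune_edges(edges, i, j):
    """Repeatedly drop edges that cannot belong to any K_{i,j}.

    An edge (u, v) survives a round only if u has out-degree >= j
    and v has in-degree >= i among the edges still present.
    """
    edges = set(edges)
    while True:
        out_deg = Counter(u for u, _ in edges)
        in_deg = Counter(v for _, v in edges)
        kept = {(u, v) for u, v in edges if out_deg[u] >= j and in_deg[v] >= i}
        if kept == edges:      # fixed point: no more pruning is possible
            return kept
        edges = kept

if __name__ == "__main__":
    # Hypothetical toy edge list: a and c form a K_{2,3} with b1, b2, b3.
    toy_edges = [("a", "b1"), ("a", "b2"), ("a", "b3"),
                 ("c", "b1"), ("c", "b2"), ("c", "b3"),
                 ("d", "b1"), ("e", "x")]
    print(sorted(prune_edges(toy_edges, 2, 3)))
```

Whatever survives is only a candidate set; an exact enumeration step is still needed afterwards.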
  • 34. Results for cores. [Figure: two plots, in thousands of cores. Left: number of cores found by Elimination/Generation, with series i=3, i=4, i=5, i=6. Right: number of cores found during postprocessing, with series i=3, i=4.]
  • 35.
  • 36.
  • 37.
  • 38. A word on evolution
  • 39.
  • 40. Example. [Figure: a conversation laid out along a time axis.] Utterances: “I’ve been thinking about your idea with the asparagus…”; “Uh huh”; “I think I see…”; “Uh huh”; “Yeah, that’s what I’m saying…”; “So then I said ‘Hey, let’s give it a try’”; “And anyway she said maybe, okay?” Two-state model: State 1 has a very low output rate and State 2 a very high output rate; the transitions between them are labeled 0.005 and 0.01. Per-utterance labels: Pr[2] ~ 10, 10, 7, 2, 5, 2, 5 and Pr[1] ~ 2, 1, 2, 10, 5, 10, 1. Most likely “hidden” sequence: 2 2 2 1 1 1 1.
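The "most likely hidden sequence" on slide 40 is the kind of answer Viterbi decoding produces for a two-state model. The sketch below is a generic two-state decoder, assuming a uniform starting state. The emission scores and switching probabilities in the demo are made-up illustration values, not the numbers on the slide, whose layout did not survive extraction.

```python
import math

def most_likely_states(emission_scores, switch_prob):
    """Viterbi decoding over states 1 and 2.

    emission_scores: list of (score_state1, score_state2), one pair per
                     observation, each proportional to Pr[observation | state].
    switch_prob:     dict with keys (1, 2) and (2, 1) giving the probability
                     of switching state between consecutive observations.
    """
    states = (1, 2)

    def trans(a, b):
        # Staying put gets whatever probability is left after switching.
        return switch_prob[(a, b)] if a != b else 1.0 - switch_prob[(a, 3 - a)]

    # best[s] = (log score of the best path ending in s, that path).
    best = {s: (math.log(emission_scores[0][s - 1]), [s]) for s in states}
    for scores in emission_scores[1:]:
        new_best = {}
        for s in states:
            prev = max(states, key=lambda p: best[p][0] + math.log(trans(p, s)))
            log_score = (best[prev][0] + math.log(trans(prev, s))
                         + math.log(scores[s - 1]))
            new_best[s] = (log_score, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

if __name__ == "__main__":
    # Hypothetical scores: three "busy" observations, then four "quiet" ones.
    demo_scores = [(0.1, 0.9), (0.1, 0.9), (0.2, 0.8),
                   (0.9, 0.1), (0.9, 0.1), (0.9, 0.1), (0.6, 0.4)]
    print(most_likely_states(demo_scores, {(1, 2): 0.1, (2, 1): 0.1}))
```

Making the switching probabilities small penalizes frequent state changes, which is what lets such a model report sustained bursts rather than flickering between states.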
  • 41.
  • 42. Integrating bursts and graph analysis [KNRT03]. [Figure: plots over time of the number of blog communities, the number of blog pages that belong to a community, and the number of communities identified automatically as exhibiting “bursty” behavior (a measure of cohesiveness of the blogspace).] Annotations: Wired magazine publishes an article on weblogs that impacts the tech community; Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption.
  • 43. IMA Tutorial (part III): Generative and probabilistic models of data. May 5, 2003.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54. Constructing a book: snapshot at time t. Text so far: “When in the course of human events, it becomes necessary…” Current word frequencies, by rank: rank 1: “the” (count 1000); rank 2: “of” (600); rank 3: “from” (300); …; rank 4,791: “necessary” (5); …; rank 11,325: “neccesary” (1). Let f(i,t) be the number of words of count i at time t.
  • 55.
  • 56. Constructing a book: snapshot at time t. Current word frequencies, by rank: rank 1: “the” (count 1000); rank 2: “of” (600); rank 3: “from” (300); …; rank 4,791: “necessary” (5); …; rank 11,325: “neccesary” (1). Let f(i,t) be the number of words of count i at time t. Then $\Pr[\text{“the”}] = (1-\alpha)\,1000/K$, $\Pr[\text{“of”}] = (1-\alpha)\,600/K$, and $\Pr[\text{some count-1 word}] = (1-\alpha) \cdot 1 \cdot f(1,t)/K$, where $K = \sum_i i\, f(i,t)$.
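The probabilities on slide 56 can be computed directly from a table of word counts. In this sketch `alpha` stands for the probability of introducing a new word (the symbol is an assumption, since the original character was lost in extraction), and 0.1 is an arbitrary illustration value. The five-word table reuses the counts shown on the slide and ignores the elided rows.

```python
from collections import Counter

def count_histogram(word_counts):
    """f(i, t): the number of distinct words that have occurred exactly i times."""
    return Counter(word_counts.values())

def reuse_probabilities(word_counts, alpha):
    """Pr[next word is w] = (1 - alpha) * count(w) / K for each existing word w."""
    f = count_histogram(word_counts)
    K = sum(i * f_i for i, f_i in f.items())   # K = sum_i i * f(i, t)
    return {w: (1 - alpha) * c / K for w, c in word_counts.items()}

if __name__ == "__main__":
    counts = {"the": 1000, "of": 600, "from": 300, "necessary": 5, "neccesary": 1}
    probs = reuse_probabilities(counts, alpha=0.1)
    for word, p in probs.items():
        print(word, p)
```

K is simply the number of words written so far, so the reuse probabilities sum to 1 - alpha and the remaining alpha is the chance of a brand-new word.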
  • 57. What’s going on? [Figure: buckets numbered 1 through 6, and so on; each ball is one unique word (which occurs 1 or more times).] Each word in bucket i occurs i times in the current document.
  • 58. What’s going on? [Figure: buckets 1 through 6.] With probability $\alpha$, a new word is introduced into the text.
  • 59. What’s going on? [Figure: buckets 1 through 6.] With probability $1-\alpha$, an existing word is reused. How many times do words in this bucket occur?
  • 60. What’s going on? [Figure: buckets 2, 3, and 4.] Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t. Must show: fraction of balls in 3rd bucket approaches some limiting value.
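The bucket picture on slides 57 to 60 can be simulated in a few lines. The sketch assumes a process in which each step adds a new word with probability `alpha` and otherwise repeats a word chosen with probability proportional to its current count. The parameter name `alpha`, its value, and the function name are assumptions made for illustration.

```python
import random
from collections import Counter

def simulate_book(n_words, alpha, seed=0):
    """Generate n_words word ids and return (buckets, number of unique words).

    buckets[i] is the number of unique words that occur exactly i times.
    """
    rng = random.Random(seed)
    text = [0]                      # start with a single word
    next_id = 1
    for _ in range(n_words - 1):
        if rng.random() < alpha:
            text.append(next_id)    # a new word enters bucket 1
            next_id += 1
        else:
            # A uniformly chosen earlier position picks each word with
            # probability count / K, moving it up one bucket.
            text.append(rng.choice(text))
    counts = Counter(text)
    return Counter(counts.values()), len(counts)

if __name__ == "__main__":
    for n in (20_000, 40_000):
        buckets, unique = simulate_book(n, alpha=0.1)
        print(n, [round(buckets[i] / unique, 3) for i in (1, 2, 3)])
```

If the bucket fractions do approach limiting values, the rows printed for the two text lengths should be close to each other.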
  • 61.
  • 62.
  • 63. Random graph models. Counts in G(n,p) versus the web: nodes with indeg > 1000: 0 vs. 100000; $K_{2,3}$'s: 0 vs. 125000; 4-cliques: 0 vs. many. Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?
  • 64.
  • 65.
  • 66.
  • 67.
  • 68. Example. New node arrives. With probability $\alpha$, it links to a uniformly chosen page.
  • 69. Example. With probability $(1-\alpha)$, it decides to copy a link. To copy, it first chooses a page uniformly, then chooses a uniform out-edge from that page, then links to the destination of that edge ("copies" the edge). Under copying, your rate of getting new inlinks is proportional to your in-degree.
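The arrival process on slides 68 and 69 is easy to simulate under a simplifying assumption: every page has exactly one out-link, so "choose a uniform out-edge from that page" reduces to following that page's only link. The parameter name `alpha` (the probability of linking uniformly rather than copying), the self-link used to seed page 0, and the function name are all assumptions of this sketch.

```python
import random
from collections import Counter

def simulate_copying(n_nodes, alpha, seed=0):
    """Return a Counter mapping page -> in-degree after n_nodes arrivals."""
    rng = random.Random(seed)
    targets = [0]                       # targets[v] = destination of v's one out-link
    for v in range(1, n_nodes):
        if rng.random() < alpha:
            dest = rng.randrange(v)     # link to a uniformly chosen existing page
        else:
            page = rng.randrange(v)     # choose a page uniformly ...
            dest = targets[page]        # ... and copy its out-edge
        targets.append(dest)
    return Counter(targets)             # pages with in-degree 0 are absent

if __name__ == "__main__":
    indeg = simulate_copying(20_000, alpha=1 / 11)
    print("largest in-degrees:", [d for _, d in indeg.most_common(5)])
```

Copying picks a uniformly random existing edge and reuses its destination, so a page is chosen with probability proportional to its in-degree; this is the rich-get-richer effect the slide describes.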
  • 70. Degree sequences in this model. $\Pr[\text{page has } k \text{ inlinks}] \approx k^{-(2-\alpha)/(1-\alpha)}$ ($\alpha = 1/11$ matches web). Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs.
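As a quick arithmetic check on slide 70, the exponent can be evaluated exactly at the quoted parameter value. This assumes the garbled formula reads as an in-degree exponent of (2 - alpha)/(1 - alpha); the name `alpha` is likewise an assumption.

```python
from fractions import Fraction

def indegree_exponent(alpha):
    """Exponent x in Pr[k inlinks] ~ k**-x, assuming x = (2 - alpha) / (1 - alpha)."""
    return (2 - alpha) / (1 - alpha)

if __name__ == "__main__":
    print(indegree_exponent(Fraction(1, 11)))  # prints 21/10
```

With alpha = 1/11 the exponent is exactly 2.1, which agrees with the measured inlink law $k^{-2.1}$ quoted earlier in the tutorial.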
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.