Challenges in Running a Commercial Web Search Engine

Amit Singhal
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Introduction
• Crawling
  – Follow links to find information
• Indexing
  – Record what words appear where
• Ranking
  – What information is a good match to a user query?
  – What information is inherently good?
• Displaying
  – Find a good format for the information
• Serving
  – Handle queries, find pages, display results
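The indexing step ("record what words appear where") is commonly implemented as an inverted index. The sketch below is a toy illustration only, not the design of any particular engine; the function names `build_index` and `search` and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    doc_sets = [{doc_id for doc_id, _ in index.get(w, [])} for w in words]
    return set.intersection(*doc_sets)


# Hypothetical three-document collection.
docs = {
    "a": "follow links to find information",
    "b": "record what words appear where",
    "c": "find pages and display results",
}
index = build_index(docs)
print(search(index, "find"))
print(search(index, "find pages"))
```

This only answers "which documents contain these words"; ranking the matches is a separate step.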
History
• The web happened (1992)
• Mosaic/Netscape happened (1993-95)
• Crawler happened (1994): M. Mauldin
• SEs happened 1994-1996
   – InfoSeek, Lycos, Altavista, Excite, Inktomi, …
• Yahoo decided to go with a directory
• Google happened 1996-98
   – Tried selling technology to other engines
   – SEs thought search was a commodity, portals were in
• Microsoft said: whatever …
Present
• Most search engines have vanished
• Google is a big player
• Yahoo decided to de-emphasize directories
   – Buys three search engines
• Microsoft realized Internet is here to stay
   – Dominates the browser market
   – Realizes search is critical
History
• Early systems were Information Retrieval based
  – Infoseek, Altavista, …
• Information Retrieval
  – Field started in the 1950s
  – Primarily focused on text search
  – Had already written off directories (1960s)
  – Mostly uses statistical methods to analyze text
History
• IR necessary but not sufficient for web
  search
• Doesn’t capture authority
  – To IR, an article hosted on BBC looks no better than a
    slightly modified copy on john-doe-news.com
• Doesn’t address web navigation
  – Query ibm seeks www.ibm.com
  – To IR www.ibm.com may look less topical than a
    quarterly report
History

• But there are links
  – Long history in citation analysis
  – Navigational tools on the web
  – Also a sign of popularity
  – Can be thought of as recommendations (source
    recommends destination)
  – Also describe the destination: anchor text
History

• Link analysis
  – Hubs and authority (Jon Kleinberg)
     • Topical links exploited
     • Query time approach
  – PageRank (Brin and Page)
     • Computed on the entire graph
     • Query independent
     • Faster if serving lots of queries
  – Others…
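To make "computed on the entire graph" and "query independent" concrete, here is a minimal power-iteration sketch in the spirit of PageRank. It is a textbook-style simplification, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to c, c links back to a.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Because the scores depend only on the link graph, they can be computed once offline and reused for every query, which is why this style of analysis is cheaper at serving time than a query-time approach.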
History
• Google showed link analysis can make
  a huge difference and is practical too
  – Everyone else followed

• Then there is the secret sauce
  – Link analysis
  – Information retrieval
  – Anchor text
  – Other stuff
History
• Interfaces
  – Many alternatives existed/exist
     • Simple ranked list
     • Keywords in context snippets (Google first SE to do this)
     • Topics/query suggestion tools (e.g. Vivisimo, Teoma)
     • Graphical, 2-D, 3-D
  – Simple and clean preferred by users
     • Like relevance ranking
     • Like keywords in context snippets
End Product
• As of today
  – Users give a 2-4 word query
  – SE gives a relevance ranked list of web pages
  – Most users click only on the first few results
  – Few users go below the fold
     • Whatever is visible without scrolling down

  – Far fewer ask for the next 10 results
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Oh No … This is REAL
• 80% of users use search engines to find sites
Enter the Greedy Spammer
• Users follow search results
• Money follows users, spam follows …
• There is value in getting ranked high
  – Affiliate programs
     • Siphon traffic from SEs to Amazon/eBay/…
         – Make a few bucks
     • Siphon traffic from SEs to a Viagra seller
         – Make $6 per sale
     • Siphon traffic from SEs to a porn site
         – Make $20-$40 per new member
Big Money
• Let’s do the math
• How much can the spam industry make
  by spamming search engines?
  – Assume 500M searches/day on the web
     • All search engines combined
  – Assume 5% commercially viable
     • Much more if you include porn queries
  – Assume $0.50 made per click (from 5c to $40)
  – $12.5M/day or about $4.5 Billion/year
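The arithmetic behind these figures can be checked in a few lines. The inputs are the slide's own assumptions; the variable names are just labels for this sketch, and it implicitly counts one paid click per commercially viable search.

```python
searches_per_day = 500_000_000   # all search engines combined (slide's assumption)
commercial_fraction = 0.05       # share of searches that are commercially viable
revenue_per_click = 0.50         # dollars, the slide's average of a 5c to $40 range

per_day = searches_per_day * commercial_fraction * revenue_per_click
per_year = per_day * 365

print(f"${per_day / 1e6:.1f}M per day")
print(f"${per_year / 1e9:.2f}B per year")
```

The yearly figure comes out slightly above $4.5 billion, which the slide rounds down.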
How?
• Defeat IR
  – Keyword stuffing
  – Crawler declares that it is an SE spider
  – Spammers then dish it an “optimized” page
But that should be easy…
• Just detect keyword density
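A naive keyword-density check might look like the sketch below. The 0.3 threshold and the function names are arbitrary choices for illustration; a check this simple would also flag ordinary text dominated by common words, and spammers can evade it, which is the point the following slides make.

```python
from collections import Counter


def keyword_density(text):
    """Fraction of the page's words taken up by its single most frequent word."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return counts.most_common(1)[0][1] / len(words)


def looks_stuffed(text, threshold=0.3):
    """Flag a page when one word exceeds the (assumed) density threshold."""
    return keyword_density(text) > threshold


stuffed = "cheap pills cheap pills cheap cheap cheap buy cheap"
normal = "follow links to find information on the web"
print(looks_stuffed(stuffed), looks_stuffed(normal))
```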
But that is easy too…
• Just detect that page is not about query
Legitimate NLP Parse
• Noun phrase to noun phrase
But links should help…
• No one should link to these bad sites
  – Expired domains
     • The owner of a legitimate domain doesn’t renew it
     • Spammers grab it, it already has tons of incoming links
     • E.g., anchor text for
         – The War on Freedom
         – The War on Freedom:
           How and Why America
           was attacked
         – The War on Freedom
Get Links
Guestbooks
Get Links
Mailing lists
Get Links
Link Exchange
State of Affairs
• There is big money in spamming SEs
• Easy to get links from good sites
• Easy to generate search-algorithm-friendly
  pages
• Any technique can be and will be
  attacked by spammers
• Have to make sense out of this chaos
We counter it well
• Most SEs are still very useful
  – Used over 500 million times every day
     • All search engines put together

• Our internal measurements show that
  we are winning
• Still need to be watchful
And then…
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
Information Retrieval
• Test collection paradigm of evaluation
  – Static collection of documents (few million)
  – A set of queries (around 50-100)
  – Relevance judgments
  – Extensive judgments not possible (100x1,000,000)
  – Use pooling
     • Pool top 1000 results from various techniques
     • Assume all possible relevant documents judged
     • Biased against revolutionary new methods
         – Judge new documents if needed
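Pooling, and the bias it introduces, can be sketched as follows. The system names and document ids are hypothetical, and a depth of 2 stands in for the top 1000 so the example stays readable.

```python
def build_pool(runs, depth=1000):
    """Union of the top `depth` results from every participating system."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def unjudged(ranked_docs, pool, depth=1000):
    """Documents a new system returns in its top `depth` that nobody judged."""
    return [d for d in ranked_docs[:depth] if d not in pool]


# Hypothetical ranked results from two existing systems for one query.
runs = {
    "system_1": ["d1", "d2", "d3"],
    "system_2": ["d2", "d4", "d5"],
}
pool = build_pool(runs, depth=2)

# A new method surfaces d9, which no pooled system retrieved.
new_run = ["d9", "d2", "d1"]
print(sorted(pool))
print(unjudged(new_run, pool, depth=2))
```

Only pooled documents get relevance judgments. Anything outside the pool is assumed non-relevant, so a new method that finds good documents the older systems missed is penalized until those documents are judged.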
On the Web
• Collection is dynamic
  – 10-20% urls change every month
  – Spam methods are dynamic
  – Need to keep the collection recent
• Queries are also time sensitive
  – Topics are hot then not
  – Need to keep a representative sample
On the Web
• Search space is HUGE
  – Over 200 million queries a day
  – Over 100 million are unique
  – Need 2700 queries for a 5% (700 for 10%) improvement to
    be meaningful at 95% confidence
• Search space is varied
  – Serve 90 different languages
  – Can’t have a catastrophic failure in any
  – Monitoring every part of the system is non-trivial
• IR style evaluation
  – Incredibly expensive
  – Always out of date
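The slide does not say which statistical test produces the 2700 and 700 query figures, so the sketch below does not try to reproduce them. It only illustrates, under an assumed and much simpler model, why halving the effect to be detected roughly quadruples the number of queries needed. The model: each query is a win or a loss for the new system, and we ask how many queries are needed for the 95% confidence interval on the win rate to have a given half-width. The function name and the model are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_queries(margin, confidence=0.95):
    """Queries needed so a win-rate estimate near 50% has the given half-width.

    Normal approximation to the binomial with worst-case variance p(1-p) = 0.25.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * 0.25 / (margin * margin))


n_5 = required_queries(0.05)
n_10 = required_queries(0.10)
print(n_5, n_10, round(n_5 / n_10, 2))
```

The absolute numbers here are smaller than the slide's because the assumed model is cruder, but the ratio of about 4 between the two cases matches the slide's 2700 versus 700.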
On the Web
• But what about user behavior?
  – You can use clicks as supervision.
• Clicks
  – Incredibly noisy
  – A click on a result does not mean a vote for it
     • The destination may just be a traffic peddler
     • User taken to some other site
     • If anything, this (clicked) result was BAD
Blue and Gold Fleet
We do Very Well
• Continually evaluate our system
  – In multiple languages
  – Tests valid over large traffic
  – Caught many possible disasters
• Constantly launch changes/products
  – Stemming, Google News, Froogle, Usenet, …
Overview

• Introduction/History
• Search Engine Spam
• Evaluation Challenge
• Google
  – Finding Needles in a 20 TB Haystack, 200M times per day
Past

1995 research project at Stanford University
Lego Disk Case

One of our earliest storage systems
Peak of google.stanford.edu
Growth
• Nov. 98: 10,000 queries on 25 computers
• Apr. 99: 500,000 queries on 300 computers
• Sept. 99: 3M queries on 2,100 computers
Servers 1999
Datacenters now



And 3 days later…
Where the users are…
What can we learn…

•   Structure of Web
•   Interests of Users
•   Trends and Fads
•   Languages
•   Concepts
•   Relationships
Spelling Correction: Britney Spears
Google

• Ethics
  – No pay for inclusion (in index)
  – No pay for placement (in ranking)
  – Clearly demarcated results and ads
  – 20% engineer time doing random stuff
    • Out came News, Froogle, Orkut
  – Users come first
Recent launches…
Recent launches…
Some perks…
Our Chef Charlie…
Thank You…


Amit Singhal

Haifa

  • 1. Challenges in Running a Commercial Web Search Engine Amit Singhal
  • 2. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 3. Introduction • Crawling – Follow links to find information • Indexing – Record what words appear where • Ranking – What information is a good match to a user query? – What information is inherently good? • Displaying – Find a good format for the information • Serving – Handle queries, find pages, display results
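The indexing step on slide 3 ("record what words appear where") is usually implemented as an inverted index. The following is a minimal sketch, not the structure any particular engine uses; the document ids, the sample texts, and the `build_index` and `search` helper names are all illustrative assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the (doc_id, position) pairs where it appears."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return index


def search(index, query):
    """Return the ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return set()
    doc_sets = [{doc_id for doc_id, _ in index.get(w, [])} for w in words]
    return set.intersection(*doc_sets)


# Hypothetical three-page collection.
docs = {
    "p1": "Blue and Gold Fleet ferry schedule",
    "p2": "Gold prices today",
    "p3": "Ferry tickets and schedule",
}
index = build_index(docs)
matches = search(index, "ferry schedule")
```

Storing positions as well as document ids is what later makes keywords-in-context snippets and phrase matching possible; ranking the matches is a separate step.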
  • 4. History • The web happened (1992) • Mosaic/Netscape happened (1993-95) • Crawler happened (1994): M. Mauldin • SEs happened 1994-1996 – InfoSeek, Lycos, Altavista, Excite, Inktomi, … • Yahoo decided to go with a directory • Google happened 1996-98 – Tried selling technology to other engines – SEs thought search was a commodity, portals were in • Microsoft said: whatever …
  • 5. Present • Most search engines have vanished • Google is a big player • Yahoo decided to de-emphasize directories – Buys three search engines • Microsoft realized Internet is here to stay – Dominates the browser market – Realizes search is critical
  • 6. History • Early systems were Information Retrieval based – Infoseek, Altavista, … • Information Retrieval – Field started in the 1950s – Primarily focused on text search – Had already written off directories (1960s) – Mostly uses statistical methods to analyze text
  • 7. History • IR necessary but not sufficient for web search • Doesn’t capture authority – Same article hosted on BBC as good as a slightly modified copy on john-doe-news.com • Doesn’t address web navigation – Query ibm seeks www.ibm.com – To IR www.ibm.com may look less topical than a quarterly report
  • 8. History • But there are links – Long history in citation analysis – Navigational tools on the web – Also a sign of popularity – Can be thought of as recommendations (source recommends destination) – Also describe the destination: anchor text
  • 9. History • Link analysis – Hubs and authority (Jon Kleinberg) • Topical links exploited • Query time approach – PageRank (Brin and Page) • Computed on the entire graph • Query independent • Faster if serving lots of queries – Others…
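Slide 9 describes PageRank as computed once on the entire graph and independent of the query. A minimal power-iteration sketch of that idea follows. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page toy graph are assumptions chosen for illustration, not details taken from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {d for dests in links.values() for d in dests}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            dests = links.get(page, [])
            if dests:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(dests)
                for dest in dests:
                    new_rank[dest] += share
            else:
                # Assumption: a page with no out-links spreads rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked to by three of the four pages.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
best = max(scores, key=scores.get)
```

Because the scores depend only on the link graph, they can be computed offline and looked up at query time, which is the "faster if serving lots of queries" point on the slide.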
  • 10. History • Google showed link analysis can make a huge difference and is practical too – Everyone else followed • Then there is the secret sauce – Link analysis – Information retrieval – Anchor text – Other stuff
  • 11. History • Interfaces – Many alternatives existed/exist • Simple ranked list • Keywords in context snippets (Google first SE to do this) • Topics/query suggestion tools (e.g. Vivisimo, Teoma) • Graphical, 2-D, 3-D – Simple and clean preferred by users • Like relevance ranking • Like keywords in context snippets
  • 12. End Product • As of today – Users give a 2-4 word query – SE gives a relevance ranked list of web pages – Most users click only on the first few results – Few users go below the fold • Whatever is visible without scrolling down – Far fewer ask for the next 10 results
  • 13. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 14. Oh No … This is REAL • 80% of users use search engines to find sites
  • 15. Enter the Greedy Spammer • Users follow search results • Money follows users, spam follows … • There is value in getting ranked high – Affiliate programs • Siphon traffic from SEs to Amazon/eBay/… – Make a few bucks • Siphon traffic from SEs to a Viagra seller – Make $6 per sale • Siphon traffic from SEs to a porn site – Make $20-$40 per new member
  • 16. Big Money • Let’s do the math • How much can the spam industry make by spamming search engines? – Assume 500M searches/day on the web • All search engines combined – Assume 5% commercially viable • Much more if you include porn queries – Assume $0.50 made per click (from 5c to $40) – $12.5M/day or about $4.5 Billion/year
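The estimate on slide 16 can be checked directly. The sketch below uses only the slide's stated assumptions, plus one that the slide leaves implicit: each commercially viable search yields one paid click. The variable names are illustrative.

```python
searches_per_day = 500_000_000   # all search engines combined (slide 16)
commercial_fraction = 0.05       # share of searches that are commercially viable
revenue_per_click = 0.50         # dollars; the slide gives a range of 5c to $40

# Assumption: one paid click per commercially viable search.
daily_revenue = searches_per_day * commercial_fraction * revenue_per_click
yearly_revenue = daily_revenue * 365

print(f"${daily_revenue / 1e6:.1f}M per day")
print(f"${yearly_revenue / 1e9:.2f}B per year")
```

The product is $12.5M per day, which over 365 days is roughly $4.56 billion, in line with the slide's "about $4.5 Billion/year".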
  • 17. How? • Defeat IR – Keyword stuffing – A crawler declares that it is an SE spider – They dish us an “optimized” page
  • 18. But that should be easy… • Just detect keyword density
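The keyword-density check suggested on slide 18 could look something like the sketch below. The tokenizer, the 0.2 threshold, and the sample texts are hypothetical, and the following slides make the point that a check this simple is easy for spammers to evade.

```python
import re


def keyword_density(text, keyword):
    """Fraction of the words in text that are exactly the keyword."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


def looks_stuffed(text, keyword, threshold=0.2):
    # The threshold is an arbitrary illustrative value, not a known cutoff.
    return keyword_density(text, keyword) > threshold


stuffed_page = "cheap tickets cheap tickets buy tickets tickets tickets online"
normal_page = "The ferry office sells tickets from nine until five on weekdays"

stuffed_flag = looks_stuffed(stuffed_page, "tickets")
normal_flag = looks_stuffed(normal_page, "tickets")
```

Here the stuffed page has five occurrences in nine words and is flagged, while the ordinary sentence has one in eleven and is not.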
  • 19. But that is easy too… • Just detect that page is not about query
  • 20. Legitimate NLP Parse • Noun phrase to noun phrase
  • 21. But links should help… • No one should link to these bad sites – Expired domains • The owner of a legitimate domain doesn’t renew it • Spammers grab it, it already has tons of incoming links • E.g., anchor text for – The War on Freedom – The War on Freedom: How and Why America was attacked – The War on Freedom
  • 25. State of Affairs • There is big money in spamming SEs • Easy to get links from good sites • Easy to generate search algorithm friendly pages • Any technique can be and will be attacked by spammers • Have to make sense out of this chaos
  • 26. We counter it well • Most SEs are still very useful – Used over 500 million times every day • All search engines put together • Our internal measurements show that we are winning • Still need to be watchful
  • 28. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google
  • 29. Information Retrieval • Test collection paradigm of evaluation – Static collection of documents (few million) – A set of queries (around 50-100) – Relevance judgments – Extensive judgments not possible (100x1,000,000) – Use pooling • Pool top 1000 results from various techniques • Assume all possible relevant documents judged • Biased against revolutionary new methods – Judge new documents if needed
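The pooling procedure on slide 29, and the bias it introduces against new methods, can be illustrated with a toy example. The system names, document ids, pool depth of 2 (the slide uses 1000), and the set of truly relevant documents are all made up for the sketch.

```python
def build_pool(runs, depth):
    """Union of the top `depth` results from each system's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, relevant, k):
    """Share of the top k results that are in the relevant set."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Rankings from the systems that existed when the collection was judged.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)

# Only pooled documents are ever judged; anything else counts as non-relevant.
truly_relevant = {"d1", "d5", "d9"}
judged_relevant = pool & truly_relevant

# A new method surfaces d9, which no earlier system placed in the pool.
new_run = ["d9", "d1", "d7"]
measured = precision_at_k(new_run, judged_relevant, k=2)
actual = precision_at_k(new_run, truly_relevant, k=2)
```

The new method's measured precision is half its actual precision because its best result was never judged, which is why the slide adds "judge new documents if needed".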
  • 30. On the Web • Collection is dynamic – 10-20% urls change every month – Spam methods are dynamic – Need to keep the collection recent • Queries are also time sensitive – Topics are hot then not – Need to keep a representative sample
  • 31. On the Web • Search space is HUGE – Over 200 million queries a day – Over 100 million are unique – Need 2700 queries for a 5% (700 for 10%) improvement to be meaningful at 95% confidence • Search space is varied – Serve 90 different languages – Can’t have a catastrophic failure in any – Monitoring every part of the system is non-trivial • IR style evaluation – Incredibly expensive – Always out of date
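The two sample sizes on slide 31 (2700 queries for a 5% improvement, 700 for 10%) are consistent with the usual rule that the number of samples needed grows with the inverse square of the effect to be detected. The slide does not say which test or variance estimate produced its figures, so the sketch below shows only that scaling; the function name is illustrative.

```python
def scale_sample_size(reference_n, reference_effect, target_effect):
    """Rescale a known sample size, assuming n is proportional to 1/effect^2."""
    return reference_n * (reference_effect / target_effect) ** 2


# Start from the slide's figure of 700 queries for a 10% improvement.
needed_for_5_percent = scale_sample_size(700, 0.10, 0.05)
```

Halving the detectable improvement quadruples the estimate to 2800 queries, close to the 2700 quoted on the slide.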
  • 32. On the Web • But what about user behavior? – You can use clicks as supervision. • Clicks – Incredibly noisy – A click on a result does not mean a vote for it • The destination may just be a traffic peddler • User taken to some other site • If anything, this (clicked) result was BAD
  • 33. Blue and Gold Fleet
  • 34. We do Very Well • Continually evaluate our system – In multiple languages – Tests valid over large traffic – Caught many possible disasters • Constantly launch changes/products – Stemming, Google News, Froogle, Usenet, …
  • 35. Overview • Introduction/History • Search Engine Spam • Evaluation Challenge • Google – Finding Needles in a 20 TB Haystack, 200M times per day
  • 36. Past 1995 research project at Stanford University
  • 37. Lego Disk Case One of our earliest storage systems
  • 39. Growth • Nov. 98: 10,000 queries on 25 computers • Apr. 99: 500,000 queries on 300 computers • Sept. 99: 3M queries on 2,100 computers
  • 41. Datacenters now And 3 days later…
  • 42. Where the users are…
  • 43. What can we learn… • Structure of Web • Interests of Users • Trends and Fads • Languages • Concepts • Relationships
  • 45. Google • Ethics – No pay for inclusion (in index) – No pay for placement (in ranking) – Clearly demarked results and ads – 20% engineer time doing random stuff • Out came news, froogle, orkut – Users come first