The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Group Members
• Laksri Wijerathna
• Harini Sirisena
• Himali Erangika
• Erica Jayasundara
What is this paper about..?
• Presents Google, a prototype of a large-scale
search engine.
• The paper was written by Sergey Brin and
Lawrence Page, then PhD students at
Stanford University, USA.
Outline
• Problem
• Solution
• Results
• Conclusion
• Future work
• Comparison with today
• Q&A
Problem Description..
Problems Addressed
• Addresses the question “How to build a practical
large-scale system which can exploit the
additional information present in hypertext”.
• Also addresses “How to effectively deal with
uncontrolled hypertext collections where
anyone can publish anything they want”.
Problems with Automated Search
Engines
• Automated search engines that rely on keyword
matching usually return too many low quality
matches.
• Some advertisers attempt to gain people’s
attention by taking measures meant to mislead
automated search engines.
Web Search Engine Scaling up
(1994-2000)
• Web search engines have to scale to keep up with
the growth of the web.
• In 1994, the World Wide Web Worm (WWWW), one of
the first web search engines, had an index of
110,000 web pages and web-accessible documents.
• In 1994, WWWW handled 1,500 queries per day.
• In 1997, AltaVista handled roughly 20 million
queries per day.
Goal of Proposed System
• To address many of the problems, both in
quality and scalability
Challenges for Google in Scaling with
the Web
• Fast crawling technology is needed to gather the
web documents and keep them up to date.
• Storage space must be used efficiently to store
indices and documents.
• Indexing system must process hundreds of
gigabytes of data efficiently.
• Queries must be handled quickly, at a rate of
hundreds to thousands per second.
Problems for Google in Scaling with
the Web
• Tasks are becoming increasingly difficult as the
Web grows.
• Bottlenecks such as disk seek time and
operating system robustness cannot be avoided.
Design Goals Background
• In 1994 some people believed that a complete
search index would make it possible to find
anything easily.
• But the index is not the only factor in the quality of
search results; “junk results” can wash out any
results that a user is interested in.
Design Goals- Main Cause
• The main cause of this problem is that the number of
documents in the indices has been increasing by
many orders of magnitude, but the user’s ability
to look at documents has not: people still only
look at the first few tens of results.
• Therefore, as the collection size grows, tools with
very high precision are needed, so that the
relevant documents appear in the top ten
results.
Design Goals
• To push more development and understanding
into the academic realm
• To build a system that reasonable numbers of
people can actually use.
• To build an architecture that can support novel
research activities on large-scale web data.
• To set up a Spacelab-like environment where
researchers or even students can propose and do
interesting experiments.
Solution described..
Google System Features
• Make use of link structure of the Web to
calculate a quality ranking for each web page. –
PageRank
• Intuitive Justification
• Utilizes links to improve the search results –
Anchor Text.
PageRank
• Link analysis algorithm, named after Larry Page.
• Assigns a numerical weight to each element of a
hyperlinked set of documents.
• Gives a higher rank to pages that are linked to
from other pages (popularity-based).
How the PageRank works
• To approximate a page’s importance or quality,
we can count the citations or backlinks
to a given page.
• PageRank extends this idea by not counting
links from all pages equally, but by normalizing
by the number of links on a page.
PageRank Definition
• PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• Assume page A has pages T1..Tn which point to A
• d – damping factor, which can be set between 0 and 1; usually
set to 0.85
• C(T1) – number of links going out of page T1
• PageRank forms a probability distribution over web
pages, so the sum of all web pages’ PageRanks will be one.
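The formula above can be computed iteratively; the sketch below uses a made-up three-page link graph. Note that with the formula exactly as written, the ranks sum to the number of pages; dividing the (1-d) term by the page count gives the variant that sums to one.

```python
# Iterative PageRank sketch following the slide's formula:
#   PR(A) = (1 - d) + d * sum(PR(T)/C(T) for each page T linking to A)
# The link graph below is a toy example, not real data.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            backlinks = [q for q in pages if p in links[q]]
            new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in backlinks)
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# "C" ends up with the highest rank: it is pointed to by both A and B.
```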
PageRank Definition cont
• PageRank is the probability that a random
surfer visits a page.
(“random surfer” – a person who is given a web
page at random and keeps clicking on links, never
hitting “back”, but who eventually gets bored and starts on
another random page)
• d – the damping factor is the probability, at each
page, that the random surfer will get bored and
request another random page.
Intuitive Justification
A page’s PageRank can be high if:
• There are many pages that point to it, or
• There are a few pages that point to it which
themselves have a high PageRank.
• Intuitively, pages that are well cited from many places
around the web are worth looking at, as are pages
with perhaps only one citation
from somewhere important like Yahoo!
Anchor Text
• Is the visible, clickable text in a hyperlink.
• Anchor text is treated in a special way in
Google.
• Advantages:
◦ Often provides a more accurate description of a web page
than the page itself.
◦ Anchors may exist for documents which cannot
be indexed by a text-based search engine, such as
images, programs and databases. This makes it
possible to return web pages which have not
actually been crawled.
Google - other features
• Has location information for all hits, and so
makes extensive use of proximity in search.
• Keeps track of some visual presentation details,
such as font size – words in a larger or
bolder font are weighted higher than others.
• Full raw HTML of pages is available in a
repository.
Google Architecture
Major Data Structures
• BigFiles
• Repository
• Document Index
• Lexicon
• Hit Lists
• Forward Index
• Inverted Index
BigFiles
• Virtual files spanning multiple file systems.
• Addressable by 64 bit integers.
• Allocation among multiple files handled
automatically.
• Handles allocation and deallocation of file
descriptors.
• Also supports rudimentary compression options.
Repository
• Contains the full HTML of every web page.
• Each page is compressed using zlib.
• The choice of compression technique is a tradeoff
between speed and compression ratio.
• The documents are stored one after the other,
each prefixed by docID, length, and URL.
• The repository requires no other data
structures in order to access it.
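The record layout above might be sketched as follows. The exact field widths and byte order are illustrative assumptions, not the paper's actual on-disk format.

```python
# Sketch of a repository record: zlib-compressed HTML prefixed by
# docID, URL length, and payload length (field widths are assumptions).
import struct
import zlib

HEADER = "<QII"  # 64-bit docID, 32-bit URL length, 32-bit payload length

def pack_record(docid, url, html):
    payload = zlib.compress(html.encode())
    header = struct.pack(HEADER, docid, len(url), len(payload))
    return header + url.encode() + payload

def unpack_record(buf):
    docid, url_len, payload_len = struct.unpack_from(HEADER, buf)
    off = struct.calcsize(HEADER)
    url = buf[off:off + url_len].decode()
    html = zlib.decompress(buf[off + url_len:off + url_len + payload_len]).decode()
    return docid, url, html

rec = pack_record(7, "http://example.com/", "<html>hello</html>")
assert unpack_record(rec) == (7, "http://example.com/", "<html>hello</html>")
```

Because each record carries its own lengths, the repository can be scanned sequentially without consulting any other data structure, as the slide notes.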
Document Index
• Keeps information about each document
• Information Includes
▫ Current document status
▫ Pointer into the repository
▫ Document checksum
▫ Various Statistics
• Is a fixed-width ISAM (Index Sequential Access
Mode) index, ordered by docID.
• Each crawled document’s entry contains a pointer into a
variable-width file called “docinfo” which
contains the URL and title.
Lexicon
• The current implementation keeps the lexicon in
memory on a machine with 256 MB of main
memory; at 14 million words, the lexicon fits
in memory on a machine of reasonable price.
• Lexicon Implementation has two parts
1. A list of the words
2. Hash table of pointers
Hit Lists
• A list of occurrences of each word in a particular
document
▫ Position
▫ Font
▫ Capitalization
• The hit list accounts for most of the space used in
both indices
• Uses a special compact encoding algorithm
▫ Requiring only 2 bytes for each hit
• The hit lists are very important in calculating the
Rank of a page
• There are two different types of hits:
Hit Lists cont
• Plain Hits:
▫ Capitalization bit
▫ Font size (relative to the rest of the page) -> 3 bits
▫ Word position in document -> 12 bits
• Fancy Hits (found in URL, title, anchor text, or meta tag)
▫ Capitalization bit
▫ Font size – set to 7 to indicate a Fancy Hit -> 3 bits
▫ Type of fancy hit -> 4 bits
▫ Position -> 8 bits
• If the type of fancy hit is an anchor, the position is split:
▫ 4 bits for position in anchor
▫ 4 bits for a hash of the docID the anchor occurs in
• The length of the hit list is stored before the hits
themselves
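The 2-byte plain-hit encoding above can be sketched with simple bit packing. The bit order chosen here (capitalization high bit, then font, then position) is an assumption; the slide only gives the field widths.

```python
# Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
# 12 position bits. Bit ordering is illustrative, not the paper's.

def pack_plain_hit(cap, font, pos):
    assert 0 <= font < 8 and 0 <= pos < 4096
    return (cap & 1) << 15 | font << 12 | pos

def unpack_plain_hit(hit):
    return hit >> 15 & 1, hit >> 12 & 0x7, hit & 0xFFF

h = pack_plain_hit(1, 3, 1000)
assert h < 1 << 16                       # fits in 2 bytes
assert unpack_plain_hit(h) == (1, 3, 1000)
```

This kind of compact, hand-tuned encoding is why the hit lists dominate index space yet remain affordable.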
Forward Index
• Stored in 64 barrels, each covering a range of
wordIDs. For each document containing words in
a barrel’s range, the barrel stores the docID
followed by a list of wordIDs with their
corresponding hit lists.
• Actual wordIDs are not stored in the barrels;
instead, the difference between the wordID and
the minimum of the barrel is stored. This requires
only 24 bits for each wordID, leaving 8 bits to
hold the hit-list length.
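The delta encoding above can be sketched like this; packing the 24-bit delta into the high bits and the 8-bit hit count into the low bits of one 32-bit word is an illustrative assumption.

```python
# Forward-barrel entry sketch: wordID stored as a 24-bit offset from the
# barrel's minimum wordID, sharing a 32-bit word with an 8-bit hit count.

def pack_entry(wordid, barrel_min, hit_count):
    delta = wordid - barrel_min
    assert 0 <= delta < 1 << 24 and 0 <= hit_count < 1 << 8
    return delta << 8 | hit_count

def unpack_entry(entry, barrel_min):
    return (entry >> 8) + barrel_min, entry & 0xFF

e = pack_entry(5_000_123, 5_000_000, 17)
assert unpack_entry(e, 5_000_000) == (5_000_123, 17)
```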
Inverted Index
• Contains the same barrels as the forward index,
except that they have been sorted by docID.
• Every word in the lexicon points to a doclist
containing all docIDs with their corresponding
hit lists.
• The barrels are duplicated
for speed in single-word searches.
Crawling the web
• 1 URL server.
• 3 crawler machines.
• Both the URL server and the crawlers are
implemented in Python.
• Each crawler keeps up to 300 connections open
at once.
• At peak, the system crawled over 100 pages (about
600 KB) per second, using 4 crawlers.
Crawler actions
1. Looking up DNS
2. Connecting to a host
3. Sending request
4. Receiving response
• Note – each crawler maintains a DNS cache to
speedup DNS lookup process
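The per-crawler DNS cache noted above can be a simple memoized lookup. The resolver function is injected here so the sketch stays self-contained; a real crawler would wrap something like `socket.gethostbyname`.

```python
# Per-crawler DNS cache sketch: memoize hostname -> address lookups so
# repeated requests to the same host skip the DNS round trip.

def make_cached_resolver(resolve):
    cache = {}
    def lookup(host):
        if host not in cache:
            cache[host] = resolve(host)  # only the first lookup is slow
        return cache[host]
    return lookup

calls = []
def fake_resolve(host):
    calls.append(host)
    return "93.184.216.34"  # made-up address for illustration

lookup = make_cached_resolver(fake_resolve)
lookup("example.com")
lookup("example.com")
assert len(calls) == 1  # second lookup was served from the cache
```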
Issues
• A large volume of email and phone calls
• Accidentally crawling an online game
• The robots exclusion protocol not being followed
Indexing the web
1. Parsing – custom flex-based parser (built for robustness and speed)
2. Indexing documents into barrels – hit list and
forward index constructed.
3. Sorting – done 1 barrel at a time, but multiple
sorters run in parallel. Constructs short
inverted barrel and full inverted barrel.
Searching
1. Parse query
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short
barrel for every word.
4. Scan through the doclist until there is a
document that matches all the terms.
Searching contd..
5. Compute the rank of that document, for that query
6. If we are in the short barrels and at the end of any
doclist, seek to the start of the doclist in the full
barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank
and return the top k (40K for efficiency)
Note – the cutoff in step 8 may cause suboptimal results.
This can be avoided by ranking on hits.
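Since the doclists in the barrels are sorted by docID, the scan in step 4 for a document matching all terms amounts to a sorted-list intersection, roughly as below (the doclists are made up).

```python
# Sorted-doclist intersection sketch: advance pointers through each
# sorted list, keeping only docIDs present in every list.

def intersect(doclists):
    """Return docIDs present in every sorted doclist."""
    ptrs = [0] * len(doclists)
    out = []
    while all(p < len(l) for p, l in zip(ptrs, doclists)):
        current = [l[p] for p, l in zip(ptrs, doclists)]
        hi = max(current)
        if all(c == hi for c in current):
            out.append(hi)                  # every list agrees: a match
            ptrs = [p + 1 for p in ptrs]
        else:
            # advance every list that is behind the largest current docID
            ptrs = [p + (l[p] < hi) for p, l in zip(ptrs, doclists)]
    return out

assert intersect([[1, 3, 5, 9], [3, 4, 9], [2, 3, 9, 11]]) == [3, 9]
```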
Ranking system
• Example of a single-word search:
- Type-weights vector (a different weight for each text type,
such as title, anchor, URL, large text, etc.)
- Count-weights vector (a weight for the number of occurrences;
increases linearly at first, then tapers off)
- The IR score is the dot product of the count-weights vector and
the type-weights vector.
- This score is combined with PageRank to give the final score for
the page.
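The single-word scoring above might look like the sketch below. All weight values, the taper cap, and the way the IR score is combined with PageRank are invented for illustration; the paper does not publish its actual weights.

```python
# Single-word ranking sketch: dot product of type weights and tapering
# count weights, combined with PageRank. Every number here is made up.

TYPE_WEIGHTS = {"title": 8, "anchor": 6, "url": 5, "plain": 1}

def count_weight(n, cap=4):
    return min(n, cap)  # increases linearly, then tapers off

def ir_score(hits_by_type):
    return sum(TYPE_WEIGHTS[t] * count_weight(n) for t, n in hits_by_type.items())

def final_score(hits_by_type, pagerank, alpha=0.5):
    # One simple way to combine the IR score with PageRank; the paper
    # does not specify the exact combination function.
    return alpha * ir_score(hits_by_type) + (1 - alpha) * pagerank

score = final_score({"title": 1, "plain": 10}, pagerank=3.0)
assert score > final_score({"plain": 10}, pagerank=3.0)  # a title hit boosts the score
```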
Ranking system contd..
• For multi-word queries, word proximity is considered
(10 different classifications, from an exact phrase
match to “not even close”). A type-prox-weights
vector is used instead.
Improving using feedback
• Uses feedback to figure out type-weights and
type-prox-weights.
• A trusted user evaluates the search results and
gives feedback. Weight vectors are modified
accordingly.
Results..
Search results
Storage
System performance
• Once the system was running smoothly, it
downloaded 11 million pages in 63 hours (about
48.5 pages per second).
• The indexer runs just faster than the crawlers, to
prevent it becoming a bottleneck.
• The indexer ran at roughly 54 pages per second.
The sorters ran in parallel on 4 machines;
sorting took about 24 hours.
Web page statistics
Search performance
• The focus was on improving search result quality,
not performance.
• Most queries are answered in 1-10 seconds, mostly
dominated by disk I/O, since the disks are spread
over a number of machines.
• There are no optimizations such as query caching
or sub-indices on common terms.
• Search can be sped up with distribution and
algorithm improvements.
Performance data
Conclusion..
Conclusion
• Google was designed to:
1. Be a scalable search engine.
2. Provide high-quality search results.
Considering the success of the Google search
engine, we can conclude that the original goals
were met to a very high degree.
Summary of Key Optimization
Techniques
• Each crawler maintains its own DNS lookup
cache.
• Use flex to generate lexical analyzer with own
stack for parsing documents.
• Parallelization of indexing phase.
• In-memory lexicon.
Summary of Key Optimization
Techniques contd..
• Compression of repository.
• Compact encoding of hitlists accounting for
major space savings.
• Document index is updated in bulk.
• Critical data structures placed on local disk.
Future work..
Future work
• Scale to approximately 100 million web pages
• Query efficiency handling by
▫ Query caching
▫ Smart disk allocation
▫ Sub indices
• Proxy caches to build search databases
Future work contd..
• Improvements found in commercial search
engines, such as:
▫ Boolean operators
▫ Negation
▫ Stemming
• Support of user context and result
summarization
• Increase the weight of bookmarks
• Support of text surrounding links in addition to
the link text itself
Comparison with today..
Google improvements
• Google File System - 2003
• MapReduce (Simplified Data Processing on Large
Clusters) - 2008
• BigTable (A Distributed Storage System for
Structured Data) - 2008
• Percolator (Large-scale Incremental Processing
Using Distributed Transactions and
Notifications) - 2010
Google File System
• Google's core data storage and usage needs
• An evolution of the BigFiles system developed by
Larry Page and Sergey Brin
• Designed for system-to-system interaction, and
not for user-to-system interaction. The chunk
servers replicate the data automatically.
Architecture of GFS
• Files are divided into chunks
• Fixed-size chunks (64MB)
• Replicated over chunkservers, called replicas
• Unique 64-bit chunk handles
• Chunks as Linux files
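With fixed 64 MB chunks, mapping a byte offset in a file to a chunk index is simple arithmetic; a sketch, not the real GFS client protocol (which then asks the master for the chunk handle and replica locations).

```python
# GFS-style chunk arithmetic: which 64 MB chunks does a read touch?

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def chunk_index(offset):
    return offset // CHUNK_SIZE

def chunk_span(offset, length):
    """Indices of all chunks a read of `length` bytes at `offset` touches."""
    return list(range(chunk_index(offset), chunk_index(offset + length - 1) + 1))

assert chunk_index(0) == 0
assert chunk_index(CHUNK_SIZE) == 1
assert chunk_span(CHUNK_SIZE - 10, 20) == [0, 1]  # read straddles a chunk boundary
```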
Advantages of GFS
• Scalability
• Autonomic computing (a concept in which
computers diagnose problems and solve them in
real time without the need for human
intervention)
• Chunk handles (files are split into chunks of
64 megabytes (MB), each with a unique handle)
• High availability and component failure
• Fault tolerance
MapReduce
Typical Challenges in Parallel Processing:
• How to assign tasks to the workers?
• What if we have more tasks than workers?
• What if the workers need to share partial
results?
• How do we aggregate partial results?
• How do we know whether the workers have
finished?
• What if the workers fail?
MapReduce contd..
Functional Programming
+
Distributed Processing Platform
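A minimal single-process word count in the MapReduce style shows how the model answers the challenges listed above: the framework handles grouping and aggregating partial results, while the programmer supplies only the map and reduce functions.

```python
# Word count in the MapReduce style, run in a single process. A real
# MapReduce framework distributes these same phases across workers.
from collections import defaultdict

def map_fn(doc):
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(docs, map_fn, reduce_fn):
    shuffle = defaultdict(list)          # group intermediate pairs by key
    for doc in docs:
        for k, v in map_fn(doc):
            shuffle[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in shuffle.items())

result = mapreduce(["the quick fox", "the fox"], map_fn, reduce_fn)
assert result == {"the": 2, "quick": 1, "fox": 2}
```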
Percolator
• Google’s indexing system stores tens of
petabytes of data and processes billions of
updates per day on thousands of machines.
• MapReduce and other batch-processing systems
cannot process small updates individually as
they rely on creating large batches for efficiency.
Percolator contd..
• Percolator is a system for incrementally
processing updates to a large data set; it was
deployed to create the Google web search index.
• By replacing a batch-based indexing system with
an indexing system based on incremental
processing it is possible to process the same
number of documents per day, while reducing
the average age of documents in Google search
results by 50%.
Google Challenges today..
• SEO (Search Engine Optimization) techniques
used by advertisers adversely affect the results
of users’ queries: SEO tools are used to
artificially improve page rankings.
Google Challenges today contd..
• Filter bubbles – customized results based on the
user’s activity history.
• Algorithms selectively guess what information a
user would like to see, based on information
about the user:
1. location,
2. past click behavior,
3. search history.
Modern Search Engine Capacities
New application areas of search
engines
• Image search engines using content-based
image retrieval (CBIR)
• Audio search engine
• Video search engine
• Semantic search engine
• Enterprise search engine
Q&A
Thank you..


Editor's Notes

  • #39 Indexer – after each document is parsed, a lexicon is used to convert words to wordIDs. Then a hit list is constructed for every word and written into the “forward barrels”. Words not in the lexicon are logged (a shared log, with separate lexicons per crawler). The logged words are finally indexed by a final indexer. The forward barrels map docID -> wordIDs (giving the hits per wordID in a doc). Sorter – a barrel won’t fit fully into memory, so baskets are sorted and written to the short and full inverted barrels.
  • #69 Note – black hat and white hat SEO.
  • #70 Notes – Google+, social web