Building Your Own Search Engine From Scratch
What Underlies Modern Search? Simple Concepts.
How many times do you search the web daily? 5, 20, 50? If Google is
your search engine of choice, you can look at your history of searches
here.
Despite how deeply search underlies our daily activities and
interaction with the world, few of us understand how it works. In this
post, I work to illuminate the underpinnings of search. This is from
implementing a search engine, based on the original Google
implementation.
Photo by Benjamin Dada on Unsplash
First, we will look at a preliminary step: understanding web
servers. What is a client-server infrastructure? How does your
computer connect to a website?
You will get to see what happens when a search engine connects with
your computer, a website, or anything else.
Then, we will look through the three core components of a search
engine: the crawler, the indexer, and the PageRank algorithm. Each of
these will give you a rich understanding of the connections that make
up the spider web that is the internet.
Finally, we will go through how those components are combined to
produce our final prize: the search engine! Ready to dive in?
Let’s go!
Part 0: Web Servers
The mighty web server! A web server is what your computer contacts
every time you search a URL in your browser. Your browser acts as a
client, sending a request, similar to a business client. The server is the
sales representative who takes all those requests, processing them in
parallel.
The requests are text. The server knows how to read them, as it is
expecting them in a specific structure (the most common
protocol/structure now is HTTP/1.1).
A sample request:
GET /hello HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
Host: www.sample-why-david-y.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: user=why-david-y
The request can carry arguments in its headers, such as the cookies
listed above. It may have a body with more information. The response
follows a similar format, allowing the server to return headers and a
body for clients to read. With the internet becoming more interactive,
much of the hard work of generating content is done on the server.
If you ever want to program a web server or client, there are many
libraries available to you to do most of the parsing and low-level work.
Just keep in mind that client requests and server responses are just a
way of structuring text. This means we are all speaking a common
language!
Part 1: Crawlers
If you are building a search engine, the crawler is where you spend a
good chunk of time. The crawler browses the open internet, starting
with a predefined list of seeds (e.g. Wikipedia.com, WSJ.com,
NYT.com). It will read each page, save it, and add new links to its
URL frontier, which is its queue of links to crawl.
Many domains also have a robots.txt file, such as
google.com/robots.txt. This page specifies rules the crawler must
respect to avoid breaking any laws or being treated as a spammer. For
example, certain subdomains cannot be crawled and there may be a
minimum time between each crawl.
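The crawl loop and the robots.txt check can be sketched as follows. `fetch_page` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML parser; `urllib.robotparser` is Python's standard robots.txt reader.

```python
from collections import deque
from urllib import robotparser

def crawl(seeds, fetch_page, extract_links, rules, max_pages=100):
    """Breadth-first crawl: pop a URL from the frontier, save the page,
    and push any newly discovered links back onto the frontier."""
    frontier = deque(seeds)   # the URL frontier: links waiting to be crawled
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited or not rules.can_fetch("toy-crawler", url):
            continue          # skip repeats and respect robots.txt
        visited.add(url)
        pages[url] = fetch_page(url)          # read and save the page
        for link in extract_links(pages[url]):
            if link not in visited:
                frontier.append(link)
    return pages
```

Because `visited` is checked before every fetch, redirect cycles like A → B → C → A cannot trap this loop; a real crawler would also rate-limit requests per domain.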
Why is so much time spent here? The internet is very much
unstructured, an anarchist’s dream. Sure, we may have some
norms that we agree on, but you will never realize how much
they get broken until you write a crawler.
For example, say your crawler reads HTML pages, as those have
structure. The author could still, for example, put non-links in link
tags that break some implicit logic in your program. There could be
emails (test@test.com), sentences, and other text that your
verification may miss.
What if you are crawling a page that generates dynamic content and
looks different on every visit, such as one that includes the current
time? What if page A redirects to B, B redirects to C, and C redirects
back to A? What if a calendar has countless links to future years or
days?
These are some of many cases that can arise when crawling
millions of pages, and every edge case needs to be covered or
recoverable.
Part 2: Indexing
Once you have the crawled content saved in a database, next comes
indexing! When a user searches a term, they want accurate results
quickly. This is where indexing comes in. You decide what
metrics matter the most to you, then you pull them from the crawled
document. Here are some common ones:
● Forward Index: This is a data structure holding a list of
documents with their associated words, in order. For
example:
document1, <word1, word2, word3>
document2, <word2, word3>
● Inverted Index: This is a data structure mapping each word
to the list of documents containing that word. For example:
word1, <document2, document3, document4>
word2, <document1, document2>
● Term Frequency (TF): This is a metric stored for each
unique word in each document. It is commonly calculated as
the number of occurrences of that word divided by the
number of words in a document, resulting in a value between
0 and 1. Some words may get weighted more heavily (e.g.
special tags) and the TF may be normalized, preventing
extreme values.
● Inverse Document Frequency (IDF): This is a metric
stored for each unique word. It is commonly calculated as the
total number of documents divided by the number of documents
containing that word, often with a logarithm applied to dampen
the value. Given that it requires the total number of
documents, it is usually calculated after crawling or at query
time. It may be normalized to prevent extreme values.
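Here is a sketch of how these four structures could be built from tokenized documents. The `log(N / df)` form of IDF is one common choice; the function name and document ids are our own.

```python
import math
from collections import defaultdict

def build_index(docs: dict):
    """docs maps a document id to its list of words, in order."""
    forward = {doc: list(words) for doc, words in docs.items()}  # forward index
    inverted = defaultdict(set)                                  # inverted index
    tf = {}
    for doc, words in docs.items():
        counts = defaultdict(int)
        for word in words:
            inverted[word].add(doc)
            counts[word] += 1
        # TF: occurrences of the word / total words in the document
        tf[doc] = {w: c / len(words) for w, c in counts.items()}
    # IDF: log(total documents / documents containing the word)
    n = len(docs)
    idf = {w: math.log(n / len(ds)) for w, ds in inverted.items()}
    return forward, inverted, tf, idf
```

Note that a word appearing in every document gets an IDF of zero: it carries no signal for ranking, which is exactly the behavior we want from this metric.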
With these four values, you can design an indexer that
allows you to return accurate results. With the optimizations in
modern databases, the results will also be reasonably fast. Using
MongoDB, our project used these to return results in approximately 2
seconds, even for longer queries. You can do even more with just these
four metrics — for example, allowing exact match queries.
These were essential metrics used by search engines in the early days.
Now, search engines use these plus many more to fine-tune their
results further.
How do we combine these to generate results? We will discuss that in
the integration section.
Part 3: PageRank
PageRank is an algorithm that determines the authoritativeness of
a page on the internet. Say someone is searching “Earth Day.” We
need to look at how trustworthy a page is. If we do not, our search
engine could send them to a random blog page that says “Earth Day”
over and over, rather than a Wikipedia or EarthDay.org page. With
the prevalence of SEO and marketers attempting to drive
traffic to a page, how do we ensure users get quality results?
PageRank looks at links between pages, treating them as a graph (a set
of nodes and edges). Each edge is a connection between two nodes
in the direction that it points (source URL to destination URL).
In each iteration, the algorithm looks at all URLs pointing to a page,
say Google.com. It gives Google some percentage of its referrers’
PageRank, based on how many other URLs those pages also point to.
After a few iterations, the PageRank values are relatively stable and
the algorithm terminates.
Source: https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg
There are other tricks used, like a random surfer, which assumes that
some percentage of the time the user gets bored and clicks to a new
page. These tricks aim to avoid corner cases with PageRank. For
example, sinks are pages that can absorb all the PageRank due to
having no outbound links.
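The iteration just described can be sketched as follows, with the random-surfer behavior expressed as a damping factor `d` (0.85 is the commonly cited value). Sink pages spread their rank evenly across all pages so nothing gets absorbed; the function name is our own.

```python
def pagerank(links: dict, d: float = 0.85, iters: int = 50) -> dict:
    """links maps each URL to the list of URLs it points to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iters):
        # (1 - d): the bored random surfer jumps to an arbitrary page
        new = {p: (1 - d) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if targets:
                # each page splits its rank among the pages it links to
                for t in targets:
                    new[t] += d * rank[src] / len(targets)
            else:
                # sink page: redistribute its rank evenly to avoid absorption
                for p in pages:
                    new[p] += d * rank[src] / n
        rank = new
    return rank
```

A fixed iteration count stands in for a real convergence check; since the total rank is conserved each round, the values settle quickly on small graphs.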
Putting It All Together
You now have the main pieces of a search engine.
When a user searches a phrase, you look up which documents have
each of the query terms in them. Your database returns documents
that match all terms.
For every document, you can take the TFIDF (TF * IDF) for each query
term and sum them together. Then, combine the sum with that page’s
PageRank (e.g. multiplying them together). This is more an art
than a science, so leave time to see what works, fine-tuning as
you go.
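The scoring recipe above, sketched using the TF, IDF, and PageRank structures from the earlier sections (the exact combination is the tunable part):

```python
def score(doc, query_terms, tf, idf, pagerank):
    """Sum TF*IDF over the query terms, then weight by PageRank."""
    tfidf = sum(tf[doc].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)
    return tfidf * pagerank.get(doc, 0.0)

def search(query_terms, candidates, tf, idf, pagerank):
    """Return the candidate documents sorted from best score to worst."""
    return sorted(
        candidates,
        key=lambda d: score(d, query_terms, tf, idf, pagerank),
        reverse=True,
    )
```

Swapping multiplication for a weighted sum, or capping PageRank's influence, is exactly the kind of tuning the text warns will take time.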
Your engine can now return sorted results whenever a client makes a
request to your server’s URL. As discussed in part 0, all this work is
being done within the server. The client makes the request, and
your server returns the results to the client in the expected
format.
From here, you can:
● Add new indexing metrics to your database to create
higher-quality results
● Optimize your queries to speed up query times
● Build new features in your engine
Congratulations, you can now build your own search
engine!