Building Your Own Search Engine From
Scratch
What Underlies Modern Search? Simple Concepts.
How many times do you search the web daily? 5, 20, 50? If Google is
your search engine of choice, you can review your full history of
searches in your Google account's activity page.
Despite how deeply search underlies our daily activities and
interaction with the world, few of us understand how it works. In this
post, I work to illuminate the underpinnings of search, drawing on the
experience of implementing a search engine modeled on the original
Google implementation.
First, we will look at a preliminary step: understanding web
servers. What is a client-server infrastructure? How does your
computer connect to a website?
You will get to see what happens when a search engine connects with
your computer, a website, or anything else.
Then, we will walk through the three core components of a search
engine: the crawler, the indexer, and the PageRank algorithm. Each of
these will give you a rich understanding of the connections that make
up the spider web that is the internet.
Finally, we will go through how those components are combined to
produce our final prize: the search engine! Ready to dive in?
Let’s go!
Part 0: Web Servers
The mighty web server! A web server is what your computer contacts
every time you enter a URL in your browser. Your browser acts as a
client, sending a request, similar to a business client. The server is the
sales representative who takes all those requests, processing them in
parallel.
The requests are text. The server knows how to read them, as it is
expecting them in a specific structure (the most common
protocol/structure now is HTTP/1.1).
A sample request:
GET /hello HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: www.sample-why-david-y.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: user=why-david-y
The request can carry arguments, such as the cookies listed above; it
may also have a body with more information. The response follows a
similar format, allowing the server to return headers and a body for
clients to read. With the internet becoming more interactive, much of
the hard work of generating content is done on the server.
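A sample response, with the status line first (the exact headers and
body here are illustrative):

HTTP/1.1 200 OK
Date: Mon, 27 Jul 2009 12:28:53 GMT
Server: Apache
Content-Type: text/html
Content-Length: 54
Connection: Keep-Alive

<html><body><h1>Hello, why-david-y!</h1></body></html>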
If you ever want to program a web server or client, there are many
libraries available to handle most of the parsing and low-level work.
Just keep in mind that client requests and server responses are just a
way of structuring text. This means we are all speaking a common
language!
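To make this concrete, here is a minimal sketch of a web server built
on Python's standard library. The port and response text are arbitrary
choices for illustration:

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Parsing the request line and headers is done for us by the library.
        body = b"Hello from a tiny web server!"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8000), Handler).serve_forever()

Point a browser at http://localhost:8000/ and it will act as the
client, sending a GET request much like the sample above.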
Part 1: Crawlers
If you are building a search engine, the crawler is where you spend a
good chunk of time. The crawler browses the open internet, starting
with a predefined list of seeds (e.g. Wikipedia.com, WSJ.com,
NYT.com). It will read each page, save it, and add new links to its
URL frontier, which is its queue of links to crawl.
Many domains also have a robots.txt file, such as
google.com/robots.txt. This file specifies rules the crawler must
respect to avoid legal trouble or being treated as a spammer. For
example, certain paths may be off-limits to crawlers, and there may be
a minimum time required between crawls.
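Python's standard library can parse these rules for you. A small
sketch, using a hypothetical user-agent name:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# False if the site disallows this path for our user agent.
print(rp.can_fetch("MyCrawler", "https://www.google.com/search"))
# The minimum delay between crawls, if the site specifies one (else None).
print(rp.crawl_delay("MyCrawler"))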
Why is so much time spent here? The internet is very much
unstructured, an anarchist’s dream. Sure, we may have some
norms that we agree on, but you will never realize how much
they get broken until you write a crawler.
For example, say your crawler reads HTML pages, since those have
structure. An author can still put non-links inside link tags,
breaking implicit logic in your program: emails (test@test.com),
plain sentences, and other text that your validation may miss.
What if a page looks different on every visit because it generates
dynamic content, such as the current time? What if page A redirects to
B, B redirects to C, and C redirects to A? What if a calendar has
countless links to future years or days?
These are some of many cases that can arise when crawling
millions of pages, and every edge case needs to be covered or
recoverable.
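To ground the discussion, here is a minimal crawler sketch using only
Python's standard library. It keeps a visited set (which also breaks
the A -> B -> C -> A redirect cycle above) and a capped page count
(which stops runaway calendars); robots.txt handling and politeness
delays are omitted for brevity:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # Collect hrefs; real crawlers must validate these, since link
        # tags can hold emails, fragments, or plain junk.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # the URL frontier: the queue of links to crawl
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue          # unreachable or malformed pages are skipped
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return pages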
Part 2: Indexing
Once you have the crawled content saved in a database, next comes
indexing! When a user searches a term, they want accurate results
quickly. This is why indexing is so important. You decide which
metrics matter most to you, then pull them from the crawled documents.
Here are some common ones, with a small sketch that computes them
after the list:
● Forward Index: This is a data structure holding a list of
documents with their associated words, in order. For
example:
document1, <word1, word2, word3>
document2, <word2, word3>
● Inverted Index: This is a data structure holding a list of
words, each mapped to the documents containing that word. For example:
word1, <document2, document3, document4>
word2, <document1, document2>
● Term Frequency (TF): This is a metric stored for each
unique word in each document. It is commonly calculated as
the number of occurrences of that word divided by the
number of words in a document, resulting in a value between
0 and 1. Some words may get weighted more heavily (e.g.
special tags) and the TF may be normalized, preventing
extreme values.
● Inverse Document Frequency (IDF): This is a metric
stored for each unique word. It is commonly calculated as the
total number of documents divided by the number of documents
containing that word, usually log-scaled. Given that it requires
the total document count, it is usually calculated after crawling
or at query time. It may be normalized to prevent extreme values.
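A small sketch of how these four structures can be computed from
crawled text (the two documents are toy examples):

import math
from collections import Counter, defaultdict

docs = {
    "document1": "earth day is every day",
    "document2": "happy earth day",
}

forward_index = {}                 # document -> its words, in order
inverted_index = defaultdict(set)  # word -> documents containing it
tf = {}                            # (document, word) -> term frequency

for doc, text in docs.items():
    words = text.split()
    forward_index[doc] = words
    for word, count in Counter(words).items():
        inverted_index[word].add(doc)
        tf[(doc, word)] = count / len(words)

# IDF: log of (total documents / documents containing the word).
n_docs = len(docs)
idf = {word: math.log(n_docs / len(ds)) for word, ds in inverted_index.items()}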
With these four values, you can design an indexer that
allows you to return accurate results. With the optimizations in
modern databases, the results will also be reasonably fast. Using
MongoDB, our project used these to return results in approximately 2
seconds, even for longer queries. You can do even more with just these
four metrics, such as allowing exact-match queries.
These were essential metrics used by search engines in the early days.
Now, search engines use these plus many more to fine-tune their
results further.
How do we combine these to generate results? We will discuss that in
the integration section.
Part 3: PageRank
PageRank is an algorithm that determines the authoritativeness of
a page on the internet. Say someone is searching “Earth Day.” We
need to look at how trustworthy a page is. If we do not, our search
engine could send them to a random blog page that says “Earth Day”
over and over, rather than a Wikipedia or EarthDay.org page. With
the prevalence of SEO and marketers attempting to drive
traffic to a page, how do we ensure users get quality results?
PageRank looks at links between pages, treating them as a directed
graph (a set of nodes and edges). Each edge is a connection between
two nodes, pointing from a source URL to a destination URL.
In each iteration, the algorithm looks at all URLs pointing to a page,
say Google.com. Each referrer passes Google a share of its own
PageRank, based on how many other URLs that referrer also points to.
After a few iterations, the PageRank values become relatively stable
and the algorithm terminates.
(Figure: example PageRank values for a small link graph. Source:
https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg)
There are other tricks used, like the random surfer model, which
assumes that some percentage of the time the user gets bored and jumps
to a random new page. These tricks aim to avoid corner cases in
PageRank. For example, sinks are pages with no outbound links that
would otherwise absorb all the PageRank.
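Here is a sketch of the iteration described above, with the random
surfer folded in as a damping factor (0.85 is the value from the
original paper) and sinks redistributing their rank evenly:

def pagerank(graph, damping=0.85, iterations=20):
    # graph maps each URL to the list of URLs it links to.
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}
    for _ in range(iterations):
        # The random-surfer share, spread evenly over all pages.
        new_ranks = {url: (1.0 - damping) / n for url in graph}
        for url, outlinks in graph.items():
            if outlinks:
                share = damping * ranks[url] / len(outlinks)
                for dest in outlinks:
                    if dest in new_ranks:  # ignore links outside our crawl
                        new_ranks[dest] += share
            else:
                # A sink: spread its rank across every page instead.
                for dest in new_ranks:
                    new_ranks[dest] += damping * ranks[url] / n
        ranks = new_ranks
    return ranks

For example, pagerank({"a.com": ["b.com"], "b.com": ["a.com", "c.com"],
"c.com": []}) gives b.com the highest rank: it receives all of a.com's
share, while a.com receives only half of b.com's.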
Putting It All Together
You have the main pieces for a search engine now.
When a user searches a phrase, you look up which documents have
each of the query terms in them. Your database returns documents
that match all terms.
For every document, you can take the TF-IDF (TF * IDF) of each query
term and sum them together. Then, combine the sum with that page's
PageRank (e.g. by multiplying them together). This is more an art
than a science, so leave time to see what works, fine-tuning as
you go.
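A sketch of this combination; candidate_docs, query_terms, and the
pagerank_scores mapping are assumed to come from the earlier steps:

def score(doc, query_terms, tf, idf, pagerank_scores):
    # Sum TF*IDF over the query terms, then weight by the page's PageRank.
    tfidf = sum(tf.get((doc, t), 0.0) * idf.get(t, 0.0) for t in query_terms)
    return tfidf * pagerank_scores.get(doc, 0.0)

ranked = sorted(candidate_docs,
                key=lambda d: score(d, query_terms, tf, idf, pagerank_scores),
                reverse=True)

Multiplying TF-IDF by PageRank is only one option; a weighted sum is
another common starting point.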
Your engine can now return sorted results whenever a client makes a
request to your server’s URL. As discussed in part 0, all this work is
being done within the server. The client makes the request, and
your server returns the results to the client in the expected
format.
From here, you can:
● Add new indexing metrics to your database to create
higher-quality results
● Optimize your queries to speed up query times
● Build new features in your engine
Congratulations, you can now build your own search
engine!