Building Your Own Search Engine From
Scratch
What Underlies Modern Search? Simple Concepts.
How many times do you search the web daily? 5, 20, 50? If Google is
your search engine of choice, you can look at your history of searches
here.
Despite how deeply search underlies our daily activities and
interactions with the world, few of us understand how it works. In this
post, I work to illuminate the underpinnings of search, drawing on my
experience implementing a search engine based on the original Google
design.
CLICK HERE FOR COMPLETE CODE
Photo by Benjamin Dada on Unsplash
First, we will look at a preliminary step: understanding web
servers. What is a client-server infrastructure? How does your
computer connect to a website?
You will get to see what happens when a search engine connects with
your computer, a website, or anything else.
Then, we will look through the three base parts of a search
engine: the crawler, indexer, and PageRank algorithm. Each of
these will give you a rich understanding of the connections that make
up the spider web that is the internet.
Finally, we will go through how those components are combined to
produce our final prize: the search engine! Ready to dive in?
Let’s go!
Part 0: Web Servers
The mighty web server! A web server is what your computer contacts
every time you enter a URL in your browser. Your browser acts as a
client, sending a request, much like a business client. The server is the
sales representative who takes all those requests, processing them in
parallel.
The requests are text. The server knows how to read them, as it is
expecting them in a specific structure (the most common
protocol/structure now is HTTP/1.1).
A sample request:
GET /hello HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: www.sample-why-david-y.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: user=why-david-y
The request can carry arguments, such as its list of cookies, and it
may have a body with more information. The response follows a similar
format, allowing the server to return headers and a body for clients
to read. As the internet has become more interactive, much of the hard
work of generating content is done on the server.
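To make that symmetry concrete, here is a short sketch that splits a raw HTTP/1.1 response into its status line, headers, and body. The response text is invented for illustration, not from a real server:

```python
# Parse a raw HTTP/1.1 response: status line, headers, blank line, body.
# The response text below is a made-up example.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 12\r\n"
    "\r\n"                      # blank line separates headers from body
    "<p>hello</p>"
)

head, _, body = raw.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)

print(status_line)               # HTTP/1.1 200 OK
print(headers["Content-Type"])   # text/html
print(body)                      # <p>hello</p>
```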
If you ever want to program a web server or client, there are many
libraries available to you to do most of the parsing and low-level work.
Just keep in mind that client requests and server responses are just a
way of structuring text. This means we are all speaking a common
language!
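As a sketch of that idea, the sample request shown earlier can be assembled with nothing but string formatting (the host and path are the made-up values from that example):

```python
# Build the raw text of an HTTP/1.1 GET request by hand. A request is
# just structured text; libraries only save you this formatting work.
def build_request(host: str, path: str = "/") -> str:
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept-Language: en-us\r\n"
        "Connection: close\r\n"
        "\r\n"  # blank line tells the server the headers are done
    )

print(build_request("www.sample-why-david-y.com", "/hello"))
```

Sending this text over a TCP connection to port 80 is essentially all a minimal client does.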
Part 1: Crawlers
If you are building a search engine, the crawler is where you spend a
good chunk of time. The crawler browses the open internet, starting
with a predefined list of seeds (e.g. Wikipedia.com, WSJ.com,
NYT.com). It will read each page, save it, and add new links to its
URL frontier, which is its queue of links to crawl.
Photo by Kevin Grieve on Unsplash
Many domains also have a robots.txt file, such as
google.com/robots.txt. This file specifies rules the crawler must
respect to avoid overloading the site or being treated as a spammer.
For example, certain paths cannot be crawled, and there may be a
minimum time between each crawl.
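Python's standard library can parse these rules for you. The robots.txt content below is invented to show both a disallowed path and a crawl delay:

```python
from urllib import robotparser

# Check invented robots.txt rules with the stdlib parser before crawling.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("mybot", "https://example.com/private/page"))  # False
print(rp.can_fetch("mybot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("mybot"))                                    # 10
```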
Why is so much time spent here? The internet is very much
unstructured, an anarchist’s dream. Sure, we may have some
norms that we agree on, but you will never realize how much
they get broken until you write a crawler.
For example, say your crawler reads HTML pages, as those have
structure. The author could still, for example, put non-links in link
tags that break some implicit logic in your program. There could be
emails (test@test.com), sentences, and other text that your
verification may miss.
What if you are crawling a page that generates dynamic content and
looks different on every visit, for example by including the current
time? What if page A redirects to B, B redirects to C, and C redirects
to A? What if a calendar has countless links to future years or days?
These are some of many cases that can arise when crawling
millions of pages, and every edge case needs to be covered or
recoverable.
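The crawl loop itself is simple; the difficulty lives in the edge cases above. Here is a minimal sketch over a simulated link graph, with invented domains and the fetch/parse step stubbed out:

```python
from collections import deque

# A simulated link graph standing in for the web (no real network calls).
# In a real crawler, you would fetch each URL and extract its links.
LINKS = {
    "wikipedia.com": ["nyt.com", "wsj.com"],
    "nyt.com": ["wikipedia.com"],
    "wsj.com": ["nyt.com", "calendar.com/2025"],
    "calendar.com/2025": ["calendar.com/2026"],  # a near-endless calendar
}

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)   # the URL frontier: queue of links to visit
    seen = set(seeds)         # dedup guards against loops like A -> B -> C -> A
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)   # in practice: fetch, parse, save to the database
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl(["wikipedia.com"]))
```

The `max_pages` cap is one simple recovery mechanism against traps like the endless calendar.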
Part 2: Indexing
Once you have the crawled content saved in a database, next comes
indexing! When a user searches a term, they want accurate results
quickly. This is where indexing is so important. You decide what
metrics matter the most to you, then you pull them from the crawled
document. Here are some common ones:
● Forward Index: This is a data structure holding a list of
documents with their associated words, in order. For
example:
document1, <word1, word2, word3>
document2, <word2, word3>
● Inverted Index: This is a data structure holding a list of
words, each mapped to the documents containing that word. For example:
word1, <document2, document3, document4>
word2, <document1, document2>
● Term Frequency (TF): This is a metric stored for each
unique word in each document. It is commonly calculated as
the number of occurrences of that word divided by the
number of words in a document, resulting in a value between
0 and 1. Some words may get weighted more heavily (e.g.
special tags) and the TF may be normalized, preventing
extreme values.
● Inverse Document Frequency (IDF): This is a metric
stored for each unique word. It is commonly calculated as the
logarithm of the total number of documents divided by the
number of documents containing that word, so rarer words score
higher. Given that it requires the total number of documents, it is
usually calculated after crawling or at query time. It may be
normalized to prevent extreme values.
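All four structures can be sketched in a few lines of Python. The two tokenized documents are invented, and IDF uses the common logarithmic form:

```python
import math
from collections import defaultdict

# Two tiny, already-tokenized documents, invented for illustration.
docs = {
    "doc1": ["earth", "day", "earth"],
    "doc2": ["earth", "orbit"],
}

# Forward index: document -> its words, in order.
forward = {doc: list(words) for doc, words in docs.items()}

# Inverted index: word -> set of documents containing it.
inverted = defaultdict(set)
for doc, words in docs.items():
    for word in words:
        inverted[word].add(doc)

# TF: occurrences of the word divided by the document's length.
tf = {doc: {w: words.count(w) / len(words) for w in set(words)}
      for doc, words in docs.items()}

# IDF: log(total documents / documents containing the word).
n_docs = len(docs)
idf = {w: math.log(n_docs / len(d)) for w, d in inverted.items()}
```

Note that "earth" appears in both documents, so its IDF is log(2/2) = 0 and it contributes nothing to ranking, which is exactly what you want for ubiquitous words.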
With these four values, you can design an indexer that
allows you to return accurate results. With the optimization of
current databases, the results will also be reasonably fast. Our
MongoDB-backed project used these to return results in approximately 2
seconds, even for longer queries. You can do even more with just these
four metrics — for example, allowing exact-match queries.
These were essential metrics used by search engines in the early days.
Now, search engines use these plus many more to fine-tune their
results further.
How do we combine these to generate results? We will discuss that in
the integration section.
Part 3: PageRank
PageRank is an algorithm that determines the authoritativeness of
a page on the internet. Say someone is searching “Earth Day.” We
need to look at how trustworthy a page is. If we do not, our search
engine could send them to a random blog page that says “Earth Day”
over and over, rather than a Wikipedia or EarthDay.org page. With
the prevalence of SEO and marketers attempting to drive
traffic to a page, how do we ensure users get quality results?
PageRank looks at links between pages, treating them as a graph (a set
of nodes and edges). Each edge is a connection between two nodes
in the direction that it points (source URL to destination URL).
In each iteration, the algorithm looks at all URLs pointing to a page,
say Google.com. It gives Google some percentage of its referrers’
PageRank, based on how many other URLs those pages also point to.
After a few iterations, the PageRank values are relatively stable and
the algorithm terminates.
Source: https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg
There are other tricks used, like a random surfer, which assumes that
some percentage of the time the user gets bored and clicks to a new
page. These tricks aim to avoid corner cases with PageRank. For
example, sinks are pages that can absorb all the PageRank due to
having no outbound links.
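The iteration can be sketched as follows, with a damping factor standing in for the random surfer. The three-page graph is invented; 0.85 is the damping value used in the original PageRank paper:

```python
# Iterative PageRank with damping factor d: with probability 1-d the
# random surfer jumps to a uniformly random page instead of a link.
def pagerank(graph, d=0.85, iterations=30):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:
                # Split this page's rank among the pages it points to.
                share = rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += d * share
            else:
                # Sink page with no outbound links: spread rank everywhere.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C collects the most inbound rank
```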
Putting It All Together
You have the main pieces for a search engine now.
When a user searches a phrase, you look up which documents have
each of the query terms in them. Your database returns documents
that match all terms.
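That "match all terms" lookup is a set intersection over the inverted index. The posting lists here are invented for illustration:

```python
# Intersect posting lists from the inverted index to find documents
# containing every query term (all values invented for illustration).
inverted = {
    "earth": {"doc1", "doc2", "doc3"},
    "day": {"doc1", "doc3"},
}

query = ["earth", "day"]
matches = set.intersection(*(inverted[t] for t in query))
print(sorted(matches))  # ['doc1', 'doc3']
```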
For every document, you can take the TF-IDF (TF * IDF) for each query
term and sum them together. Then, combine the sum with that page’s
PageRank (e.g. by multiplying them together). This is more an art
than a science, so leave time to see what works, fine-tuning as
you go.
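One possible combination looks like this, with made-up TF, IDF, and PageRank values. Multiplying the TF-IDF sum by PageRank is just one choice; a weighted sum is another:

```python
# Score each document: sum TF*IDF over the query terms, then fold in
# PageRank. All index values below are invented for illustration.
tf = {"doc1": {"earth": 0.4, "day": 0.2}, "doc2": {"earth": 0.1, "day": 0.3}}
idf = {"earth": 0.3, "day": 0.7}
pagerank = {"doc1": 0.6, "doc2": 0.4}

def score(doc, query_terms):
    tfidf = sum(tf[doc].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)
    return tfidf * pagerank[doc]

query = ["earth", "day"]
results = sorted(tf, key=lambda d: score(d, query), reverse=True)
print(results)  # ['doc1', 'doc2']
```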
Your engine can now return sorted results whenever a client makes a
request to your server’s URL. As discussed in part 0, all this work is
being done within the server. The client makes the request, and
your server returns the results to the client in the expected
format.
From here, you can:
● Add new indexing metrics to your database to create
higher-quality results
● Optimize your queries to speed up query times
● Build new features in your engine
Congratulations, you can now build your own search
engine!
