This document discusses building a custom search engine that incorporates location-based results and auto-suggestions, "Did You Mean" functionality, and domain-specific weighting of key terms. It describes modules for a custom crawler, Google Analytics integration, domain-specific data integration, and Elasticsearch-based search. Challenges in building the crawler include concurrency issues during large crawls and preventing memory leaks. The overall goal is a search engine that goes beyond existing options by incorporating additional contextual data sources.
2. Agenda
● Architecture And Modules
● Custom Crawler
● Google Analytics Integration
● Google NLP
● Google CSE
● Domain Specific Data Integration
● Elasticsearch Search Capabilities
● Challenges
● Future Scope And Improvements
3. Problem Statement
Build a search engine that...
● Gets you relevant results and auto-suggestions.
● Puts popular results on top.
● Shows auto-suggestions based on location, e.g. users searching in Japan and in the USA should see different results.
● Has `Did You Mean` functionality.
● Is built so that weights can be configured for key terms depending on the domain.
6. Build Your Own Custom Crawler
● Instead of reinventing the wheel, use existing libraries such as Crawler4j, Jsoup, etc.
● Keep concurrency in mind, otherwise the process may take forever.
● After fetching the HTML, extract the title, meta tags, and sections (elements within heading tags) so that search can be performed on them by priority.
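The extraction step above can be sketched as follows. This is a minimal stdlib-only illustration using regex; a production crawler would use Jsoup's DOM parser instead, and the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: pulls the title, meta description, and heading
// sections out of raw HTML so each can later be indexed with a
// different priority. Regex is fine for a sketch only; prefer Jsoup.
public class PageExtractor {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern META_DESC =
        Pattern.compile("<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern HEADING =
        Pattern.compile("<h[1-3][^>]*>(.*?)</h[1-3]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static String extractMetaDescription(String html) {
        Matcher m = META_DESC.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static List<String> extractHeadings(String html) {
        List<String> headings = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) headings.add(m.group(1).trim());
        return headings;
    }
}
```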
7. Custom Crawler Features
● Takes seed domain URLs and crawls all the pages within those domains.
● URLs that should not be crawled can be specified.
● A regex can be specified if there are multiple pages that have to be excluded.
● Crawling of PDF files can be configured.
● Takes robots.txt into consideration.
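The scoping rules above can be sketched as a single URL filter. The class and its configuration are hypothetical, not the deck's actual implementation:

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch of the crawl-scope rules: stay inside the seed
// domain, honour exclusion regexes, and optionally skip PDF files.
public class CrawlFilter {
    private final String seedHost;
    private final List<Pattern> exclusions;
    private final boolean crawlPdfs;

    public CrawlFilter(String seedHost, List<Pattern> exclusions, boolean crawlPdfs) {
        this.seedHost = seedHost;
        this.exclusions = exclusions;
        this.crawlPdfs = crawlPdfs;
    }

    public boolean shouldCrawl(String url) {
        URI uri = URI.create(url);
        if (!seedHost.equalsIgnoreCase(uri.getHost())) return false;  // outside seed domain
        if (!crawlPdfs && uri.getPath() != null
                && uri.getPath().toLowerCase().endsWith(".pdf")) return false;  // PDFs disabled
        for (Pattern p : exclusions) {
            if (p.matcher(url).find()) return false;  // matches an exclusion rule
        }
        return true;
    }
}
```

A real crawler would apply robots.txt rules at this same decision point.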
8. Custom Crawler Libraries
● Crawler4j: An open-source web crawler for Java that provides a simple interface for crawling the web in a multi-threaded manner.
● Jsoup: A Java library that provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
● Boilerpipe: Provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
10. Why Google Analytics
● To find analytics such as top hits, exits, page views, feedback, etc., grouped by location. This helps identify the popular pages in a particular location.
● To display the most frequently searched terms in a particular location.
● To show trending terms.
11. How To Integrate GA
● Google provides a Java SDK that helps you fetch the details using an API.
● Currently we also store details such as city, country, browser, operating system, etc. against a particular URL.
● We fetch three types of analytics: page analytics, search-term analytics, and feedback analytics.
Isn't that easy......
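Once fetched, the per-URL metrics have to be stored grouped by location so that popularity can differ per country. A minimal sketch of such a store, with hypothetical class and method names:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: page-view counts from GA, keyed by URL and then
// by country, so location-aware ranking can look up local popularity.
public class AnalyticsStore {
    // url -> country -> accumulated page views
    private final Map<String, Map<String, Long>> viewsByLocation = new HashMap<>();

    public void record(String url, String country, long pageViews) {
        viewsByLocation
            .computeIfAbsent(url, u -> new HashMap<>())
            .merge(country, pageViews, Long::sum);  // sum repeated GA reports
    }

    /** Page views for a URL in one country; 0 if never seen. */
    public long views(String url, String country) {
        return viewsByLocation
            .getOrDefault(url, Map.of())
            .getOrDefault(country, 0L);
    }
}
```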
13. Why We Need Domain-Specific Data
● Every domain has some data that is relevant only to it. Since the idea behind this custom search engine is to crawl websites of a particular domain (and not the whole web), we need to prepare some data that is specific to that domain.
16. Why Data Aggregation?
● Now that we have raw data from all the different sources (the website, Google Analytics, and domain-specific content), we need to process and aggregate it into a form that can be made searchable.
● We have already captured page content data (using the crawler) and analytics information (using GA). Now it is time to merge the two and calculate the ranking of a particular URL.
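The merge step can be sketched as a scoring function that combines content relevance from the crawl with popularity from GA. The 0.7/0.3 weights and the names below are assumptions for illustration, not the deck's actual formula:

```java
// Illustrative sketch of the aggregation step: one ranking score per
// URL, blending crawler-derived relevance with GA-derived popularity.
public class PageRanker {
    public static double score(long pageViews, long maxPageViews, double contentRelevance) {
        // Normalise popularity to [0, 1] so it is comparable to relevance.
        double popularity = maxPageViews == 0 ? 0.0 : (double) pageViews / maxPageViews;
        // Assumed weighting: relevance dominates, popularity breaks ties.
        return 0.7 * contentRelevance + 0.3 * popularity;
    }
}
```

In practice the score would be computed per location, using the country-grouped analytics described earlier.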
19. Available Options for Search
● Google CSE: A platform provided by Google that allows web developers to feature specialized information in web searches, refine and categorize queries, and create customized search engines based on Google Search.
● Elastic Swiftype: A fast, flexible search solution that helps you surface your website's most relevant content to your audience, customers, or users.
Both of the above options provide searching, but neither takes care of location-specific search and suggestions, analytics, or domain-specific information.
20. How Do We Search
● We make use of the fast searching capabilities of Elasticsearch to perform the search operation.
● We use important ES features such as function scoring, fuzzy queries, cross-fields matching, aggregations, nGram analyzers, etc.
● We maintain a synonym file where we provide synonyms and abbreviations for the key terms.
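As an illustrative sketch (not the deck's actual query), a search combining function scoring with a fuzzy multi-field match could look like this. The field names (`title`, `headings`, `content`) and the `popularity` field holding the analytics-derived rank are assumptions about the index mapping:

```json
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "serch engine",
          "fields": ["title^3", "headings^2", "content"],
          "fuzziness": "AUTO"
        }
      },
      "field_value_factor": {
        "field": "popularity",
        "modifier": "log1p",
        "missing": 0
      }
    }
  }
}
```

Here `fuzziness` tolerates misspelled queries and `field_value_factor` boosts pages that analytics marks as popular. Note that the `cross_fields` match type does not support fuzziness, so this sketch uses the default `best_fields` type.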
22. Challenges Faced
Crawler Module
● Concurrency issues while crawling large amounts of data: experiment to find the number of threads that keeps the process fast but does not cause 'Out Of Memory' errors.
● Memory leaks while crawling: make sure you close all streams properly, and use StringBuilder wherever possible.
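Both fixes can be sketched in a few lines: a bounded thread pool caps concurrency, try-with-resources guarantees streams are closed, and StringBuilder avoids repeated string copies. Names and the pool size here are assumptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the crawler hardening described above.
public class CrawlerWorkers {
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources guarantees the stream is closed, even on error
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');  // StringBuilder avoids O(n^2) concatenation
            }
        }
        return sb.toString();
    }

    public static void crawlAll(List<Runnable> tasks, int threads) throws InterruptedException {
        // A fixed pool caps concurrency: tune `threads` until the crawl is
        // fast but stays clear of OutOfMemoryError.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        tasks.forEach(pool::submit);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```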