AI-Based Custom Search Engine with Location-Specific Results
Agenda
● Architecture And Modules
● Custom Crawler
● Google Analytics Integration
● Google NLP
● Google CSE
● Domain Specific Data Integration
● Elasticsearch Search Capabilities
● Challenges
● Future Scope And Improvements
Problem Statement
Build a search engine that:
● Returns relevant results and auto-suggestions.
● Ranks popular results at the top.
● Shows auto-suggestions based on location. For example, users
searching from Japan and from the USA should see different results.
● Has `Did you mean` functionality.
● Lets weights be configured for key terms on a per-domain basis.
Architecture
Build Your Own Custom Crawler
● Instead of reinventing the wheel, use existing libraries such as
Crawler4j and Jsoup.
● Keep concurrency in mind; otherwise a crawl of a large site may
take very long.
● After fetching the HTML, extract the title, meta tags, and sections
(the elements within heading tags) so that search can weight these
fields by priority.
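The extraction step above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the Jsoup-based code of the original project; the class name `PageFieldExtractor` and the returned dictionary shape are assumptions made for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageFieldExtractor(HTMLParser):
    """Collects the title, named meta tags, and heading text of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.headings = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") and attr.get("content"):
                self.meta[attr["name"].lower()] = attr["content"]
        elif tag == "title" or tag in HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace left over from the markup.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif text:
            self.headings.append((tag, text))
        self._current = None


def extract_fields(html):
    """Return the searchable fields of a page as a plain dictionary."""
    parser = PageFieldExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "meta": parser.meta, "headings": parser.headings}
```

Keeping the heading level next to each heading text lets a later indexing step give, for example, `h1` text more weight than `h3` text.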
Custom Crawler Features
● Takes seed domain URLs and crawls all the pages within each
domain.
● URLs that must not be crawled can be specified.
● A regex can be specified when multiple pages have to be
excluded.
● Crawling of PDF files can be configured.
● Takes robots.txt into consideration.
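The features listed above amount to a per-URL decision about whether to crawl. The sketch below shows one way to express that decision in Python with the standard library; it is not the Crawler4j configuration used in the project. The class name `CrawlFilter`, its parameters, and the choice to pass robots.txt as already-downloaded lines are assumptions for illustration.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlFilter:
    """Decides whether a URL should be crawled (hypothetical helper)."""

    def __init__(self, seed_url, excluded_urls=(), excluded_patterns=(),
                 robots_lines=(), crawl_pdfs=False, user_agent="*"):
        self.seed_host = urlparse(seed_url).netloc.lower()
        self.excluded_urls = set(excluded_urls)
        self.excluded_patterns = [re.compile(p) for p in excluded_patterns]
        self.crawl_pdfs = crawl_pdfs
        self.user_agent = user_agent
        # robots.txt content is supplied as lines, so no network call is made here.
        self.robots = RobotFileParser()
        self.robots.parse(list(robots_lines))

    def should_visit(self, url):
        parsed = urlparse(url)
        if parsed.netloc.lower() != self.seed_host:
            return False  # stay within the seed domain
        if url in self.excluded_urls:
            return False  # explicitly excluded URL
        if any(p.search(url) for p in self.excluded_patterns):
            return False  # excluded by regex
        if parsed.path.lower().endswith(".pdf") and not self.crawl_pdfs:
            return False  # PDF crawling is switched off
        return self.robots.can_fetch(self.user_agent, url)
```

A crawler would call `should_visit` on every discovered link before queueing it, so the cheap checks run first and the robots.txt rule is consulted last.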
Custom Crawler Libraries
● Crawler4j: An open-source web crawler for Java that provides
a simple interface for crawling the web in a multi-threaded
manner.
● Jsoup: A Java library that provides a convenient API for
extracting and manipulating data, using the best of DOM, CSS,
and jQuery-like methods.
● BoilerPipe: Provides algorithms to detect and remove the
surplus "clutter" (boilerplate, templates) around the main textual
content of a web page.
Google Analytics Integration Module
Why Google Analytics
● To collect metrics such as top hits, exits, page views, and feedback,
grouped by location. This helps identify the pages that are popular
in a particular location.
● To display the most frequently searched terms in a particular location.
● To show trending terms.
How To Integrate GA
● Google provides a Java SDK that fetches these details through
an API.
● Currently we also store details such as city, country, browser,
and operating system against each URL.
● We fetch three types of analytics: page analytics, search term
analytics, and feedback analytics.
Isn't that easy ......
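Once analytics rows are stored per URL and location, finding the popular pages for a location is a grouping step. The sketch below assumes each stored row is a dictionary with `url`, `country`, and `pageViews` keys; that row shape is an assumption for illustration and is not the response format of the Google Analytics SDK.

```python
from collections import defaultdict


def top_pages_by_country(rows, limit=3):
    """Return the most viewed URLs per country, most viewed first."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        totals[row["country"]][row["url"]] += row["pageViews"]
    return {
        # Sort by descending views; ties are broken by URL for stable output.
        country: sorted(pages, key=lambda url: (-pages[url], url))[:limit]
        for country, pages in totals.items()
    }
```

The same grouping, keyed on search term instead of URL, would give the most frequently searched terms per location.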
Domain Specific Content
Why We Need Domain-Specific Data
● Every domain has some data that is relevant only to it. Since
the idea behind this custom search engine is to crawl the websites of a
particular domain (and not the whole web), we need to prepare
some data that is specific to that domain.
Domain Specific Data Format
[{
"type": "Key Term",
"value": "FDI",
"weight": 15.0
},
{
"type": "Key Term",
"value": "Taxation",
"weight": 15.0
},
{
"type": "State",
"value": "Gujarat",
"weight": 5.0
}]
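A file in this format could be loaded into a lookup table and used to boost queries that mention configured terms. The sketch below is one possible reading of how the weights might be applied; the function names and the rule of summing the weights of matching single-word terms are assumptions, since the slides do not say how the weights enter the ranking.

```python
import json


def load_term_weights(raw_json):
    """Map each configured value (lower-cased) to its weight."""
    entries = json.loads(raw_json)
    return {entry["value"].lower(): float(entry["weight"]) for entry in entries}


def query_boost(query, weights):
    """Sum the weights of every configured term that appears in the query.

    Only single-word terms are matched; multi-word terms would need
    phrase matching, which this sketch leaves out.
    """
    tokens = query.lower().split()
    return sum(weights.get(token, 0.0) for token in tokens)
```

With the data shown above, a query containing both "FDI" and "Gujarat" would receive a boost of 20.0, while a query with no configured term would receive none.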
Data Aggregation
Why Data Aggregation?
● We now have raw data from several sources: the website,
Google Analytics, and the domain-specific content. We need to process
and aggregate it into a form that can be made searchable.
● We have already captured page content for each URL (using the
crawler) and analytics information (using GA). Now it's
time to merge the two and calculate the ranking of each
URL.
Page Ranking
● Page Score
(pageViews - exits) / (pageViews + exits)
● Feedback Score
(pageViews + positiveFeedbackCount) / (pageViews +
negativeFeedbackCount)
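The two formulas translate directly into code. In the sketch below, returning 0.0 when a denominator is zero is an assumption made here so that pages without any analytics data do not raise an error; the slides do not state how that case is handled.

```python
def page_score(page_views, exits):
    """(pageViews - exits) / (pageViews + exits), or 0.0 with no data."""
    total = page_views + exits
    return (page_views - exits) / total if total else 0.0


def feedback_score(page_views, positive_feedback, negative_feedback):
    """(pageViews + positive) / (pageViews + negative), or 0.0 with no data."""
    denominator = page_views + negative_feedback
    return (page_views + positive_feedback) / denominator if denominator else 0.0
```

For non-negative counts the page score lies between -1 and 1, and the feedback score is above 1 when positive feedback outnumbers negative feedback.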
Finally The Search Module
Available Options for Search
● Google CSE: A platform provided by Google that allows web
developers to feature specialized information in web searches,
refine and categorize queries, and create customized search
engines based on Google Search.
● Elastic Swiftype: A fast, flexible search solution that helps you
surface your website’s most relevant content to your audience,
customers, or users.
Both options provide search, but neither takes care of
location-specific search and suggestions, analytics, or
domain-specific information.
How Do We Search
● We use the fast search capabilities of Elasticsearch to
perform the search operation.
● We use important ES features such as function score queries, fuzzy
queries, cross-fields matching, aggregations, and nGram analyzers.
● We maintain a synonym file that lists synonyms and
abbreviations of the key terms.
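To make the combination of these features concrete, the sketch below builds an Elasticsearch request body as a plain Python dictionary; it does not contact a cluster. The index field names (`title`, `headings`, `meta_description`, `body`, `page_score`, `popular_in`), the boosts, and the weights are all hypothetical and are not taken from the original project. It also assumes `page_score` is stored as a non-negative number, since the `log1p` modifier does not accept negative values.

```python
def build_search_body(text, country=None, size=10):
    """Build a function_score query that mixes text relevance and popularity."""
    text_query = {
        "bool": {
            "should": [
                # Cross-fields: the query terms may be spread over several fields.
                {"multi_match": {
                    "query": text,
                    "type": "cross_fields",
                    "fields": ["title^3", "headings^2", "meta_description", "body"],
                    "operator": "and",
                }},
                # Fuzzy match tolerates small typos; it is a separate clause
                # because fuzziness cannot be combined with cross_fields.
                {"multi_match": {
                    "query": text,
                    "fields": ["title", "headings"],
                    "fuzziness": "AUTO",
                }},
            ],
            "minimum_should_match": 1,
        }
    }
    functions = [
        # Popularity signal computed during data aggregation (assumed field).
        {"field_value_factor": {
            "field": "page_score", "factor": 1.0,
            "modifier": "log1p", "missing": 0,
        }},
    ]
    if country:
        # Extra weight for pages popular in the user's country (assumed field).
        functions.append({"filter": {"term": {"popular_in": country}}, "weight": 2.0})
    return {
        "size": size,
        "query": {"function_score": {
            "query": text_query,
            "functions": functions,
            "score_mode": "sum",
            "boost_mode": "sum",
        }},
    }
```

Using `"boost_mode": "sum"` adds the popularity signal to the text score rather than multiplying by it, so a page with no analytics data can still be found on text relevance alone.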
Search Flow
Challenges Faced
Crawler Module
● Concurrency issues while crawling a large amount of data:
Use trial and error to find a thread count that keeps the
process fast but does not cause an 'Out Of Memory'
error.
● Memory leaks while crawling: Make sure you close all
streams properly. Use StringBuilder wherever possible.
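Both points can be illustrated with a small sketch in Python rather than the Java of the original crawler. The thread count is a single tunable parameter, each stream is closed by a `with` block, and the text is assembled with one `join` call, which plays the role StringBuilder plays in Java. The function names and the `open_stream` callback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_page(open_stream, url):
    """Read one page; the with-block closes the stream even on errors."""
    with open_stream(url) as stream:
        # join() builds the text once instead of repeated concatenation.
        return "".join(stream)


def crawl_all(urls, fetch, max_workers=8):
    """Fetch every URL with at most max_workers threads running at once."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page must not stop the crawl
                errors[url] = exc
    return results, errors
```

Tuning then means re-running the crawl with different `max_workers` values and watching both the total time and the memory use.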
