SlideShare a Scribd company logo
How to build a small
distributed search
engine using open
source software
Building a distributed search engine
Search engine subsytems:
●

Page database

●

List of the pages to retrieve

●

Pages retrieval and save

●

Page content parsing

●

Full-text indexing of the contents

●

Graph database of the links for ranking
Building a distributed search engine

Open Source Software
•

Apache Hadoop
•
•
•

•

MapReduce
HDFS
HBase

Apache Lucene
Building a distributed search engine

HDFS
Hadoop Distributed File System
Building a distributed search engine

HDFS – Assumptions and goals
●

Hardware failure

●

Big data

●

Write once / read many

●

Moving computation, not data
Building a distributed search engine
Building a distributed search engine
Building a distributed search engine

Lucene
Building a distributed search engine
Lucene - Inverse Indexing
Term

Doc Id

Weight

JUG
301
198
120

0.97
0.65
0.43

301
278
451
103
763

0.94
0.15
0.87
0.45
0.77

Lugano
Building a distributed search engine
Lucene - Indexing main classes


IndexWriter



Directory



Analyzer



Document



Field
Building a distributed search engine
Lucene - Searching main classes


IndexSearcher



Collector



Query



TopDocs



ScoreDoc
Building a distributed search engine

Lucene - Analyzers






StopWords

”the book is on the table” → [book, table]
Stemming

[paint, paints, painted, …] → paint
Synonims

[cat, feline] → cat
Building a distributed search engine

Lucene - Search options


Fields





Wildcards





Title: JUG
body: ”JUG Lugano”
J?G → [JUG, JAG, ...]
J*G →[JUG, JEEG, JUNG, …]

Fuzzy (basata su vocabolario)


JUG~[n] → [MUG, JAG, …]
Building a distributed search engine

Lucene - Search options


Range





Boost





JUG^5 Lugano
”JUG Lugano”^5

Proximity




Year: [2002 TO 2012]
Name: {Alberto TO Andrea}

”JUG Lugano”~5

Boolean and existance


AND, OR, NOT, (), +, -
Building a distributed search engine

HDFS - Lucene Integration


File copy from/to HDFS



Patch IndexWriter/Director
IndexWriter/Directory



Rewrite of IndexWriter on RAM



Lucene 4
Building a distributed search engine

And now...
Hands on!

More Related Content

What's hot

Hadoop
HadoopHadoop
Hadoop
avnishagr
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
MongoDB
 
10 mongo db
10 mongo db10 mongo db
10 mongo db
Ahmed Elbassel
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
HBaseCon
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
Nilesh Gule
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainings
Geek Trainings
 
Google hacks himesh
Google hacks himeshGoogle hacks himesh
Google hacks himesh
Shamika Dharmasiri
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuning
NIKHIL DUBEY
 
Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020
ILRI
 
Hadoop-BigData
Hadoop-BigDataHadoop-BigData
Hadoop-BigData
Gigin Krishnan
 
Big Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in ChandigarhBig Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in Chandigarh
Big Boxx Animation Academy
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
CAMMS
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Rukmani Gopalan
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
HBaseCon
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
MongoDB
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
MongoDB
 
Jinchao demo
Jinchao demoJinchao demo
Jinchao demo
Jinchao Lin
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik Kornas
AEM HUB
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
Sufi Nawaz
 

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
10 mongo db
10 mongo db10 mongo db
10 mongo db
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainings
 
Google hacks himesh
Google hacks himeshGoogle hacks himesh
Google hacks himesh
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuning
 
Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020
 
Hadoop-BigData
Hadoop-BigDataHadoop-BigData
Hadoop-BigData
 
Big Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in ChandigarhBig Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in Chandigarh
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
Jinchao demo
Jinchao demoJinchao demo
Jinchao demo
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik Kornas
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 

Viewers also liked

探索 Everything 背后的技术
探索 Everything 背后的技术探索 Everything 背后的技术
探索 Everything 背后的技术yiwenshengmei
 
Xapian vs sphinx
Xapian vs sphinxXapian vs sphinx
Xapian vs sphinx
panjunyong
 
Oracle Text 12c New Features
Oracle Text 12c New FeaturesOracle Text 12c New Features
Oracle Text 12c New Features
Ulrike Schwinn
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search engines
Richard Boulton
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
Andrea Iacono
 
document1-2 FINAL-FINALLL
document1-2 FINAL-FINALLLdocument1-2 FINAL-FINALLL
document1-2 FINAL-FINALLL
Kathyayani Padira
 
Oracle Application Express
Oracle Application ExpressOracle Application Express
Oracle Application Express
HBoone
 
Oracle Text in APEX
Oracle Text in APEXOracle Text in APEX
Oracle Text in APEX
Scott Wesley
 
Oracle application express ppt
Oracle application express pptOracle application express ppt
Oracle application express ppt
Abhinaw Kumar
 
Production Support
Production SupportProduction Support
Production Support
r_shanki
 
Amazon search test case document
Amazon search test case documentAmazon search test case document
Amazon search test case document
Sunil Kumar Gunasekaran
 
SEO Proposal Sample
SEO Proposal SampleSEO Proposal Sample
SEO Proposal Sample
Ian Lurie
 
IC Engine PPt
IC Engine PPtIC Engine PPt
IC Engine PPt
Rocker ʌvinʌsh
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoop
datasalt
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
Leslie Samuel
 

Viewers also liked (15)

探索 Everything 背后的技术
探索 Everything 背后的技术探索 Everything 背后的技术
探索 Everything 背后的技术
 
Xapian vs sphinx
Xapian vs sphinxXapian vs sphinx
Xapian vs sphinx
 
Oracle Text 12c New Features
Oracle Text 12c New FeaturesOracle Text 12c New Features
Oracle Text 12c New Features
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search engines
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
document1-2 FINAL-FINALLL
document1-2 FINAL-FINALLLdocument1-2 FINAL-FINALLL
document1-2 FINAL-FINALLL
 
Oracle Application Express
Oracle Application ExpressOracle Application Express
Oracle Application Express
 
Oracle Text in APEX
Oracle Text in APEXOracle Text in APEX
Oracle Text in APEX
 
Oracle application express ppt
Oracle application express pptOracle application express ppt
Oracle application express ppt
 
Production Support
Production SupportProduction Support
Production Support
 
Amazon search test case document
Amazon search test case documentAmazon search test case document
Amazon search test case document
 
SEO Proposal Sample
SEO Proposal SampleSEO Proposal Sample
SEO Proposal Sample
 
IC Engine PPt
IC Engine PPtIC Engine PPt
IC Engine PPt
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoop
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Similar to How to build_a_search_engine

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
Steve Watt
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Prashanth Yennampelli
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
Venkatesh Narayanan
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
sheetal sharma
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
Humoyun Ahmedov
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop framework
keval dalasaniya
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
Swapnil & Patil
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
MSAdvAnalytics
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
Krisshhna Daasaarii
 
Hadoop
HadoopHadoop
Hadoop
chandinisanz
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
Imviplav
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search Engine
Xtremcoin and Geliyoo
 

Similar to How to build_a_search_engine (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop framework
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Apache drill
Apache drillApache drill
Apache drill
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Hadoop
HadoopHadoop
Hadoop
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search Engine
 

Recently uploaded

GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 

Recently uploaded (20)

GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 

How to build_a_search_engine

  • 1. How to build a small distributed search engine using open source software
  • 2. Building a distributed search engine Search engine subsytems: ● Page database ● List of the pages to retrieve ● Pages retrieval and save ● Page content parsing ● Full-text indexing of the contents ● Graph database of the links for ranking
  • 3. Building a distributed search engine Open Source Software • Apache Hadoop • • • • MapReduce HDFS HBase Apache Lucene
  • 4. Building a distributed search engine HDFS Hadoop Distributed File System
  • 5. Building a distributed search engine HDFS – Assumptions and goals ● Hardware failure ● Big data ● Write once / read many ● Moving computation, not data
  • 6. Building a distributed search engine
  • 7. Building a distributed search engine
  • 8. Building a distributed search engine Lucene
  • 9. Building a distributed search engine Lucene - Inverse Indexing Term Doc Id Weight JUG 301 198 120 0.97 0.65 0.43 301 278 451 103 763 0.94 0.15 0.87 0.45 0.77 Lugano
  • 10. Building a distributed search engine Lucene - Indexing main classes  IndexWriter  Directory  Analyzer  Document  Field
  • 11. Building a distributed search engine Lucene - Searching main classes  IndexSearcher  Collector  Query  TopDocs  ScoreDoc
  • 12. Building a distributed search engine Lucene - Analyzers    StopWords  ”the book is on the table” → [book, table] Stemming  [paint, paints, painted, …] → paint Synonims  [cat, feline] → cat
  • 13. Building a distributed search engine Lucene - Search options  Fields    Wildcards    Title: JUG body: ”JUG Lugano” J?G → [JUG, JAG, ...] J*G →[JUG, JEEG, JUNG, …] Fuzzy (basata su vocabolario)  JUG~[n] → [MUG, JAG, …]
  • 14. Building a distributed search engine Lucene - Search options  Range    Boost    JUG^5 Lugano ”JUG Lugano”^5 Proximity   Year: [2002 TO 2012] Name: {Alberto TO Andrea} ”JUG Lugano”~5 Boolean and existance  AND, OR, NOT, (), +, -
  • 15. Building a distributed search engine HDFS - Lucene Integration  File copy from/to HDFS  Patch IndexWriter/Director IndexWriter/Directory  Rewrite of IndexWriter on RAM  Lucene 4
  • 16. Building a distributed search engine And now... Hands on!