SlideShare a Scribd company logo
1 of 21
www.sti-innsbruck.at© Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at
Apache Lucene
Ioan Toma
based on slides from Aaron Bannert aaron@codemass.com
www.sti-innsbruck.at
What is Apache Lucene?
“Apache Lucene(TM) is a high-performance, full-featured text search
engine library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially cross-
platform.”
- from http://lucene.apache.org/
www.sti-innsbruck.at
Features
• Scalable, High-Performance Indexing
– over 95GB/hour on modern hardware
– small RAM requirements -- only 1MB heap
– incremental indexing as fast as batch indexing
– index size roughly 20-30% the size of text indexed
• Powerful, Accurate and Efficient Search Algorithms
– ranked searching -- best results returned first
– many powerful query types: phrase queries, wildcard queries, proximity queries, range
queries and more
– fielded searching (e.g., title, author, contents)
– date-range searching
– sorting by any field
– multiple-index searching with merged results
– allows simultaneous update and searching
www.sti-innsbruck.at
Features
• Cross-Platform Solution
– Available as Open Source software under the Apache License which
lets you use Lucene in both commercial and Open Source programs
– 100%-pure Java
– Implementations in other programming languages available that are
index-compatible
– CLucene - Lucene implementation in C++
– Lucene.Net - Lucene implementation in .NET
– Zend Search - Lucene implementation in the Zend Framework for PHP 5
www.sti-innsbruck.at
Ranked Searching
1. Phrase Matching
2. Keyword Matching
– Prefer more unique terms first
– Scoring and ranking takes into account the uniqueness of each term
when determining a document’s relevance score
www.sti-innsbruck.at
Flexible Queries
• Phrases
“star wars”
• Wildcards
star*
• Ranges
{star-stun}
[2006-2007]
• Boolean Operators
star AND wars
www.sti-innsbruck.at
Field-specific Queries
• Field-specific queries can be used to target specific
fields in the Document Index.
• For example
title:”star wars”
AND
director:”George Lucas”
www.sti-innsbruck.at
Sorting
• Can sort any field in a Document
– For example, by Price, Release Date, Amazon Sales Rank, etc…
• By default, Lucene will sort results by their relevance
score. Sorting by any other field in a Document is also
supported.
www.sti-innsbruck.at
Lucene inTernals
9
www.sti-innsbruck.at
Everything is a Document
• A document can represent anything textual:
– Word Document
– DVD (the textual metadata only)
– Website Member (name, ID, etc…)
• A Lucene Document need not refer to an actual file on a disk, it
could also resemble a row in a relational database.
• Developers are responsible for turning their own data sets into
Lucene Documents
• A document is seen as a list of fields, where a field has a
name an a value
www.sti-innsbruck.at
Indexes
• The unit of indexing in Lucene is a term. A term is often
a word.
• Indexes track term frequencies
• Every term maps back to a Document
• Lucene uses inverted index which allows Lucene to
quickly locate every document currently associated with
a given set up input search terms.
www.sti-innsbruck.at
Basic Indexing
1. Parse different types of
documents (HTML, PDF,
Word, text files, etc.)
2. Extract tokens and related info
(Lucene Analyser)
3. Add the Document to an Index
Lucene provide a standard
analyzer for English and latin
based languages
.
www.sti-innsbruck.at
Basic Searching
1. Create a Query
• (eg. by parsing user input)
2. Open an Index
3. Search the Index
• Use the same Analyzer as before
4. Iterate through returned Documents
• Extract out needed results
• Extract out result scores (if needed)
www.sti-innsbruck.at
Lucene as SOA
1. Design an HTTP query syntax
– GET queries
– XML for results
2. Wrap Tomcat around core code
3. Write a Client Library
As it follows SOA principles, basic building blocks such
as load balancers can be deployed to quickly scale up
the capacity of the search subsystem.
www.sti-innsbruck.at
Lucene as SOA Diagram
Single-Machine Architecture
Lucene-based Application includes three components
1. Lucene Custom Client Library
2. Search Service
3. Custom Core Search Library
www.sti-innsbruck.at
Lucene scalability
16
www.sti-innsbruck.at
Scalability Limits
• 3 main scalability factors:
– Query Rate
– Index Size
– Update Rate
www.sti-innsbruck.at
Query Rate Scalability
• Lucene is already fast
– Built-in caching
• Easy solution for heavy workloads:
(gives near-linear scaling)
– Add more query servers behind a load balancer
– Can grow as your traffic grows
www.sti-innsbruck.at
Lucene as SOA Diagram
High-Scale Multi-Machine Architecture
www.sti-innsbruck.at
Index Size Scalability
• Can easily handle millions of Documents
• Lucene is very commonly deployed into systems with
10s of millions of Documents.
• Main limits related to Index size that one is likely to run in
to will be disk capacity and disk I/O limits.
If you need bigger:
• Built-in multi-machine capabilities
– Can merge multiple remote indexes at query-time.
www.sti-innsbruck.at
Update Rate
• Lucene is threadsafe
– Can update and query at the same time
• I/O is limiting factor
Strategies for achieving even higher update rates:
– Vertical Split – for big workloads (Centralized Index Building)
– Build indexes apart from query service
– Push updated indexes on intervals
– Horizontal Split – for huge workloads
– Split data into columns
– Merge columns for queries
– Columns only receive their own data for updates

More Related Content

What's hot

Bookie storage - Apache BookKeeper Meetup - 2015-06-28
Bookie storage - Apache BookKeeper Meetup - 2015-06-28 Bookie storage - Apache BookKeeper Meetup - 2015-06-28
Bookie storage - Apache BookKeeper Meetup - 2015-06-28 Matteo Merli
 
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...Ashnikbiz
 
Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarStreamNative
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar clusterShivji Kumar Jha
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformMatteo Merli
 
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)Shivji Kumar Jha
 
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...StreamNative
 
What is apache Kafka?
What is apache Kafka?What is apache Kafka?
What is apache Kafka?Kenny Gorman
 
Aws overview part 1(iam and storage services)
Aws overview   part 1(iam and storage services)Aws overview   part 1(iam and storage services)
Aws overview part 1(iam and storage services)Parag Patil
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuningNIKHIL DUBEY
 
Uniform Resource Locator (URL)
Uniform Resource Locator (URL)Uniform Resource Locator (URL)
Uniform Resource Locator (URL)Mary Daine Napuli
 

What's hot (15)

Bookie storage - Apache BookKeeper Meetup - 2015-06-28
Bookie storage - Apache BookKeeper Meetup - 2015-06-28 Bookie storage - Apache BookKeeper Meetup - 2015-06-28
Bookie storage - Apache BookKeeper Meetup - 2015-06-28
 
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
 
Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache Pulsar
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar cluster
 
BigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearchBigData, NoSQL & ElasticSearch
BigData, NoSQL & ElasticSearch
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Apache Multiview Vulnerability
Apache Multiview VulnerabilityApache Multiview Vulnerability
Apache Multiview Vulnerability
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
 
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
 
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
 
What is apache Kafka?
What is apache Kafka?What is apache Kafka?
What is apache Kafka?
 
Aws overview part 1(iam and storage services)
Aws overview   part 1(iam and storage services)Aws overview   part 1(iam and storage services)
Aws overview part 1(iam and storage services)
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuning
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 
Uniform Resource Locator (URL)
Uniform Resource Locator (URL)Uniform Resource Locator (URL)
Uniform Resource Locator (URL)
 

Viewers also liked

Empirical evaluation-of-popular-mobile-hotel-booking-apps
Empirical evaluation-of-popular-mobile-hotel-booking-appsEmpirical evaluation-of-popular-mobile-hotel-booking-apps
Empirical evaluation-of-popular-mobile-hotel-booking-appsSTIinnsbruck
 
Common value management based on effective and efficent
Common value management   based on effective and efficentCommon value management   based on effective and efficent
Common value management based on effective and efficentSTIinnsbruck
 
On line value management 0
On line value management 0On line value management 0
On line value management 0STIinnsbruck
 
Deployment of rd_fa_microdata_microformats_on_the_web
Deployment of rd_fa_microdata_microformats_on_the_webDeployment of rd_fa_microdata_microformats_on_the_web
Deployment of rd_fa_microdata_microformats_on_the_webSTIinnsbruck
 
Engagement handouts
Engagement handoutsEngagement handouts
Engagement handoutsSTIinnsbruck
 
Dynamic broadcasting
Dynamic broadcastingDynamic broadcasting
Dynamic broadcastingSTIinnsbruck
 
Tv evaluation 12032014
Tv evaluation 12032014Tv evaluation 12032014
Tv evaluation 12032014STIinnsbruck
 

Viewers also liked (19)

Click bank
Click bankClick bank
Click bank
 
Empirical evaluation-of-popular-mobile-hotel-booking-apps
Empirical evaluation-of-popular-mobile-hotel-booking-appsEmpirical evaluation-of-popular-mobile-hotel-booking-apps
Empirical evaluation-of-popular-mobile-hotel-booking-apps
 
Paper 28
Paper 28Paper 28
Paper 28
 
Common value management based on effective and efficent
Common value management   based on effective and efficentCommon value management   based on effective and efficent
Common value management based on effective and efficent
 
Traveltainment
TraveltainmentTraveltainment
Traveltainment
 
Bowi
BowiBowi
Bowi
 
On line value management 0
On line value management 0On line value management 0
On line value management 0
 
Deployment of rd_fa_microdata_microformats_on_the_web
Deployment of rd_fa_microdata_microformats_on_the_webDeployment of rd_fa_microdata_microformats_on_the_web
Deployment of rd_fa_microdata_microformats_on_the_web
 
Tripwolf
TripwolfTripwolf
Tripwolf
 
Peer index
Peer indexPeer index
Peer index
 
Engagement handouts
Engagement handoutsEngagement handouts
Engagement handouts
 
Twibes
TwibesTwibes
Twibes
 
Opentravel
OpentravelOpentravel
Opentravel
 
Dynamic broadcasting
Dynamic broadcastingDynamic broadcasting
Dynamic broadcasting
 
Tnooz
TnoozTnooz
Tnooz
 
Tv evaluation 12032014
Tv evaluation 12032014Tv evaluation 12032014
Tv evaluation 12032014
 
Prezi
PreziPrezi
Prezi
 
Talent
TalentTalent
Talent
 
Pixmeaway
PixmeawayPixmeaway
Pixmeaway
 

Similar to Apache lucene

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Petter Skodvin-Hvammen
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallDr. Haxel Consult
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersNiko Neugebauer
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Vinay Kumar
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceAmazon Web Services
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureAmazon Web Services
 
Develop open source search engine
Develop open source search engineDevelop open source search engine
Develop open source search engineNAILBITER
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonChetan Giridhar
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finaleAjit More
 

Similar to Apache lucene (20)

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & Developers
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Create cloud service on AWS
Create cloud service on AWSCreate cloud service on AWS
Create cloud service on AWS
 
Ceate a Scalable Cloud Architecture
Ceate a Scalable Cloud ArchitectureCeate a Scalable Cloud Architecture
Ceate a Scalable Cloud Architecture
 
Develop open source search engine
Develop open source search engineDevelop open source search engine
Develop open source search engine
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in python
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finale
 

More from STIinnsbruck

More from STIinnsbruck (20)

Unister
UnisterUnister
Unister
 
Twoo
TwooTwoo
Twoo
 
Tweet deck 2012-01-02
Tweet deck 2012-01-02Tweet deck 2012-01-02
Tweet deck 2012-01-02
 
Tv handbook revised_100120141
Tv handbook revised_100120141Tv handbook revised_100120141
Tv handbook revised_100120141
 
Tv feratel 13032014
Tv feratel 13032014Tv feratel 13032014
Tv feratel 13032014
 
T vb publication_rules_11032014
T vb publication_rules_11032014T vb publication_rules_11032014
T vb publication_rules_11032014
 
T vb mapping_implementation_25032014
T vb mapping_implementation_25032014T vb mapping_implementation_25032014
T vb mapping_implementation_25032014
 
T vb alignment_022814_0
T vb alignment_022814_0T vb alignment_022814_0
T vb alignment_022814_0
 
Ttr 20130701
Ttr 20130701Ttr 20130701
Ttr 20130701
 
Ttg mapping to_schema.org_
Ttg mapping to_schema.org_Ttg mapping to_schema.org_
Ttg mapping to_schema.org_
 
Ttb 08042014
Ttb 08042014Ttb 08042014
Ttb 08042014
 
Trust you
Trust youTrust you
Trust you
 
Tripbirds
TripbirdsTripbirds
Tripbirds
 
Travelaudience
TravelaudienceTravelaudience
Travelaudience
 
Tourismuszukunft
TourismuszukunftTourismuszukunft
Tourismuszukunft
 
Tourismusverband innsbruck 24.09.2013
Tourismusverband innsbruck 24.09.2013Tourismusverband innsbruck 24.09.2013
Tourismusverband innsbruck 24.09.2013
 
Tourism link
Tourism linkTourism link
Tourism link
 
Tourcomm germany.com
Tourcomm germany.com Tourcomm germany.com
Tourcomm germany.com
 
Tor
TorTor
Tor
 
Toocan
ToocanToocan
Toocan
 

Recently uploaded

Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalFabian de Rijk
 
Ready Set Go Children Sermon about Mark 16:15-20
Ready Set Go Children Sermon about Mark 16:15-20Ready Set Go Children Sermon about Mark 16:15-20
Ready Set Go Children Sermon about Mark 16:15-20rejz122017
 
The Concession of Asaba International Airport: Balancing Politics and Policy ...
The Concession of Asaba International Airport: Balancing Politics and Policy ...The Concession of Asaba International Airport: Balancing Politics and Policy ...
The Concession of Asaba International Airport: Balancing Politics and Policy ...Kayode Fayemi
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...David Celestin
 
History of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth deathHistory of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth deathphntsoaki
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...ZurliaSoop
 
"I hear you": Moving beyond empathy in UXR
"I hear you": Moving beyond empathy in UXR"I hear you": Moving beyond empathy in UXR
"I hear you": Moving beyond empathy in UXRMegan Campos
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.thamaeteboho94
 
ECOLOGY OF FISHES.pptx full presentation
ECOLOGY OF FISHES.pptx full presentationECOLOGY OF FISHES.pptx full presentation
ECOLOGY OF FISHES.pptx full presentationFahadFazal7
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityHung Le
 
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORNLITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORNtntlai16
 
Using AI to boost productivity for developers
Using AI to boost productivity for developersUsing AI to boost productivity for developers
Using AI to boost productivity for developersTeri Eyenike
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfMahamudul Hasan
 
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINESBIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINESfuthumetsaneliswa
 
2024 mega trends for the digital workplace - FINAL.pdf
2024 mega trends for the digital workplace - FINAL.pdf2024 mega trends for the digital workplace - FINAL.pdf
2024 mega trends for the digital workplace - FINAL.pdfNancy Goebel
 
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptxBEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptxthusosetemere
 

Recently uploaded (19)

Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
Abortion Pills Fahaheel ௹+918133066128💬@ Safe and Effective Mifepristion and ...
Abortion Pills Fahaheel ௹+918133066128💬@ Safe and Effective Mifepristion and ...Abortion Pills Fahaheel ௹+918133066128💬@ Safe and Effective Mifepristion and ...
Abortion Pills Fahaheel ௹+918133066128💬@ Safe and Effective Mifepristion and ...
 
Ready Set Go Children Sermon about Mark 16:15-20
Ready Set Go Children Sermon about Mark 16:15-20Ready Set Go Children Sermon about Mark 16:15-20
Ready Set Go Children Sermon about Mark 16:15-20
 
The Concession of Asaba International Airport: Balancing Politics and Policy ...
The Concession of Asaba International Airport: Balancing Politics and Policy ...The Concession of Asaba International Airport: Balancing Politics and Policy ...
The Concession of Asaba International Airport: Balancing Politics and Policy ...
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
History of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth deathHistory of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth death
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
"I hear you": Moving beyond empathy in UXR
"I hear you": Moving beyond empathy in UXR"I hear you": Moving beyond empathy in UXR
"I hear you": Moving beyond empathy in UXR
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.
 
ECOLOGY OF FISHES.pptx full presentation
ECOLOGY OF FISHES.pptx full presentationECOLOGY OF FISHES.pptx full presentation
ECOLOGY OF FISHES.pptx full presentation
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORNLITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
 
Using AI to boost productivity for developers
Using AI to boost productivity for developersUsing AI to boost productivity for developers
Using AI to boost productivity for developers
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINESBIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
 
2024 mega trends for the digital workplace - FINAL.pdf
2024 mega trends for the digital workplace - FINAL.pdf2024 mega trends for the digital workplace - FINAL.pdf
2024 mega trends for the digital workplace - FINAL.pdf
 
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptxBEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
 
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait Cityin kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
 

Apache lucene

  • 1. www.sti-innsbruck.at© Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at Apache Lucene Ioan Toma based on slides from Aaron Bannert aaron@codemass.com
  • 2. www.sti-innsbruck.at What is Apache Lucene? “Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross- platform.” - from http://lucene.apache.org/
  • 3. www.sti-innsbruck.at Features • Scalable, High-Performance Indexing – over 95GB/hour on modern hardware – small RAM requirements -- only 1MB heap – incremental indexing as fast as batch indexing – index size roughly 20-30% the size of text indexed • Powerful, Accurate and Efficient Search Algorithms – ranked searching -- best results returned first – many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more – fielded searching (e.g., title, author, contents) – date-range searching – sorting by any field – multiple-index searching with merged results – allows simultaneous update and searching
  • 4. www.sti-innsbruck.at Features • Cross-Platform Solution – Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs – 100%-pure Java – Implementations in other programming languages available that are index-compatible – CLucene - Lucene implementation in C++ – Lucene.Net - Lucene implementation in .NET – Zend Search - Lucene implementation in the Zend Framework for PHP 5
  • 5. www.sti-innsbruck.at Ranked Searching 1. Phrase Matching 2. Keyword Matching – Prefer more unique terms first – Scoring and ranking takes into account the uniqueness of each term when determining a document’s relevance score
  • 6. www.sti-innsbruck.at Flexible Queries • Phrases “star wars” • Wildcards star* • Ranges {star-stun} [2006-2007] • Boolean Operators star AND wars
  • 7. www.sti-innsbruck.at Field-specific Queries • Field-specific queries can be used to target specific fields in the Document Index. • For example title:”star wars” AND director:”George Lucas”
  • 8. www.sti-innsbruck.at Sorting • Can sort any field in a Document – For example, by Price, Release Date, Amazon Sales Rank, etc… • By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.
  • 10. www.sti-innsbruck.at Everything is a Document • A document can represent anything textual: – Word Document – DVD (the textual metadata only) – Website Member (name, ID, etc…) • A Lucene Document need not refer to an actual file on a disk, it could also resemble a row in a relational database. • Developers are responsible for turning their own data sets into Lucene Documents • A document is seen as a list of fields, where a field has a name an a value
  • 11. www.sti-innsbruck.at Indexes • The unit of indexing in Lucene is a term. A term is often a word. • Indexes track term frequencies • Every term maps back to a Document • Lucene uses inverted index which allows Lucene to quickly locate every document currently associated with a given set up input search terms.
  • 12. www.sti-innsbruck.at Basic Indexing 1. Parse different types of documents (HTML, PDF, Word, text files, etc.) 2. Extract tokens and related info (Lucene Analyser) 3. Add the Document to an Index Lucene provide a standard analyzer for English and latin based languages .
  • 13. www.sti-innsbruck.at Basic Searching 1. Create a Query • (eg. by parsing user input) 2. Open an Index 3. Search the Index • Use the same Analyzer as before 4. Iterate through returned Documents • Extract out needed results • Extract out result scores (if needed)
  • 14. www.sti-innsbruck.at Lucene as SOA 1. Design an HTTP query syntax – GET queries – XML for results 2. Wrap Tomcat around core code 3. Write a Client Library As it follows SOA principles, basic building blocks such as load balancers can be deployed to quickly scale up the capacity of the search subsystem.
  • 15. www.sti-innsbruck.at Lucene as SOA Diagram Single-Machine Architecture Lucene-based Application includes three components 1. Lucene Custom Client Library 2. Search Service 3. Custom Core Search Library
  • 17. www.sti-innsbruck.at Scalability Limits • 3 main scalability factors: – Query Rate – Index Size – Update Rate
  • 18. www.sti-innsbruck.at Query Rate Scalability • Lucene is already fast – Built-in caching • Easy solution for heavy workloads: (gives near-linear scaling) – Add more query servers behind a load balancer – Can grow as your traffic grows
  • 19. www.sti-innsbruck.at Lucene as SOA Diagram High-Scale Multi-Machine Architecture
  • 20. www.sti-innsbruck.at Index Size Scalability • Can easily handle millions of Documents • Lucene is very commonly deployed into systems with 10s of millions of Documents. • Main limits related to Index size that one is likely to run in to will be disk capacity and disk I/O limits. If you need bigger: • Built-in multi-machine capabilities – Can merge multiple remote indexes at query-time.
  • 21. www.sti-innsbruck.at Update Rate • Lucene is threadsafe – Can update and query at the same time • I/O is limiting factor Strategies for achieving even higher update rates: – Vertical Split – for big workloads (Centralized Index Building) – Build indexes apart from query service – Push updated indexes on intervals – Horizontal Split – for huge workloads – Split data into columns – Merge columns for queries – Columns only receive their own data for updates

Editor's Notes

  1. <number>
  2. <number>
  3. <number>
  4. <number> Lucene is very smart about how it ranks results. Results are not simply returned for every document that has a matching term from the search query. For example, one way that Lucene is smart about scoring and ranking is how it takes into account the uniqueness of each term when determining a document’s relevance score.
  5. <number> This is just a small subset of the types of queries that Lucene can support. Some query types such as wildcard and range queries have a potential to cause heavy load on the Lucene server, so Lucene makes it easy to disable certain types of queries while allowing all others to proceed through the system. This gives programmers better control and allows the system performance to be more predictable.
  6. <number> Field-specific queries can be used to target specific fields in the Document Index. Note that this example uses a Boolean Query to unify two field-specific queries.
  7. <number> By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.
  8. <number> In Lucene, everything is a Document. A Lucene Document need not refer to an actual file on a disk, it could also resemble a row in a relational database. Each developer is responsible for turning their own data sets into Lucene Documents. Lucene comes with a number of 3rd party contributions, including examples for parsing structured data files such as XML documents and Word files.
  9. <number> The type of index used in Lucene and other full-text search engines is sometimes also called an “inverted index”. This index is what allows Lucene to quickly locate every document currently associated with a given set up input search terms.
  10. <number> Lucene comes with a default Analyzer which works well for unstructured English text, however it often performs incorrect normalizations on non-English texts. Lucene makes it easy to build custom Analyzers, and provides a number of helpful building blocks with which to build your own. Lucene even includes a number of “stemming” algorithms for various languages, which can improve document retrieval accuracy when the source language is known at indexing time.
  11. <number> It is important that Queries use the same (or very similar) Analyzer that was used when the index was created. The reason for this is due to the way that the Analyzer performs normalization computations on the input text. In order to find Documents using the same type of text that was used when indexing, that text must be normalized in the same way that the original data was normalized.
  12. <number> One common architecture within which to develop Lucene applications is the SOA, or Service-Oriented-Architecture. This decoupling of Lucene applications from a front-end UI-oriented web application is that it allows each service to scale independently of each other. Also, by using standard protocols such as HTTP and XML to communicate between the web application and the Lucene application, basic building blocks such as load balancers can be deployed to quickly scale up the capacity of the search subsystem.
  13. <number> This is an example of a simple Lucene-based Application deployed as a separate internal Service. In this example, the three components that make up the Lucene-based Application are the “Lucene Custom Client Library”, the “Search Service” and the “Custom Core Search Library”. In the ubiquitous MVC pattern that is commonly deployed at the Web Container tier of standard LAMP stacks, the “Lucene Custom Client Library” pictured here would plug directly into its own DAO in the persistence layer.
  14. <number>
  15. <number> Lucene was built from the beginning to be a high-performance full-text search engine. Despite the performance constraints of Java and the JVM, Lucene is blazingly fast.
  16. <number> Here is an example showing how a load balancer can be deployed to add additional capacity and redundancy to an internal Lucene-based Search Application. To achieve high-availability, we must simply deploy N+1 Lucene boxes, where N is the maximum number of boxes needed to service the peak search load.
  17. <number> Lucene is very commonly deployed into systems with 10s of millions of Documents. Although query performance can degrade as more Documents are added to the index, the growth factor is very low. The main limits related to Index size that you are likely to run in to will be disk capacity and disk I/O limits. Luckily for developers facing these limitations with their massive data sets, Lucene provides built-in methods to allow queries to span multiple remote Lucene indexes.
  18. <number> Update rate is often a limiting factor for Lucene applications which require up-to-the-second index updates. Higher levels of performance can be achieved by reducing the need for instant indexing. Further performance can be achieved through the following strategies: