Java One 2013 
Make Text Search “Work” 
for your Apps 
Anirban Mukherjee 
amukherjee@verisign.com 
Manish Maheshwari 
mmaheshwari@verisign.com 
08-May-2013
Speakers 
Anirban 
Software Architect, Verisign 
Manish 
Software Architect, Verisign 
Verisign Public 2
Agenda 
• Overview of Text Search 
• What is Text Search 
• Differences from traditional database search 
• Text Search implementation for regular web applications 
• Relational Databases vs Text Search Engines 
• Recommended design principles 
Overview of Text Search 
What is Text Search (1/2) 
• Also called Full-text search 
• Enter a few keywords 
• Get results fast with most relevant matches on top 
• Can work well on unstructured information 
• Documents e.g. resumes, papers 
• Free text fields like Titles and Descriptions 
• Non-exact or approximate matches may be returned 
What is Text Search (2/2) 
• Origins in document processing systems and web search 
• Now a de-facto requirement for regular web applications 
• Enterprise applications 
• Cloud apps / SaaS solutions for Enterprises 
• Expanding into Real-time analytics 
• We will focus on apps with relational database stores 
• Unique challenges 
• Relational stores don’t fit text engines as naturally as document-oriented 
data stores 
• Frequent entity modifications are usually involved 
Lookup-style search: Example 
• Explicit fields 
• No relevance ranking for results 
• Traditional RDBMS style implementation using SQL 
• Wildcards can be used for partial matches (uses SQL “like”) 
• Limits on results, pagination often absent 
Text Search: Example 
• No explicit fields specified in input (may be present in Advanced Search) 
• Keyword based operation 
• Results are ordered by relevance and paginated 
• No need to use wildcards 
• Auto-suggestion is often present while typing input keywords 
More Text Search Examples 
• Single input field or multiple fields combined with booleans 
• Usability considerations come into play 
• Terms or keywords have to be input in both cases 
Text Search features: Summary 
• Term based search at fast speeds 
• index returns docs matching input terms fast 
• boolean AND, OR, NOT combinations can be used 
• Relevance 
• usually based on TF (frequency of the term in a document) and IDF 
(rarity of the term across all documents) 
• other factors can be incorporated if needed 
• Approximate matches 
• Stemming and Synonyms 
• Fuzzy matches and spelling auto-corrections 
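The TF and IDF factors mentioned above can be illustrated with a minimal sketch. It assumes a naive whitespace tokenizer and the plain formula score = tf * log(N / df); the names `tokenize` and `tf_idf_scores` and the sample documents are hypothetical, and real engines use smoothed, length-normalized variants of this idea.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Return {doc_id: score} for documents containing at least one query term.

    docs maps doc_id -> text. The score is the sum over query terms of
    tf * idf, where tf is the term count in the document and
    idf = log(N / df) with df the number of documents containing the term.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for term in set(tokenize(query)):
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # term is not in the index at all
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return scores


# Hypothetical sample documents.
docs = {
    1: "java search engine",
    2: "java java tutorial",
    3: "search engine design for search apps",
}
# Most relevant first: document 3 mentions "search" twice, document 1 once.
ranked = sorted(tf_idf_scores("search", docs).items(), key=lambda kv: -kv[1])
```

Document 2 does not contain the term, so it is absent from the result rather than scored zero.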
Inverted Index/ Text index 
• Helps in fast retrieval of documents matching terms 
• Index creation involves a good deal of processing 
• Different fields in a document can be indexed differently 
• Indexing is very closely tied to search queries 
• Text Engines can handle many indexes 
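As a rough illustration of why an inverted index makes term lookups fast, here is a minimal in-memory sketch. It maps each term to the set of document ids containing it, so boolean AND, OR and NOT become set operations. All function names and the sample titles are assumptions for illustration; production engines add analysis, compression and on-disk segments on top of this idea.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Docs containing every term (intersection of posting sets)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Docs containing at least one term (union of posting sets)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_not(index, all_ids, term):
    """Docs that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical sample documents.
docs = {1: "Lucene in Action", 2: "Solr in Action", 3: "Hibernate Search"}
index = build_inverted_index(docs)
```

A lookup touches only the posting sets of the query terms, never the full document texts, which is what keeps retrieval fast as the corpus grows.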
Popular Java-based Text Search 
libraries and platforms 
RDBMS Full-text Search components 
• Proprietary extensions to SQL to support text search 
• Pros: Single data source for Apps 
• Apps can interact with the database only 
• Cons: Limits on flexibility, portability and perhaps scalability 
Typical Text Search App architecture 
• RelationalEntity – TextDoc mappings have to be done 
properly 
• Only a subset of data should go to text index 
• DB is primary datastore 
• Text searches always hit text index first 
Location of the Text Search Engine 
• Library/plugin 
• Lucene 
• Hibernate Search 
• Database Full-text 
• Oracle Text 
• MySQL Full-text 
• Search servers 
• Solr 
• Elasticsearch 
Relational Databases vs Text 
Engines 
RDBMS vs Text Engine: Structural Mismatch 
• Relational databases 
• many data types 
• tables represent entities 
• entities have relationships between them 
• normalized schema and joins 
• Text engines 
• fundamentally, the only type is string 
• flat documents 
• no relationships between documents 
• joins between documents are not supported 
• Relationships have to be flattened and embedded into text documents 
• duplication of data 
• can be difficult to implement 
• relationships can be complex and 2-way too 
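To make the flattening concrete, here is a small sketch under assumed names: a hypothetical `book` row that references an `author` row is turned into one flat text document with the author's name embedded. The second function shows the cost of that duplication: a change to one author row means every book document embedding it has to be re-indexed.

```python
def flatten_book(book, authors_by_id):
    """Build a flat text document for a book, embedding the related author."""
    author = authors_by_id[book["author_id"]]
    return {
        "doc_type": "book",
        "id": book["id"],
        "title": book["title"],
        "author_name": author["name"],  # duplicated from the author row
    }


def docs_affected_by_author_change(author_id, books):
    """Ids of book documents that must be re-indexed when an author changes."""
    return [b["id"] for b in books if b["author_id"] == author_id]


# Hypothetical relational rows.
authors = {10: {"id": 10, "name": "A. Writer"}}
books = [
    {"id": 1, "title": "First Book", "author_id": 10},
    {"id": 2, "title": "Second Book", "author_id": 10},
]
doc = flatten_book(books[0], authors)
```

Here a single rename of author 10 touches both book documents, which is the fan-out that makes synchronization fragile.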
RDBMS vs Text Engine: Sync Mismatch 
• Data updates have to be performed in two different places 
• RDBMS and Text Engine 
• Structural mismatch can make this fragile 
• change to a single entity can affect many documents 
• updates occur from many places in the app 
• Text engines are not transactional like RDBMS 
• Not all Text Engines are near real-time capable 
• Elasticsearch focuses on near real-time updates 
• “commit” for Text engines is expensive 
RDBMS vs Text Engine: Retrieval Mismatch 
• Text Engine should typically have only a subset of the full 
data 
• Text index is not a database 
• Too much data in the index makes it slow 
• Purpose of text index is to provide initial result page(s) 
• Document type plus entity primary key from the database 
uniquely identifies a document 
• Represents an entity (often partial) 
• Full details can be retrieved from database 
• Ideally should use at most a single database query per result view 
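The retrieval pattern above can be sketched as follows, assuming a hypothetical `book` table and a key format of `doc_type:primary_key`. The text engine returns only keys for one result page, and the full rows are then loaded with a single `IN` query, keeping the relevance order of the hits. SQLite is used here only to keep the sketch self-contained, and it assumes all hits on the page are of one document type.

```python
import sqlite3


def doc_key(doc_type, primary_key):
    """Document type plus database primary key identifies a text document."""
    return f"{doc_type}:{primary_key}"


def parse_doc_key(key):
    doc_type, _, pk = key.partition(":")
    return doc_type, int(pk)


def fetch_page_details(conn, hit_keys):
    """Load full rows for one result page with a single query, in hit order."""
    ids = [parse_doc_key(k)[1] for k in hit_keys]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, description FROM book WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Re-order to match relevance order; skip hits whose row no longer exists.
    return [by_id[i] for i in ids if i in by_id]


# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [(1, "A", "first"), (2, "B", "second"), (3, "C", "third")],
)
page = fetch_page_details(conn, [doc_key("book", 3), doc_key("book", 1)])
```

Silently skipping a hit with no matching row is one possible choice; it also happens to be a place where index and database drift becomes visible and can be logged.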
Design Principles for Text 
Search Apps 
Design Principles for Text Search apps 
• We consider regular web apps which have relational 
databases as the primary data source 
• User confidence in the search solution is vital 
• Some principles may require thinking that departs from 
traditional database apps 
P1: The most basic searches must work perfectly 
first 
Problem: If the app does not return good results for the basic 
cases, users will lose faith very easily. 
• E.g.: If an exact Title is entered, user certainly expects it to be 
listed right on top 
• Stemming, synonyms etc. must not jeopardize exact matches 
• Precision is more important than recall 
• Test cases should cover these elaborately 
• Make it clear to users that matches are primarily keyword based 
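One way to picture the "exact match on top" expectation is a re-ranking step like the sketch below. It is only an application-side illustration with a hypothetical `rank_hits` helper; in practice the same effect is more often achieved inside the engine, for example by boosting matches on an unstemmed copy of the title field. The same function doubles as the kind of basic test case this principle calls for.

```python
def rank_hits(query, hits):
    """Order hits so exact title matches come first, then by engine score.

    hits is a list of (title, engine_score) pairs. The sort key puts
    exact (case-insensitive) title matches ahead of everything else and
    breaks ties by descending score.
    """
    q = query.strip().lower()
    return sorted(hits, key=lambda h: (h[0].strip().lower() != q, -h[1]))


# Hypothetical engine output where a partial match outscored the exact title.
hits = [("Managing Search", 9.1), ("Search", 2.0), ("Search Engines", 5.0)]
ranked = rank_hits("search", hits)
```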
P2: Text Indexes should be used for all applicable 
views of the data (1/2) 
Problem: Sync mismatch can cause loss of confidence 
since data showing up in the tables may not be showing up 
in searches. 
• The data mismatch may arise due to regular indexing 
delays or application bugs. 
• Avoid views built directly from the database tables while 
bypassing the text index 
• Detection of indexing issues/errors happens early 
• corrective action can be taken fast 
P2: Text Indexes should be used for all applicable 
views of the data (2/2) 
• Admin views can have a secondary option to look up the 
database directly in case of problems 
• Elasticsearch and latest versions of Solr strive to make 
index updates available in near real-time 
P3: Accommodate regular Text index re-creation 
(1/2) 
Problem: Index re-creation can be time consuming and 
involve application downtime. 
• Improvements and enhancements to text search typically require full 
index re-creation. 
• Text indexes may also get out of sync with the primary database 
store due to errors and bugs. 
• Text indexes are not as resilient or robust as databases with respect 
to durability. 
P3: Accommodate regular Text index re-creation 
(2/2) 
• Embrace the need for full index re-creation 
• Devise ways to do it smoothly on demand and regularly 
• Strategy 1: Keep alternate indexes in active/passive. 
Periodically, 
• re-create the passive and switch it to active mode 
• switch the old active to passive mode (to be re-created next) 
• Strategy 2: Store timestamp for every doc at indexing time 
• re-index all documents using the database data 
• Remove all docs with timestamp older than the re-index start time 
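Strategy 2 can be sketched with an in-memory stand-in for the text index. The `TextIndex` class, `full_reindex` function and the injected `clock` are all hypothetical; a real engine would store the timestamp as a document field and remove stale documents with a delete-by-query on that field.

```python
import itertools


class TextIndex:
    """In-memory stand-in for a text index: doc key -> (fields, indexed_at)."""

    def __init__(self):
        self.docs = {}

    def put(self, key, fields, indexed_at):
        self.docs[key] = (fields, indexed_at)

    def delete_older_than(self, cutoff):
        """Remove docs indexed strictly before cutoff; return their keys."""
        stale = [k for k, (_, ts) in self.docs.items() if ts < cutoff]
        for k in stale:
            del self.docs[k]
        return stale


def full_reindex(index, db_rows, clock):
    """Re-index everything from the database, then purge what was not touched."""
    start = clock()
    for key, fields in db_rows.items():
        index.put(key, fields, clock())  # overwrites the old version in place
    return index.delete_older_than(start)


# Existing index contents: one outdated doc and one whose row was deleted.
index = TextIndex()
index.put("book:1", {"title": "Old title"}, 0)
index.put("book:9", {"title": "Deleted in DB"}, 0)

ticks = itertools.count(1)  # deterministic clock for the example
removed = full_reindex(index, {"book:1": {"title": "New title"}}, lambda: next(ticks))
```

Old documents stay searchable until the final purge, so searches keep working while the re-index runs.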
P4: Indexing and Searches are closely tied - think of 
both together 
Problem: Enhancements are needed to the search. Addition 
of more searchable data is breaking older stuff. 
• Unlike in the database, index updates are strongly coupled 
to the types of queries 
• not viable to do data modeling work first and think of queries later 
• Strive to limit the amount of indexed data 
• Bulk indexing is much slower than bulk database loads 
• Scale out the search servers as data grows 
• Performance testing is needed with a focus on frequent searches 
P5: Avoid treating the Text Engine as a relational 
store 
Problem: Searches have become really slow as the data has 
grown. Each subsequent page also takes a long time to load. 
• Anti-pattern: Direct one-to-one table to doc mapping with “joins” 
inside the App 
• Text engines are not relational databases 
• App-side joins tend to break down as data grows, since they may involve many 
Text engine queries 
• Strive to make the summary results load directly from the Text 
Engine 
• Initial results list page should have minimal fields 
• Only minimum essential fields have to be in the index 
• Avoid sorts on many fields, consider faceting instead 
P6: Avoid wildcards in user input (1/2) 
Problem: Users are not fully satisfied with keyword based 
matches. They want partial matches within the keywords too. 
• Search engines allow wildcards but there are major pitfalls 
• Relevance is lost, results are returned in arbitrary order similar to 
SQL “like” or grep 
• If stemming is in use, stems and not the original terms are 
present in the index. So wildcards may not give expected matches 
• E.g. management has Porter stem manag which is what gets into the 
index. So it no longer matches the wildcard pattern manage* 
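The stemming pitfall can be reproduced in a few lines. This sketch hard-codes the single stem quoted above ("management" to "manag") in a hypothetical lookup table instead of running a real Porter stemmer, and uses shell-style pattern matching from the standard library to stand in for the engine's wildcard query.

```python
from fnmatch import fnmatchcase

# Assumption: stand-in for a stemmer, containing only the example from the text.
STEMS = {"management": "manag"}


def stem(term):
    return STEMS.get(term, term)


def index_terms(text):
    """Terms as they would be stored in the index: lowercased and stemmed."""
    return {stem(t) for t in text.lower().split()}


def wildcard_match(pattern, terms):
    """Index terms matching a wildcard pattern (the pattern is not stemmed)."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


terms = index_terms("Project Management Handbook")
misses = wildcard_match("manage*", terms)   # nothing in the index starts with "manage"
hits = wildcard_match("manag*", terms)      # only the stem itself matches
keyword_found = stem("management") in terms  # a plain keyword query still works
```

The plain keyword query succeeds because the query term goes through the same stemming as the indexed text, which a wildcard pattern does not.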
P6: Avoid wildcards in user input (2/2) 
• Make use of auto-suggestion on a small number of 
important fields as the user types the input 
• Tends to be quite performant and lightweight if implemented 
properly 
• Can usually be implemented with edge n-grams for prefix matches 
• Try to avoid full n-grams for arbitrary substring matches 
• Number of edge n-grams is O(L), number of full n-grams is O(L²) 
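The difference in growth is easy to check with a short sketch. For a term of length L, edge n-grams are its L prefixes, while full n-grams are all of its substrings, L(L+1)/2 of them counting repeats. The function names and the `min_len` parameter are assumptions for illustration.

```python
def edge_ngrams(term, min_len=1):
    """All prefixes of the term with at least min_len characters."""
    return [term[:i] for i in range(min_len, len(term) + 1)]


def full_ngrams(term, min_len=1):
    """All substrings of the term with at least min_len characters."""
    n = len(term)
    return [term[i:j] for i in range(n) for j in range(i + min_len, n + 1)]


prefixes = edge_ngrams("solr")
substrings = full_ngrams("solr")
```

For the 4-letter term above that is 4 index entries versus 10; for a 6-letter term such as "lucene" it is already 6 versus 21, which is why full n-grams inflate the index so quickly.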
Popular form of Auto-suggestion today 
P7: Analyze and improve 
Problem: Things are evolving rapidly and data volumes are 
increasing. It is hard to keep pace and improve performance 
and user experience. 
• Logs should be regularly analyzed for user behavior 
• Performance testing needs to be done at higher loads 
• Platform upgrades may be a reality 
• Rate-limiting needs to be implemented 
• But changes need to be resisted too … 
Conclusion 
• Text search is still evolving rapidly 
• Lucene is 12+ years old but is still very active 
• along with Solr and Elasticsearch 
• Cloud apps and high traffic websites need to scale up 
constantly 
• Relational database backends are not going away soon 
• Good Text search designs will continue to help 
• Enterprise search is now expanding to real-time analytics 
References 
• Hibernate Search in Action, Manning Publishers 
• http://www.elasticsearch.org 
• http://www.lucidworks.com/ 
Thank You 
© 2013 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and 
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United 
States and in foreign countries. All other trademarks are property of their respective owners.

More Related Content

What's hot

Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
semanticsconference
 
Top 5 Considerations When Evaluating NoSQL
Top 5 Considerations When Evaluating NoSQLTop 5 Considerations When Evaluating NoSQL
Top 5 Considerations When Evaluating NoSQL
MongoDB
 
Building a scalable search architecture in share point 2013
Building a scalable search architecture in share point 2013Building a scalable search architecture in share point 2013
Building a scalable search architecture in share point 2013
Terrence Nguyen
 
SharePoint 2013 Search Topology and Optimization
SharePoint 2013 Search Topology and OptimizationSharePoint 2013 Search Topology and Optimization
SharePoint 2013 Search Topology and Optimization
Mike Maadarani
 
Search-Driven Applications with SharePoint 2013 (#SBSBE16)
Search-Driven Applications with SharePoint 2013 (#SBSBE16)Search-Driven Applications with SharePoint 2013 (#SBSBE16)
Search-Driven Applications with SharePoint 2013 (#SBSBE16)
Maximilian Melcher
 
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
John F. Holliday
 
SharePoint 2013 – the upgrade story
SharePoint 2013 – the upgrade storySharePoint 2013 – the upgrade story
SharePoint 2013 – the upgrade story
SPC Adriatics
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databases
Ebenezer Daniel
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
Petter Skodvin-Hvammen
 
SharePoint 2013 search improvements
SharePoint 2013 search improvementsSharePoint 2013 search improvements
SharePoint 2013 search improvements
Kunaal Kapoor
 
Inside the mind of a SharePoint Solutions Architect
Inside the mind of a SharePoint Solutions ArchitectInside the mind of a SharePoint Solutions Architect
Inside the mind of a SharePoint Solutions Architect
Noorez Khamis
 
10 Things I Like in SharePoint 2013 Search
10 Things I Like in SharePoint 2013 Search10 Things I Like in SharePoint 2013 Search
10 Things I Like in SharePoint 2013 Search
SPC Adriatics
 
Data Security and Protection in DevOps
Data Security and Protection in DevOps Data Security and Protection in DevOps
Data Security and Protection in DevOps
Karen Lopez
 
Алексей Веркеенко "Symfony2 & REST API"
Алексей Веркеенко "Symfony2 & REST API" Алексей Веркеенко "Symfony2 & REST API"
Алексей Веркеенко "Symfony2 & REST API"
Fwdays
 
From the field! PowerApps in production
From the field! PowerApps in productionFrom the field! PowerApps in production
From the field! PowerApps in production
Nicolas Georgeault
 
How to Design a Good Database for Your Application
How to Design a Good Database for Your ApplicationHow to Design a Good Database for Your Application
How to Design a Good Database for Your Application
Nur Hidayat
 
Taking Cross References to the Next Level: Reltables for Non-Topic Elements
Taking Cross References to the Next Level: Reltables for Non-Topic ElementsTaking Cross References to the Next Level: Reltables for Non-Topic Elements
Taking Cross References to the Next Level: Reltables for Non-Topic Elements
Contrext Solutions
 
Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016
Martin Voigt
 
W-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database PerformanceW-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database Performance
Alois Reitbauer
 
Logical architecture considerations for SharePoint 2013
Logical architecture considerations for SharePoint 2013Logical architecture considerations for SharePoint 2013
Logical architecture considerations for SharePoint 2013
Dinusha Kumarasiri
 

What's hot (20)

Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
 
Top 5 Considerations When Evaluating NoSQL
Top 5 Considerations When Evaluating NoSQLTop 5 Considerations When Evaluating NoSQL
Top 5 Considerations When Evaluating NoSQL
 
Building a scalable search architecture in share point 2013
Building a scalable search architecture in share point 2013Building a scalable search architecture in share point 2013
Building a scalable search architecture in share point 2013
 
SharePoint 2013 Search Topology and Optimization
SharePoint 2013 Search Topology and OptimizationSharePoint 2013 Search Topology and Optimization
SharePoint 2013 Search Topology and Optimization
 
Search-Driven Applications with SharePoint 2013 (#SBSBE16)
Search-Driven Applications with SharePoint 2013 (#SBSBE16)Search-Driven Applications with SharePoint 2013 (#SBSBE16)
Search-Driven Applications with SharePoint 2013 (#SBSBE16)
 
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
SPEVO13 - Dev212 - Document Assembly Deep Dive Part 1
 
SharePoint 2013 – the upgrade story
SharePoint 2013 – the upgrade storySharePoint 2013 – the upgrade story
SharePoint 2013 – the upgrade story
 
Evaluation criteria for nosql databases
Evaluation criteria for nosql databasesEvaluation criteria for nosql databases
Evaluation criteria for nosql databases
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
SharePoint 2013 search improvements
SharePoint 2013 search improvementsSharePoint 2013 search improvements
SharePoint 2013 search improvements
 
Inside the mind of a SharePoint Solutions Architect
Inside the mind of a SharePoint Solutions ArchitectInside the mind of a SharePoint Solutions Architect
Inside the mind of a SharePoint Solutions Architect
 
10 Things I Like in SharePoint 2013 Search
10 Things I Like in SharePoint 2013 Search10 Things I Like in SharePoint 2013 Search
10 Things I Like in SharePoint 2013 Search
 
Data Security and Protection in DevOps
Data Security and Protection in DevOps Data Security and Protection in DevOps
Data Security and Protection in DevOps
 
Алексей Веркеенко "Symfony2 & REST API"
Алексей Веркеенко "Symfony2 & REST API" Алексей Веркеенко "Symfony2 & REST API"
Алексей Веркеенко "Symfony2 & REST API"
 
From the field! PowerApps in production
From the field! PowerApps in productionFrom the field! PowerApps in production
From the field! PowerApps in production
 
How to Design a Good Database for Your Application
How to Design a Good Database for Your ApplicationHow to Design a Good Database for Your Application
How to Design a Good Database for Your Application
 
Taking Cross References to the Next Level: Reltables for Non-Topic Elements
Taking Cross References to the Next Level: Reltables for Non-Topic ElementsTaking Cross References to the Next Level: Reltables for Non-Topic Elements
Taking Cross References to the Next Level: Reltables for Non-Topic Elements
 
Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016
 
W-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database PerformanceW-JAX Performance Workshop - Database Performance
W-JAX Performance Workshop - Database Performance
 
Logical architecture considerations for SharePoint 2013
Logical architecture considerations for SharePoint 2013Logical architecture considerations for SharePoint 2013
Logical architecture considerations for SharePoint 2013
 

Viewers also liked

Institutions and the market for corporate control in Asia: an empirical analysis
Institutions and the market for corporate control in Asia: an empirical analysisInstitutions and the market for corporate control in Asia: an empirical analysis
Institutions and the market for corporate control in Asia: an empirical analysis
Alberto Asquer
 
MY C.V
MY C.VMY C.V
The Future Company
The Future CompanyThe Future Company
The Future Company
재원 서
 
Brunch Menu March 2015
Brunch Menu March 2015Brunch Menu March 2015
Brunch Menu March 2015
Caley Chastain
 
El futuro de la prensa
El futuro de la prensaEl futuro de la prensa
El futuro de la prensa
Jordi Benítez
 
Lezione di strategia aziendale
Lezione di strategia aziendaleLezione di strategia aziendale
Lezione di strategia aziendale
Alberto Asquer
 
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
Frederick Lane
 
6th Semester VTU BE ME question papers from 2010 to Dec 2015
6th Semester VTU BE ME question papers from 2010 to Dec 20156th Semester VTU BE ME question papers from 2010 to Dec 2015
6th Semester VTU BE ME question papers from 2010 to Dec 2015
Coorg Institute of Technology, Department Of Library & Information Center , Ponnampet
 
Companies-in-Government
Companies-in-GovernmentCompanies-in-Government
Companies-in-Government
Matthew Rees
 
Globalización
GlobalizaciónGlobalización
Globalización
Jordi Benítez
 
Alvin Roth, Nobel Prize 2012
Alvin Roth, Nobel Prize 2012Alvin Roth, Nobel Prize 2012
Alvin Roth, Nobel Prize 2012
Jordi Benítez
 
Leanlondon 19sep13
Leanlondon 19sep13Leanlondon 19sep13
Leanlondon 19sep13
Kinetik Solutions Ltd
 
7th Semester VTU BE ME question papers from 2010 to June 2016
7th Semester VTU BE ME question papers from 2010 to June 20167th Semester VTU BE ME question papers from 2010 to June 2016
7th Semester VTU BE ME question papers from 2010 to June 2016
Coorg Institute of Technology, Department Of Library & Information Center , Ponnampet
 
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
FAO
 
Food preservation
Food preservationFood preservation
Food preservation
University
 
Resume
ResumeResume
Resume
madhav v
 
FISTF World Cup portfolio
FISTF World Cup portfolioFISTF World Cup portfolio
FISTF World Cup portfolio
Alan Collins
 

Viewers also liked (17)

Institutions and the market for corporate control in Asia: an empirical analysis
Institutions and the market for corporate control in Asia: an empirical analysisInstitutions and the market for corporate control in Asia: an empirical analysis
Institutions and the market for corporate control in Asia: an empirical analysis
 
MY C.V
MY C.VMY C.V
MY C.V
 
The Future Company
The Future CompanyThe Future Company
The Future Company
 
Brunch Menu March 2015
Brunch Menu March 2015Brunch Menu March 2015
Brunch Menu March 2015
 
El futuro de la prensa
El futuro de la prensaEl futuro de la prensa
El futuro de la prensa
 
Lezione di strategia aziendale
Lezione di strategia aziendaleLezione di strategia aziendale
Lezione di strategia aziendale
 
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
2015-09-21 Cyberethics for Educators: The Rising Cost of Digital Misconduct
 
6th Semester VTU BE ME question papers from 2010 to Dec 2015
6th Semester VTU BE ME question papers from 2010 to Dec 20156th Semester VTU BE ME question papers from 2010 to Dec 2015
6th Semester VTU BE ME question papers from 2010 to Dec 2015
 
Companies-in-Government
Companies-in-GovernmentCompanies-in-Government
Companies-in-Government
 
Globalización
GlobalizaciónGlobalización
Globalización
 
Alvin Roth, Nobel Prize 2012
Alvin Roth, Nobel Prize 2012Alvin Roth, Nobel Prize 2012
Alvin Roth, Nobel Prize 2012
 
Leanlondon 19sep13
Leanlondon 19sep13Leanlondon 19sep13
Leanlondon 19sep13
 
7th Semester VTU BE ME question papers from 2010 to June 2016
7th Semester VTU BE ME question papers from 2010 to June 20167th Semester VTU BE ME question papers from 2010 to June 2016
7th Semester VTU BE ME question papers from 2010 to June 2016
 
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
Nicaragua - Datos a nivel comunitario, IV Censo Agropecuario 2011
 
Food preservation
Food preservationFood preservation
Food preservation
 
Resume
ResumeResume
Resume
 
FISTF World Cup portfolio
FISTF World Cup portfolioFISTF World Cup portfolio
FISTF World Cup portfolio
 

Similar to Make Text Search "Work" for Your Apps - JavaOne 2013

Got documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionGot documents - The Raven Bouns Edition
Got documents - The Raven Bouns Edition
Maggie Pint
 
Got documents Code Mash Revision
Got documents Code Mash RevisionGot documents Code Mash Revision
Got documents Code Mash Revision
Maggie Pint
 
Got documents?
Got documents?Got documents?
Got documents?
Maggie Pint
 
ASP.NET Core Demos Part 2
ASP.NET Core Demos Part 2ASP.NET Core Demos Part 2
ASP.NET Core Demos Part 2
Erik Noren
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
S. Diana Hu
 
Web Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated SearchWeb Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated Search
Nikesh Narayanan
 
01-Database Administration and Management.pdf
01-Database Administration and Management.pdf01-Database Administration and Management.pdf
01-Database Administration and Management.pdf
TOUSEEQHAIDER14
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Simon Hughes
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Simon Hughes
 
Evaluation of web scale discovery services
Evaluation of web scale discovery servicesEvaluation of web scale discovery services
Evaluation of web scale discovery services
Nikesh Narayanan
 
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuGraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
Neo4j
 
Intern Project Showcase.pptx
Intern Project Showcase.pptxIntern Project Showcase.pptx
Intern Project Showcase.pptx
ritikgarg48
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
Ike Ellis
 
dbms introduction.pptx
dbms introduction.pptxdbms introduction.pptx
dbms introduction.pptx
ATISHAYJAIN847270
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
Radhika R
 
Optimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptxOptimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptx
JasonTuran2
 
DBMS Bascis
DBMS BascisDBMS Bascis
IN106 Performance with MongoDB
IN106 Performance with MongoDBIN106 Performance with MongoDB
IN106 Performance with MongoDB
Kim Greene Consulting, Inc.
 
Bi 5
Bi 5Bi 5
Bi 5
shivz3
 

Similar to Make Text Search "Work" for Your Apps - JavaOne 2013 (20)

Got documents - The Raven Bouns Edition
Got documents - The Raven Bouns EditionGot documents - The Raven Bouns Edition
Got documents - The Raven Bouns Edition
 
Got documents Code Mash Revision
Got documents Code Mash RevisionGot documents Code Mash Revision
Got documents Code Mash Revision
 
Got documents?
Got documents?Got documents?
Got documents?
 
ASP.NET Core Demos Part 2
ASP.NET Core Demos Part 2ASP.NET Core Demos Part 2
ASP.NET Core Demos Part 2
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Web Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated SearchWeb Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated Search
 
01-Database Administration and Management.pdf
01-Database Administration and Management.pdf01-Database Administration and Management.pdf
01-Database Administration and Management.pdf
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
Evaluation of web scale discovery services
Evaluation of web scale discovery servicesEvaluation of web scale discovery services
Evaluation of web scale discovery services
 
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuGraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
 
Intern Project Showcase.pptx
Intern Project Showcase.pptxIntern Project Showcase.pptx
Intern Project Showcase.pptx
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
 
dbms introduction.pptx
dbms introduction.pptxdbms introduction.pptx
dbms introduction.pptx
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
 
Optimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptxOptimizing Application Performance - 2022.pptx
Optimizing Application Performance - 2022.pptx
 
DBMS Bascis
DBMS BascisDBMS Bascis
DBMS Bascis
 
IN106 Performance with MongoDB
IN106 Performance with MongoDBIN106 Performance with MongoDB
IN106 Performance with MongoDB
 
Bi 5
Bi 5Bi 5
Bi 5
 

Make Text Search "Work" for Your Apps - JavaOne 2013

  • 1. Java One 2013 Make Text Search “Work” for your Apps Anirban Mukherjee amukherjee@verisign.com Manish Maheshwari mmaheshwari@verisign.com 08-May-2013
  • 2. Speakers Anirban Software Architect, Verisign Manish Software Architect, Verisign Verisign Public 2
  • 3. Agenda • Overview of Text Search • What is Text Search • Differences from traditional database search • Text Search implementation for regular web applications • Relational Databases vs Text Search Engines • Recommended design principles
  • 4. Overview of Text Search
  • 5. What is Text Search (1/2) • Also called Full-text search • Enter a few keywords • Get results fast with most relevant matches on top • Can work well on unstructured information • Documents e.g. resumes, papers • Free text fields like Titles and Descriptions • Non-exact or approximate matches may be returned
  • 6. What is Text Search (2/2) • Origins in document processing systems and web search • Now a de-facto requirement for regular web applications • Enterprise applications • Cloud apps / SaaS solutions for Enterprises • Expanding into Real-time analytics • We will focus on apps with relational database stores • Unique challenges • Don’t fit Text engines as naturally as Document oriented data stores • Frequent entity modifications are usually involved
  • 7. Lookup-style search: Example • Explicit fields • No relevance ranking for results • Traditional RDBMS style implementation using SQL • Wildcards can be used for partial matches (uses SQL “like”) • Limits on results, pagination often absent
  • 8. Text Search: Example • No explicit fields specified in input (may be present in Advanced Search) • Keyword based operation • Results are ordered by relevance and paginated • No need to use wildcards • Auto-suggestion is often present while typing input keywords
  • 9. More Text Search Examples • Single input field or multiple fields combined with booleans • Usability considerations come into play • Terms or keywords have to be input in both cases
  • 10. Text Search features: Summary • Term based search at fast speeds • index returns docs matching input terms fast • boolean AND, OR, NOT combinations can be used • Relevance • usually based on TF (frequency of the term in a document) and IDF (rarity of the term across all documents) • other factors can be incorporated if needed • Approximate matches • Stemming and Synonyms • Fuzzy matches and spelling auto-corrections
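As a rough illustration of the TF and IDF factors named on slide 10, the Python sketch below scores three toy documents. The formula used (raw term count multiplied by log(N / df)) is a simple textbook variant and an assumption here; real engines use their own refined scoring. The function name and sample texts are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query_terms):
    """Score each document as the sum over query terms of tf * idf.

    tf  = number of times the term occurs in the document
    idf = log(N / df), where df = number of documents containing the term
    (a simple textbook variant, assumed for illustration only).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term occurs nowhere, so it contributes nothing
            score += counts[term] * math.log(n_docs / df)
        scores.append(score)
    return scores

docs = [
    "java search engine",
    "java java tutorial",
    "cooking recipes",
]
scores = tf_idf_scores(docs, ["java", "search"])
# Document positions ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document wins because it contains the rarer term "search", even though the second document repeats "java" twice.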
  • 11. Inverted Index / Text index • Helps in fast retrieval of documents matching terms • Index creation involves a good bit of processing • Different fields in a document can be indexed differently • Indexing is very closely tied to search queries • Text Engines can handle many indexes
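A minimal sketch of an inverted index, assuming whitespace tokenization and lowercasing as the only analysis steps. It also shows the boolean AND, OR and NOT combinations mentioned on slide 10. All function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_and(index, terms):
    """Ids of documents containing every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, terms):
    """Ids of documents containing at least one term."""
    postings = [index.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()

def search_not(index, all_ids, term):
    """Ids of documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

docs = {
    1: "Lucene text search library",
    2: "Solr search server",
    3: "relational database server",
}
index = build_inverted_index(docs)
both = search_and(index, ["search", "server"])
either = search_or(index, ["lucene", "solr"])
without = search_not(index, docs, "search")
```

Lookups touch only the posting sets of the query terms, which is why retrieval stays fast as the number of documents grows.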
  • 12. Popular Java-based Text Search libraries and platforms
  • 13. RDBMS Full-text Search components • Proprietary extensions to SQL to support text search • Pros: Single data source for Apps • Apps can interact with the database only • Cons: Limits on flexibility, portability and perhaps scalability
  • 14. Typical Text Search App architecture • RelationalEntity – TextDoc mappings have to be done properly • Only a subset of data should go to text index • DB is primary datastore • Text searches always hit text index first
  • 15. Location of the Text Search Engine • Library/plugin • Lucene • Hibernate Search • Database Full-text • Oracle Text • MySQL Full-text • Search servers • Solr • Elasticsearch
  • 16. Relational Databases vs Text Engines
  • 17. RDBMS vs Text Engine: Structural Mismatch • Relational databases • many data types • tables represent entities • entities have relationships between them • normalized schema and joins • Text engines • fundamentally the only type is string • flat documents • no relationships between documents • joins between documents are not supported • Relationships have to be flattened and embedded into text documents • duplication of data • can be difficult to implement • relationships can be complex and 2-way too
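The flattening step on slide 17 can be sketched as follows. The book, author and tag entities, their field names, and the `flatten_book` helper are all hypothetical; the point is only that related rows get embedded as strings into one flat document, which duplicates the author name across documents.

```python
def flatten_book(book, authors, tags_by_book):
    """Build one flat text document for a book by embedding related data."""
    author = authors[book["author_id"]]
    return {
        "doc_type": "book",
        "id": str(book["id"]),          # every field becomes a string
        "title": book["title"],
        "author_name": author["name"],  # copied into every book document
        "tags": " ".join(tags_by_book.get(book["id"], [])),
    }

authors = {7: {"name": "Ann Lee"}}
books = [
    {"id": 1, "title": "Text Search Basics", "author_id": 7},
    {"id": 2, "title": "Index Design", "author_id": 7},
]
tags_by_book = {1: ["search", "lucene"]}

docs = [flatten_book(b, authors, tags_by_book) for b in books]
# Renaming this one author would require re-indexing all of these documents.
affected = [d["id"] for d in docs if d["author_name"] == "Ann Lee"]
```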
  • 18. RDBMS vs Text Engine: Sync Mismatch • Data updates have to be performed in two different places • RDBMS and Text Engine • Structural mismatch can make this fragile • change to a single entity can affect many documents • updates occur from many places in the app • Text engines are not transactional like RDBMS • Not all Text Engines are near real-time capable • Elasticsearch focuses on near real-time updates • “commit” for Text engines is expensive
  • 19. RDBMS vs Text Engine: Retrieval Mismatch • Text Engine should typically have only a subset of the full data • Text index is not a database • Too much data in the index makes it slow • Purpose of text index is to provide initial result page(s) • Document type plus entity primary key from the database uniquely identifies a document • Represents an entity (often partial) • Full details can be retrieved from database • Ideally should use at most a single database query per result view
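One way to honor the "single database query per result view" guideline is sketched below with Python's built-in `sqlite3` module. The `product` table, its columns, and the shape of `hits` (document type plus primary key pairs, as slide 19 describes) are assumptions made for the example; the text index itself is simulated by a hard-coded list.

```python
import sqlite3

def fetch_page_details(conn, hits):
    """Load full rows for one page of search hits with a single SQL query.

    `hits` is a list of (doc_type, primary_key) pairs in relevance order.
    Only "product" hits are handled in this sketch.
    """
    ids = [pk for doc_type, pk in hits if doc_type == "product"]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, price FROM product WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Keep the relevance order that the text index produced.
    return [by_id[pk] for pk in ids if pk in by_id]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Red lamp", 20.0), (2, "Blue lamp", 25.0), (3, "Desk", 90.0)],
)

hits = [("product", 2), ("product", 1)]  # stand-in for text index results
details = fetch_page_details(conn, hits)
```

Rows that exist in the index but are missing from the database are silently skipped here; a real application would probably want to log them as a sign of sync problems.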
  • 20. Design Principles for Text Search Apps
  • 21. Design Principles for Text Search apps • We consider regular web apps which have relational databases as the primary data source • User confidence in the search solution is vital • Some principles may require thinking that departs from traditional database apps
  • 22. P1: The most basic searches must work perfectly first Problem: If the app does not return good results for the basic cases, users will lose faith very easily. • E.g.: If an exact Title is entered, user certainly expects it to be listed right on top • Stemming, synonyms etc. must not jeopardize exact matches • Precision is more important than recall • Test cases should cover these elaborately • Make it clear to users that matches are primarily keyword based
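The kind of regression check P1 calls for might look like the sketch below. The ranking rule (exact title first, then number of shared keywords) is a deliberately naive stand-in for a real engine's relevance scoring, and `rank_titles` is a hypothetical helper; the useful part is the expectation that an exact title always lands on top.

```python
def rank_titles(query, titles):
    """Order matching titles so an exact match precedes partial matches."""
    q = query.strip().lower()
    q_terms = set(q.split())

    def sort_key(title):
        t = title.strip().lower()
        is_exact = t == q
        overlap = len(q_terms & set(t.split()))
        # Exact match first, then more shared keywords, then alphabetical.
        return (not is_exact, -overlap, t)

    matches = [t for t in titles if q_terms & set(t.lower().split())]
    return sorted(matches, key=sort_key)

titles = ["Managing Search", "Text Search in Action", "Text Search", "Cooking"]
results = rank_titles("Text Search", titles)
```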
  • 23. P2: Text Indexes should be used for all applicable views of the data (1/2) Problem: Sync mismatch can cause loss of confidence since data showing up in the tables may not be showing up in searches. • The data mismatch may arise due to regular indexing delays or application bugs. • Avoid views built directly from the database tables while bypassing the text index • Detection of indexing issues/errors happens early • corrective action can be taken fast
  • 24. P2: Text Indexes should be used for all applicable views of the data (2/2) • Admin views can have a secondary option to look up the database directly in case of problems • Elasticsearch and latest versions of Solr strive to make index updates available in near real-time
  • 25. P3: Accommodate regular Text index re-creation (1/2) Problem: Index re-creation can be time consuming and involve application downtime. • Improvements and enhancements to text search typically require full index re-creation. • Text indexes may also get out of sync with the primary database store due to errors and bugs. • Text indexes are not as resilient or robust as databases with respect to durability.
  • 26. P3: Accommodate regular Text index re-creation (2/2) • Embrace the need for full index re-creation • Devise ways to do it smoothly on demand and regularly • Strategy 1: Keep alternate indexes in active/passive. Periodically, • re-create the passive and switch it to active mode • switch the old active to passive mode (to be re-created next) • Strategy 2: Store timestamp for every doc at indexing time • re-index all documents using the database data • Remove all docs with timestamp older than the re-index start time
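Both re-creation strategies from slide 26 can be sketched with plain dictionaries standing in for text indexes. Everything here is an assumption for illustration: the class and function names, the dictionary layout, and the injected `now` clock (a counter is used so the example is deterministic).

```python
import itertools

class ActivePassiveIndexes:
    """Strategy 1: rebuild the passive index, then switch roles."""

    def __init__(self):
        self.indexes = {"a": {}, "b": {}}
        self.active = "a"

    @property
    def passive(self):
        return "b" if self.active == "a" else "a"

    def rebuild_and_switch(self, db_rows):
        target = self.passive
        self.indexes[target] = dict(db_rows)  # full re-creation from the DB
        self.active = target                  # old active becomes passive

    def search_index(self):
        return self.indexes[self.active]

def reindex_with_timestamps(index, db_rows, now):
    """Strategy 2: re-index everything, then purge docs this run did not touch."""
    start = now()
    for doc_id, text in db_rows.items():
        index[doc_id] = {"text": text, "indexed_at": now()}
    stale = [d for d, doc in index.items() if doc["indexed_at"] < start]
    for doc_id in stale:
        del index[doc_id]
    return stale

# Strategy 1 usage
pair = ActivePassiveIndexes()
pair.rebuild_and_switch({1: "fresh copy"})

# Strategy 2 usage, with a counter as a fake clock
clock = itertools.count(100)
index = {
    1: {"text": "old", "indexed_at": 1},
    2: {"text": "deleted in db", "indexed_at": 2},
}
removed = reindex_with_timestamps(index, {1: "new", 3: "added"}, lambda: next(clock))
```

With Strategy 2 searches keep running against the same index during the rebuild, at the cost of briefly serving documents that no longer exist in the database.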
  • 27. P4: Indexing and Searches are closely tied - think of both together Problem: Enhancements are needed to the search. Addition of more searchable data is breaking older stuff. • Unlike in the database, index updates are strongly coupled to the types of queries • not viable to do data modeling work first and think of queries later • Strive to limit the amount of indexed data • Bulk indexing is much slower than bulk database loads • Scale out the search servers as data grows • Performance testing is needed with a focus on frequent searches
  • 28. P5: Avoid treating the Text Engine as a relational store Problem: Searches have become really slow as the data has grown. Each subsequent page also takes a long time to load. • Anti-pattern: Direct one-to-one table to doc mapping with “joins” inside the App • Text engines are not relational databases • App joins will tend to collapse as data grows, they may involve many Text engine queries • Strive to make the summary results load directly from the Text Engine • Initial results list page should have minimal fields • Only minimum essential fields have to be in the index • Avoid sorts on many fields, consider faceting instead
  • 29. P6: Avoid wildcards in user input (1/2) Problem: Users are not fully satisfied with keyword based matches. They want partial matches within the keywords too. • Search engines allow wildcards but there are major pitfalls • Relevance is lost, results are returned in arbitrary order similar to SQL “like” or grep • If stemming is in use, stems and not the original terms are present in the index. So wildcards may not give expected matches • E.g. management has Porter stem manag which is what gets into the index. So it no longer matches the wildcard pattern manage*
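The stemming pitfall on slide 29 can be reproduced in a few lines. The `STEMS` table is a hand-written stand-in for a real stemmer and contains only the slide's own example pair; `fnmatchcase` from the standard library plays the role of the engine's wildcard matcher.

```python
from fnmatch import fnmatchcase

# Stand-in for a stemmer: only the slide's example is listed.
STEMS = {"management": "manag"}

def index_term(word):
    """Return the term as it would be stored in a stemmed index."""
    word = word.lower()
    return STEMS.get(word, word)

indexed = index_term("management")
# The wildcard is matched against the indexed stem, not the original word.
wildcard_hits_stem = fnmatchcase(indexed, "manage*")
wildcard_hits_original = fnmatchcase("management", "manage*")
```

The pattern matches the original word but not the stored stem, so a user typing `manage*` would not find a document that visibly contains "management".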
  • 30. P6: Avoid wildcards in user input (2/2) • Make use of auto-suggestion on a small number of important fields as the user types the input • Tends to be quite performant and lightweight if implemented properly • Can usually be implemented with edge n-grams for prefix matches • Try to avoid full n-grams for arbitrary substring matches • Number of edge n-grams is O(L), number of full n-grams is O(L^2)
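The O(L) versus O(L^2) difference is easy to see by generating both kinds of n-grams for one term. The sketch below also builds a tiny prefix index from edge n-grams to show how auto-suggestion could work; the helper names and sample terms are hypothetical.

```python
from collections import defaultdict

def edge_ngrams(term):
    """All prefixes of the term: L grams for a term of length L."""
    return [term[:n] for n in range(1, len(term) + 1)]

def full_ngrams(term):
    """All substrings of the term: L * (L + 1) / 2 grams."""
    length = len(term)
    return [term[i:j] for i in range(length) for j in range(i + 1, length + 1)]

def build_prefix_index(terms):
    """Map every edge n-gram to the terms that start with it."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term.lower()):
            index[gram].add(term)
    return index

def suggest(index, typed):
    """Suggestions for what the user has typed so far, alphabetically."""
    return sorted(index.get(typed.lower(), set()))

edge_count = len(edge_ngrams("search"))
full_count = len(full_ngrams("search"))

prefix_index = build_prefix_index(["Solr", "Search", "Lucene"])
suggestions = suggest(prefix_index, "s")
```

For the six-letter term "search" this gives 6 edge n-grams against 21 full n-grams, and the gap widens quadratically for longer terms.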
  • 31. Popular form of Auto-suggestion today
  • 32. P7: Analyze and improve Problem: Things are evolving rapidly and data volumes are increasing. It is hard to keep pace and improve performance and user experience. • Logs should be regularly analyzed for user behavior • Performance testing needs to be done at higher loads • Platform upgrades may be a reality • Rate-limiting needs to be implemented • But changes need to be resisted too …
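Two of the P7 items lend themselves to a short sketch: rate-limiting searches and analyzing query logs. The token bucket below is one common rate-limiting technique, chosen here as an assumption since the slide does not name one. The class, the `top_queries` helper, and the injectable clock are hypothetical.

```python
import time
from collections import Counter

class TokenBucket:
    """Allow up to `capacity` searches in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.now = now
        self.tokens = float(capacity)
        self.last = now()

    def allow(self):
        current = self.now()
        elapsed = current - self.last
        self.last = current
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def top_queries(queries, n):
    """Most frequent normalized queries from a search log."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(n)

# Deterministic usage with a fake clock held in a one-element list.
fake_time = [0.0]
bucket = TokenBucket(rate_per_sec=1, capacity=2, now=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_time[0] = 1.0
after_wait = bucket.allow()

popular = top_queries(["Solr", "solr ", "lucene", "", "Solr"], 1)
```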
  • 33. Conclusion • Text search is still evolving rapidly • Lucene is 12+ years old but is still very active • along with Solr and Elasticsearch • Cloud apps and high traffic websites need to scale up constantly • Relational database backends are not going away soon • Good Text search designs will continue to help • Enterprise search is now expanding to real-time analytics
  • 34. References • Hibernate Search in Action, Manning Publishers • http://www.elasticsearch.org • http://www.lucidworks.com/
  • 35. Thank You © 2013 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.