Why Search?
(starring Elasticsearch)
Doug Turnbull
OpenSource Connections

Hello
• Me
@softwaredoug
dturnbull@o19s.com
• Us
http://o19s.com
World class search consultants
Right here in C’ville!
Hiring passionate interns!
Why Search?
• What does a dedicated search engine do?
o that a database doesn’t?

• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?

Why not MySQL?
• We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:
PostID | UserId | CreationDate | ViewCount | Body
0 | 1 | 2011-01-11T20:52:46.753 | 124 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1 | 2 | 2013-02-01T12:44:46.525 | 525 | <p>Been meaning to read the Foundation Series, what should I read first?</p>

Why not MySQL?
• Our mission: Find all the “Darth Vader” in SciFi StackExchange Posts!
P | U | C | V | Body
0 | 1 | 2 | 1 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p> -> Found!
1 | 2 | 2 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p> -> Missing!

Why not MySQL – SQL LIKE?
• SQL “LIKE” operator – scan all rows for a specific wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Performs a table scan: every row is tested (Match? Match? Match? Match?)
• Approx 300ms to search a measly 20K docs! (what if we had 20 million?)
SQL LIKE – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
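
A minimal Python sketch of what a LIKE "%...%" scan amounts to. Assumptions: an in-memory list stands in for the posts table, like_scan is a hypothetical helper (not a MySQL API), and matching is case-insensitive to mirror MySQL's default case-insensitive collations.

```python
# Two post bodies taken from the example table; the list index plays the role of PostID.
posts = [
    "<p>What exactly did Obiwan know about Anakin and Darth Vader "
    "before a New Hope started?</p>",
    "<p>Been meaning to read the Foundation Series, what should I read first?</p>",
]


def like_scan(rows, needle):
    """Mimic LIKE '%needle%': test every row for a literal substring."""
    needle = needle.lower()
    # Every row is visited, so the cost grows linearly with the table size.
    return [post_id for post_id, body in enumerate(rows) if needle in body.lower()]


print(like_scan(posts, "darth vader"))   # exact phrase is found in post 0
print(like_scan(posts, "vader, darth"))  # same words, different order: nothing found
```

The scan only ever answers "does this exact string occur?", which is why reordered words and alternate spellings come back empty.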

SQL LIKE – other problems
• No ranking of results – given these two docs:

I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
- Directly about Darth Vader

One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
- Darth Vader is a side topic here

Which should come first?
SQL LIKE | CTRL+F | grep is
1. Extremely slow

2. Not fuzzy -- needs exact literal matches, no fuzziness!

3. Unranked -- simply says y/n whether there is a match

Search needs to be
1. FAST! A data structure that can efficiently take
search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy
modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.

Let's play with an implementation
• Your database’s full text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom

• Lucene -> Elasticsearch (Lucene sits underneath both Solr and Elasticsearch)
• Lucene, 1999, by Doug Cutting
o Java library for search
• Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong
• Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y
Elasticsearch
• Create an index

curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'

What is being built?
The answer can be found in your textbook…
Book Index:
• Topics -> page no
• Very efficient tool – compare to
scanning the whole book!

Lucene uses an index:
• Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
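
The token => document ids idea can be sketched in a few lines of Python. This is an illustration only, not Lucene's actual on-disk structure; build_index and the three sample documents are hypothetical, and tokenizing is just lowercasing plus splitting on whitespace.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}


# Made-up documents, keyed by document id.
docs = {
    0: "lightsaber duel",
    2: "laser light show",
    4: "laser cannon",
}

index = build_index(docs)
print(index["laser"])  # one dictionary lookup instead of scanning every document
```

Finding the documents for a term is now a single lookup, like flipping to the back of the book instead of rereading every page.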

Computers == Dumb
• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh – jump
to page 9

• Computers are dumb,
o “CAT” != “cat” – no match returned
o “cat” != “cats” – no match returned

• Hence, when indexing, normalize text to more
searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
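
A rough sketch of that kind of normalization, assuming a toy suffix stripper is good enough for illustration. toy_stem is hypothetical and is NOT the Porter/Snowball algorithm that real analyzers use; it just handles the three examples above plus case folding.

```python
def toy_stem(word):
    """Lowercase a word and crudely strip a plural or past-tense suffix."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo a doubled final consonant: "fitt" -> "fit".
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


for raw in ["CAT", "cats", "fitted", "alumnus"]:
    print(raw, "->", toy_stem(raw))
```

Note that "alumnu" is not a real word, and it doesn't need to be: the only requirement is that the indexed text and the query text end up as the same token.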

Normalization aka Text Analysis
• Raw input
o <p>Darth Vader dined with Luke</p>

• Filtered (char filter)
o Darth Vader dined with Luke

• Tokenized
o [Darth] [Vader] [dined] [with] [Luke]

• Token filters (lowercased, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]

• Most importantly: this is highly configurable
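
The chain above (char filter, tokenizer, token filters) can be sketched in plain Python. Everything here is an assumption for illustration: analyze is a hypothetical function, the stopword list is a tiny made-up one, and the STEMS lookup table stands in for a real stemmer.

```python
import re

STOPWORDS = {"with", "the", "a", "an", "and", "of"}  # assumed, deliberately tiny
STEMS = {"dined": "dine"}  # stand-in lookup table, not a real stemming algorithm


def analyze(text):
    """Run text through a char filter, a tokenizer, then token filters."""
    text = re.sub(r"<[^>]+>", " ", text)                 # char filter: strip HTML tags
    tokens = re.findall(r"\w+", text)                    # tokenizer: split into words
    tokens = [t.lower() for t in tokens]                 # token filter: lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # token filter: drop stopwords
    return [STEMS.get(t, t) for t in tokens]             # token filter: stem


print(analyze("<p>Darth Vader dined with Luke</p>"))
```

Each stage is a separate step on purpose: swapping the stopword list or the stemmer changes what matches without touching the rest of the chain.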
Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{
  "tokens": [
    {
      "end_offset": 5,
      "position": 1,
      "start_offset": 0,
      "token": "darth",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 11,
      "position": 2,
      "start_offset": 6,
      "token": "vader",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 17,
      "position": 3,
      "start_offset": 12,
      "token": "dine",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 27,
      "position": 5,
      "start_offset": 23,
      "token": "luke",
      "type": "<ALPHANUM>"
    }
  ]
}

What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'

Ranking
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'

Can we store anything here (in each doc's <metadata>) to help decide how relevant this term is for this doc?

Yes!
• Term frequency
o How much “darth” is in this doc?
• Position within document
o Helps when we search for the phrase “darth vader”
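
One way to picture that per-document metadata is a postings map of term -> {doc id: [positions]}. This is a sketch under assumptions: build_postings and phrase_match are hypothetical helpers, the token lists are the already-analyzed example docs, and positions are plain list offsets (a real engine may leave gaps where stopwords were removed).

```python
from collections import defaultdict


def build_postings(docs):
    """term -> {doc_id: [positions]}; term frequency is len(positions)."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            postings[token].setdefault(doc_id, []).append(position)
    return postings


def phrase_match(postings, first, second):
    """Doc ids where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in postings.get(first, {}).items():
        following = postings.get(second, {}).get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits


# Already-analyzed tokens for the two example documents.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}

postings = build_postings(docs)
print(len(postings["darth"][1]))                 # term frequency of "darth" in doc 1
print(phrase_match(postings, "darth", "vader"))  # docs containing the phrase
```

Both docs contain "darth", but only doc 1 has "vader" in the very next position, which is what a phrase search needs to know.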
Query Documents
• When did Darth Vader and Luke have dinner?
• User query: luke darth dinner
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'

What happens when we query?
1. User query: luke darth dinner
2. Analysis: [luke] [darth] [dine]
3. How to consult index for matches? Look up each term:
o [darth] -> score for [darth] docs (1 and 2)
o [dine] -> score for [dine] docs (1)
o ...
4. Return sorted docs to client

Index consulted:
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
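
The lookup-score-sort flow can be sketched as follows. Assumptions: INDEX and search are hypothetical, the query terms are taken as already analyzed the way the slide shows them, and the score is simply the sum of raw term frequencies, which is a simplification of Lucene's real scoring formula.

```python
from collections import Counter

# term -> {doc_id: term_frequency} for the two example documents.
INDEX = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
    "luke": {1: 1},
}


def search(index, query_terms):
    """Look up each term, add up per-document scores, return ids best-first."""
    scores = Counter()
    for term in query_terms:
        for doc_id, term_frequency in index.get(term, {}).items():
            scores[doc_id] += term_frequency
    return [doc_id for doc_id, _ in scores.most_common()]


# Analysis of the user query is assumed to have produced these terms already.
print(search(INDEX, ["luke", "darth", "dine"]))
```

Doc 1 matches all three terms and doc 2 only matches [darth], so doc 1 sorts first. No document body is ever scanned at query time.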
So Elasticsearch!
• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation

• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not
shown here)

• FRUITFUL!
o Relevant documents are scored based on relative term
frequency

BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
o Rank file directory by proximity to current directory
o Geographic-aided search, rank based on distance and
search relevancy
o Q & A systems – Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!

• And many features!
o Spellchecking
o Facets
o More-like-this document

QUESTIONS?
