De Bitmanager, 2016
You Know, for Search
Peter van der Weerd
Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed at De Bitmanager
• Enormous span of control 
Search
• Common sense:
Easy
Solved
Yeah, true…
• Install ES
• Fill it with some data
• And o/: we can search
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search
results.
Search as a toolbox
• It consists of 1 or more(!) tools to find what
you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
Search at Booking
• Destination based (city, region, airport, etc)
• Autocomplete
Results in max 5 destinations, query per
keystroke
• Disambiguation
Show a partitioned result that enables people
to choose a destination
Autocomplete in action
Disambiguation in action
Scoring
• Lucene generally scores like: tf * idf
• Tf = term frequency
The more often a term occurs in a document, the more important
• Idf = inverse document frequency
The more documents contain the term,
the less important
Term frequency
• Used to give more importance to terms that
occur relatively often.
• Scoring examples for ‘house’
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
(ordered by score, highest first)
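The ordering of these titles can be reproduced with a small sketch. It assumes a square-root term frequency and a 1/sqrt(field length) norm, similar in spirit to Lucene's classic similarity; the function name is hypothetical and real Lucene scores will differ in their exact values.

```python
import math

def classic_field_score(field_text: str, term: str) -> float:
    """sqrt(tf) / sqrt(field length): an assumed, simplified tf and length norm."""
    tokens = field_text.lower().split()
    tf = tokens.count(term.lower())
    if tf == 0:
        return 0.0
    return math.sqrt(tf) / math.sqrt(len(tokens))

titles = [
    "House",
    "The house",
    "The little house on the prairie",
    "The little house on the prairie blah blah blah",
]
scores = [classic_field_score(title, "house") for title in titles]
for title, score in zip(titles, scores):
    print(f"{score:.3f}  {title}")
```

The term occurs once in every title, so the differences come purely from the field length: the shorter the field, the larger the share the matched term has in it.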
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it only serves
to weigh multiple tokens relative to each other
• Examples:
house
little
on
the
(ordered by score, highest first)
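A minimal sketch of how idf separates these tokens. The formula 1 + ln(N / (df + 1)) is one common variant and is an assumption here, as are the collection size and the document frequencies.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Assumed idf variant: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 1000  # hypothetical collection size
doc_freqs = {"house": 20, "little": 80, "on": 600, "the": 950}  # hypothetical counts

weights = {token: idf(df, NUM_DOCS) for token, df in doc_freqs.items()}
ranked = sorted(weights, key=weights.get, reverse=True)
for token in ranked:
    print(f"{weights[token]:.2f}  {token}")
```

With a single query token every candidate document gets the same idf factor, so the ranking does not change; the weights only matter once several tokens compete.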
Drawback of idf
• Another example…
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used
df instead…
(ordered by score, highest first)
When does idf work
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed
over shards
(or use dfs_query_then_fetch)
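Why the distribution over shards matters can be shown with a small simulation. The shard sizes, the document frequencies and the idf variant are assumptions; the point is only that the same term gets a different weight per shard unless global statistics are used, which is what the dfs_query_then_fetch search type gathers before scoring.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Assumed idf variant: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (number of docs, docs containing the term)
shards = {"shard_0": (1000, 99), "shard_1": (1000, 9)}

# Each shard scores with its own local statistics ...
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# ... unless term statistics are combined over all shards first.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)

print(local_idf, global_idf)
```

With unevenly distributed documents, a hit from shard_1 would get a noticeably higher idf than an identical hit from shard_0.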
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are thousands of them.
• Even worse:
What to deliver for query ‘p’ or ‘pa’?
Record boost
• Based on
Popularity
From where booked
Language
  o Same (doc language == site language)
  o Local translations
  o English
  o Mismatch
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!)
search term
Same for popularity: people are typically not
searching for unpopular things
• Example (from an english site):
amsterdam->amsterdam english popular
But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores can vary heavily,
sometimes by more than 10x
• So let's take 10 as max record-boost
But now the record-boost might outweigh smaller
scores
• Argggggg….
Score ranges
• Difficult to tinker with:
For instance use a stemmed token with boost 0.5
house^1.0 vs houses^0.5
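What if the raw Lucene score of the boosted-down variant is more
than 2 times higher than that of the exact token? Then the 0.5
boost no longer keeps it below the exact match.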
• We are doing entity search vs text search
Different scorers
Querying for ‘house’:
Title                             Score:default  Score:BM25  Score:custom
House                             1.22           0.77        1.20
The house                         0.76           0.61        1.10
The little house on the prairie   0.46           0.39        1.05
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the
base score
Idf is normalized between 0 .. 0.2 and added to the
base score
Giving a score varying between 1 and 1.4 per term
(sometimes we don’t use idf)
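The normalization above can be sketched as a small function. It assumes the tf and idf parts have already been scaled to 0..1 by some earlier step (how that is done is not specified here); the function name and parameters are hypothetical.

```python
BASE_SCORE = 1.0
TF_RANGE = 0.2   # tf contributes 0 .. 0.2
IDF_RANGE = 0.2  # idf contributes 0 .. 0.2

def normalized_term_score(tf_norm: float, idf_norm: float = 0.0) -> float:
    """Score of one term: base 1.0 plus bounded tf and idf contributions.

    tf_norm and idf_norm are assumed to be pre-scaled to 0..1.
    Leave idf_norm at 0.0 when idf is not used.
    """
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    return BASE_SCORE + TF_RANGE * clamp(tf_norm) + IDF_RANGE * clamp(idf_norm)

print(normalized_term_score(0.0))        # lowest possible term score
print(normalized_term_score(1.0, 1.0))   # highest possible term score
```

Because every term lands in a narrow, known range, boosts such as 0.5 for a stemmed token or a record boost can be chosen without being swamped by the raw score.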
Language boosting
• Same language or english: +0.7
• Local language: +0.3
(Roma vs Rome in an English site)
• Mismatched language: -0.3
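A sketch of these language boosts as an additive term. The language codes, the `local_langs` parameter and the function name are assumptions for illustration; only the three boost values come from the slide.

```python
def language_boost(doc_lang: str, site_lang: str, local_langs=frozenset()) -> float:
    """Additive boost for a name, depending on its language."""
    if doc_lang == site_lang or doc_lang == "en":
        return 0.7   # same language as the site, or English
    if doc_lang in local_langs:
        return 0.3   # name in the local language of the destination
    return -0.3      # mismatched language

# Hypothetical example: names for Rome on an English site, local language Italian.
term_score = 1.2
for name, lang in [("Rome", "en"), ("Roma", "it"), ("Rom", "de")]:
    print(name, term_score + language_boost(lang, "en", local_langs={"it"}))
```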
About N-grams
• For auto-complete: left-edge N-Grams
• Rome:
rome
rom
ro
r
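Generating the left-edge N-grams for a token is straightforward; a minimal sketch (the function name and the `min_len` parameter are hypothetical):

```python
def left_edge_ngrams(token: str, min_len: int = 1) -> list[str]:
    """All prefixes of the token, longest first."""
    token = token.lower()
    return [token[:n] for n in range(len(token), min_len - 1, -1)]

print(left_edge_ngrams("Rome"))
```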
About N-grams
• When a user types ‘ro’…
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match
(or Levenshtein distance)
(ordered by score, highest first)
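Scoring by percentage of match can be sketched as the typed length divided by the length of the full name. The accent folding step is an assumption (so that ‘ro’ matches ‘Ródos’), and the function names are hypothetical.

```python
import unicodedata

def fold(text: str) -> str:
    """Lowercase and strip accents (assumed analysis step)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def prefix_match_score(typed: str, name: str) -> float:
    """Fraction of the name that is covered by the typed prefix."""
    typed, name = fold(typed), fold(name)
    if not typed or not name.startswith(typed):
        return 0.0
    return len(typed) / len(name)

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: prefix_match_score("ro", n), reverse=True)
print(ranked)
```

Shorter names win because the typed characters cover a larger part of them.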
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0
Score is the max. score
Tiebreaker=1
Score is the sum of all the individual scores
(same behavior as boolean or)
Dismax example
• Q= the house
Suppose S[the] = 0.8, S[house]=1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28
This makes documents containing ‘the house’ a
little more important than those containing only ‘house’.
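The dismax formula and the numbers of this example can be checked with a few lines; the function name is hypothetical.

```python
def dismax(scores: list[float], tie_breaker: float = 0.0) -> float:
    """Score = max + (sum - max) * tieBreaker."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

term_scores = [0.8, 1.2]  # S[the], S[house]
for tie_breaker in (1.0, 0.0, 0.1):
    print(tie_breaker, round(dismax(term_scores, tie_breaker), 2))
```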
Difficulties
• Lack of context
• Hard to create a reliable scoring model
Different approach
• Canonical name:
 Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
Weighting fields
• All fields are equal but some fields are more
equal than others…
Self name is most important
Other names (like the city where a hotel resides)
are less important
• Dismax over self name and other
Payload
• Small piece of information that is added to
every occurrence
• Basically a byte[]
Nowadays: payloads
• We need more information per occurrence of
a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32 bit
integer, and indexed as a payload
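A sketch of how such information could be packed into a 32 bit integer and turned into a byte[] payload. The bit layout (8 bits for the length, 1 bit for the self-name flag, 4 bits for the type) and the list of types are assumptions for illustration, not the layout actually used.

```python
import struct

NAME_TYPES = ("hotel", "city", "landmark", "region", "airport")  # hypothetical ids

def encode_payload(token_len: int, is_self_name: bool, name_type: str) -> bytes:
    """Pack per-occurrence info into 4 bytes (assumed layout):
    bits 0-7 token length, bit 8 self-name flag, bits 9-12 name type."""
    if not 0 <= token_len <= 0xFF:
        raise ValueError("token_len must fit in 8 bits")
    type_id = NAME_TYPES.index(name_type)
    value = token_len | (int(is_self_name) << 8) | (type_id << 9)
    return struct.pack(">I", value)

def decode_payload(payload: bytes) -> tuple[int, bool, str]:
    """Inverse of encode_payload."""
    (value,) = struct.unpack(">I", payload)
    return value & 0xFF, bool((value >> 8) & 1), NAME_TYPES[(value >> 9) & 0xF]

payload = encode_payload(token_len=14, is_self_name=True, name_type="hotel")
print(payload.hex(), decode_payload(payload))
```

A scorer that reads this payload knows, for each matched occurrence, how much of the original token was matched and whether the match was in the self name, without consulting other fields.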
Dismax vs payload
• With fieldinfo in the payload we can simulate
dismax behavior
• We query only 1 index-field (instead of 5)
• Context: easier to do advanced scoring: all info
is in 1 scorer.
• Payloads *are* possible in Elasticsearch, but
more difficult to use
Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all
Suits
• Reasons for people to wear a suit might
include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc

Combining fields
• To prevent double counting, a dismax is
advised.
• The fact that a term occurs in both the title and
the abstract doesn’t make it roughly twice as
important.
But it does make it somewhat more important
Combining fields
• Intuitive reaction: query terms that occur close
to each other are more important…
• Example: search for a book:
chamber secrets rowling
• Expected top result:
Harry Potter and the Chamber of Secrets/J.K.
Rowling
Combining fields
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
• More important if in the same field?
Combining fields
• But: we get an excerpt book that contains the
requested terms
(all terms were present in the abstract field)
• Phrases behave even worse
Combining fields
• Suppose:
 we have 2 fields: F1 and F2
 2 query terms: qt1 and qt2
• Now we have choices how to combine…
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
 this will prefer records where both terms are
found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
 this behaves more like there were no
fields
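The difference between the two combinations can be made concrete with a simplified sketch. It assumes every matched term scores exactly 1.0 per field (no tf, idf or field length), treats ‘|’ as a sum, uses a tie-breaker of 0.1 and shortens the abstract of the second book; all function names are hypothetical.

```python
import re

def dismax(scores, tie_breaker):
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def fields_first(doc, terms, tie_breaker):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): sum per field, dismax over fields."""
    per_field = [float(len(terms & tokens(text))) for text in doc.values()]
    return dismax(per_field, tie_breaker)

def terms_first(doc, terms, tie_breaker):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): dismax per term, sum over terms."""
    return sum(
        dismax([1.0 if t in tokens(text) else 0.0 for text in doc.values()], tie_breaker)
        for t in terms
    )

excerpt = {
    "author": "De Bitmanager",
    "title": "Excerpt book",
    "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling",
}
book = {
    "author": "J.K. Rowling",
    "title": "Harry Potter and the Chamber of Secrets",
    "abstract": "Fresh torments and horrors arise",
}
query = {"chamber", "secrets", "rowling"}

for name, doc in (("excerpt", excerpt), ("book", book)):
    print(name, fields_first(doc, query, 0.1), terms_first(doc, query, 0.1))
```

With the first form the excerpt book wins, because all three terms sit in one field. With the second form both books score the same under these unit scores, so the remaining scoring factors decide.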
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253,
"author": "J.K. Rowling",
"title": "Harry Potter and the Chamber of Secrets",
"abstract": "Fresh torments and horrors arise, including an outrageously stuck-up
new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle
who haunts the girls' bathroom."
"_score": 2.0767038,
"author": "De Bitmanager",
"title": "Excerpt book",
"abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand
(blending)
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
  o Search ‘rowling’ anywhere, score 1
  o Search ‘potter’ anywhere, score 1
  o Combine with additional queries to do a finishing touch
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making
explain part of your infrastructure
• At least expose the scores in debug mode.
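One way to make explain part of the infrastructure is to let the code that builds every search request switch it on in debug mode. A minimal sketch that only builds the request body; the helper name is hypothetical, and it relies on the `explain` flag of the Elasticsearch search request body.

```python
def build_search_body(query: dict, debug: bool = False) -> dict:
    """Wrap a query in a search body; in debug mode ask for score explanations."""
    body = {"query": query}
    if debug:
        # Each hit then carries an explanation of how its score was computed.
        body["explain"] = True
    return body

print(build_search_body({"match": {"title": "house"}}, debug=True))
```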
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not
imply that:
I am trustworthy
I am competent
You Know, for Bits…
Peter @ bitmanager.nl

XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 

You know, for search

  • 1. De Bitmanager, 2016 You Know, for Search Peter van der Weerd
  • 2. Who am I? • Peter van der Weerd • Search specialist • Self employed Bitmanager • Enormous span of control
  • 3. Search • Common sense: Easy, Solved
  • 4. Yeah, true… • Install ES • Fill it with some data • And o/: we can search
  • 5. But… • Are the users satisfied? • Many people struggle with sub-optimal search results.
  • 6. Search as a toolbox • It consists of 1 or more(!) tools to find what you need: Searchbox; Faceting (intersecting); Sorting; More like this; Not more like this (this is not what I mean); Etc…
  • 7. Search at Booking • Destination based (city, region, airport, etc.) • Autocomplete: results in max 5 destinations, query per keystroke • Disambiguation: show a partitioned result that enables people to choose a destination
  • 11. Scoring • Lucene scores, in general, as tf * idf • Tf = term frequency: the more often a term matches, the more important • Idf = inverse document frequency: the more documents match the term, the less important
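The tf * idf idea can be sketched in a few lines. This is a simplified textbook variant, not Lucene's exact formula: the corpus, the length-relative `tf` and the plain logarithmic `idf` are assumptions made for illustration.

```python
import math
from collections import Counter

def tf(term, tokens):
    """Term frequency relative to document length (simplified, not Lucene's formula)."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Hypothetical toy corpus of titles, tokenized on whitespace.
corpus = [title.lower().split() for title in [
    "House",
    "The house",
    "The little house on the prairie",
    "The garden",
    "The barn",
]]

for tokens in corpus[:3]:
    print(" ".join(tokens), round(tf_idf("house", tokens, corpus), 3))
```

With this toy corpus the shortest title scores highest for 'house', and 'little' gets a higher idf than 'house', which in turn beats 'the'.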
  • 12. Term frequency • Used to give more importance to relatively frequently occurring terms. • Scoring examples for ‘house’ (listed from highest to lowest score): House; The house; The little house on the prairie; The little house on the prairie blah blah blah
  • 13. Inverse document frequency • Prefers less frequent tokens. • Useless on single-token queries: it is only used to score multiple tokens relative to each other • Examples (listed from highest to lowest score): house; little; on; the
  • 14. Drawback of idf • Other example (listed from highest to lowest score): Pekela; Haarlem; Amsterdam; Paris • Booking switched off idf, but could have used df instead…
  • 15. When does idf work • Idf typically works for large text-like queries. • The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)
  • 16. Is tf * idf enough? • Well, no… • What to deliver on a query for ‘Paris’? The city (ehm, there are several cities named Paris)? Airports? Hotels? Which one? There are 1000’s of them. • Even worse: what to deliver for query ‘p’ or ‘pa’?
  • 17. Record boost • Based on: Popularity; From where booked; Language: o Same (doc language == site language) o Local translations o English o Mismatch
  • 18. + or x? • Boosts are implemented by adding • Intuitive justification: language could be seen as yet another (implicit!) search term. Same for popularity: people are typically not searching for unpopular things • Example (from an English site): amsterdam -> amsterdam english popular
  • 19. But wait… • How big should the record-boost be? 0..1? 100? • Lucene score might vary heavily, sometimes more than 10x different • So let’s take 10 as max record-boost. But now the record-boost might outweigh smaller scores • Argggggg….
  • 20. Score ranges • Difficult to tinker with: for instance, use a stemmed token with boost 0.5: house^1.0 vs houses^0.5. What if the Lucene score is more than 2 times higher than the stem itself? • We are doing entity search vs text search
  • 21. Different scorers • Querying for ‘house’:
      Title                             Score:default   Score:BM25   Score:custom
      House                             1.22            0.77         1.20
      The house                         0.76            0.61         1.10
      The little house on the prairie   0.46            0.39         1.05
  • 22. Normalizing scores • Goal: each term is scored around 1.0 • Base score 1.0 • Tf is normalized between 0 .. 0.2 and added to the base score • Idf is normalized between 0 .. 0.2 and added to the base score • Giving a score varying between 1 and 1.4 per term (sometimes we don’t use idf)
  • 23. Language boosting • Same language or English: +0.7 • Local language: +0.3 (Roma vs Rome in an English site) • Mismatched language: -0.3
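The normalized per-term score and the additive language boost from the two slides above could be combined roughly as follows. How tf and idf are mapped onto 0..1 is not stated in the slides, so `tf_norm` and `idf_norm` are assumed to arrive pre-normalized; the function names and the boost keys are hypothetical, and the popularity boost is left out.

```python
def clamp01(x):
    """Keep a normalized input inside 0..1."""
    return max(0.0, min(1.0, x))

def term_score(tf_norm, idf_norm=None, base=1.0, tf_weight=0.2, idf_weight=0.2):
    """Base score 1.0, plus up to 0.2 for tf and (optionally) up to 0.2 for idf."""
    score = base + tf_weight * clamp01(tf_norm)
    if idf_norm is not None:  # the slides note that idf is sometimes not used
        score += idf_weight * clamp01(idf_norm)
    return score

# Boost values as listed on the language boosting slide; the keys are made up.
LANGUAGE_BOOST = {"same": 0.7, "english": 0.7, "local": 0.3, "mismatch": -0.3}

def record_score(term_scores, language_match):
    """Boosts are added, as if language were one more (implicit) search term."""
    return sum(term_scores) + LANGUAGE_BOOST[language_match]

print(record_score([term_score(1.0, 0.5)], "same"))
print(record_score([term_score(1.0, 0.5)], "mismatch"))
```

Because every term lands between 1.0 and 1.4, a boost of 0.7 behaves like a bit less than one extra matched term, which keeps it from drowning out the text match.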
  • 24. About N-grams • For auto-complete: left-edge N-grams • Rome: rome, rom, ro, r
  • 25. About N-grams • When a user types ‘ro’ (listed from highest to lowest score): Rome; Ródos; Rotterdam; etc. • Score depends on percentage of match (or Levenshtein distance)
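A minimal sketch of left-edge N-grams and of scoring by percentage of match. The accent folding (so that ‘ro’ matches Ródos) and all function names are assumptions made for this illustration; the limit of 5 follows the autocomplete slide.

```python
import unicodedata

def fold(text):
    """Lowercase and strip accents (assumed here so that 'ro' can match 'Ródos')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edge_ngrams(name):
    """Left-edge N-grams, longest first: rome, rom, ro, r."""
    folded = fold(name)
    return [folded[:n] for n in range(len(folded), 0, -1)]

def prefix_score(typed, name):
    """Fraction of the name covered by the typed prefix; 0.0 when it does not match."""
    prefix, folded = fold(typed), fold(name)
    if not prefix or not folded.startswith(prefix):
        return 0.0
    return len(prefix) / len(folded)

def autocomplete(typed, names, limit=5):
    """Return at most `limit` matching names, best percentage of match first."""
    hits = [(prefix_score(typed, n), n) for n in names]
    hits = [(score, n) for score, n in hits if score > 0]
    hits.sort(key=lambda pair: -pair[0])
    return [n for _, n in hits[:limit]]

print(edge_ngrams("Rome"))
print(autocomplete("ro", ["Rotterdam", "Paris", "Ródos", "Rome"]))
```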
  • 26. Original approach • Multiple fields (name, city, region, etc.) • Combining them by a weighted dismax query
  • 27. Dismax query • More subtle way of combining scores. • Score = max + (sum - max) * tieBreaker. In words: the max plus a percentage of the others • Edge cases: Tiebreaker=0: score is the max score. Tiebreaker=1: score is the sum of all the individual scores (same behavior as boolean or)
  • 28. Dismax example • Q = the house. Suppose S[the] = 0.8, S[house] = 1.2 • Scores for different tiebreakers: Bool score (tiebreaker=1): 2.0; Max score (tiebreaker=0): 1.2; Score with tiebreaker=0.1: 1.28. This makes documents containing ‘the house’ a little bit more important than ‘house’ only.
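The dismax formula is small enough to check by hand. The sketch below applies it to the scores of the example (S[the] = 0.8, S[house] = 1.2); the function name is chosen for illustration.

```python
def dismax(scores, tie_breaker=0.0):
    """Score = max + (sum - max) * tieBreaker."""
    scores = list(scores)
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

term_scores = {"the": 0.8, "house": 1.2}
for tie in (0.0, 0.1, 1.0):
    print(tie, round(dismax(term_scores.values(), tie), 2))
```

For a tiebreaker of 0.1 this gives 1.2 + (2.0 - 1.2) * 0.1 = 1.28, the value on the slide.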
  • 29. Difficulties • Lack of context • Hard to create a reliable scoring model
  • 30. Different approach • Canonical name: Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands • Self name (indexed): Hotel V Frederiksplein • Rest (indexed): Amsterdam, Noord-Holland, Netherlands
  • 31. Weighting fields • All fields are equal but some fields are more equal than others… Self name is most important. Other names (like the city where a hotel resides) are less important • Dismax over self name and other
  • 32. Payload • Small piece of information that is added to every occurrence • Basically a byte[]
  • 33. Nowadays: payloads • We need more information per occurrence of a token: length of the original token; self-name or other location info; type of the name (hotel, city, landmark, etc.) • All the above info is encoded in a 32-bit integer, and indexed as a payload
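Packing the three pieces of information into a 32-bit integer might look like the sketch below. The slides do not give the actual bit layout, so the layout here (8 bits for the token length, 1 bit for the self-name flag, 8 bits for the type id), the `NAME_TYPES` table and the function names are all hypothetical.

```python
# Hypothetical type table; the slides only mention hotel, city, landmark, etc.
NAME_TYPES = ["hotel", "city", "region", "airport", "landmark"]

def encode_payload(token_length, is_self_name, name_type):
    """Pack per-occurrence info into 4 bytes (assumed layout, see comments)."""
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    type_id = NAME_TYPES.index(name_type)
    value = (type_id << 9) | (int(is_self_name) << 8) | token_length
    # bits 0-7: original token length, bit 8: self-name flag, bits 9-16: type id
    return value.to_bytes(4, "big")

def decode_payload(payload):
    """Unpack the 4-byte payload back into (length, is_self_name, name_type)."""
    value = int.from_bytes(payload, "big")
    return value & 0xFF, bool((value >> 8) & 1), NAME_TYPES[(value >> 9) & 0xFF]

payload = encode_payload(len("frederiksplein"), True, "hotel")
print(payload.hex(), decode_payload(payload))
```

A scorer that reads this payload has the field context and the original token length at hand for every match, which is what makes a single index field sufficient.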
  • 34. Dismax vs payload • With field info in the payload we can simulate dismax behavior • We query only 1 index-field (instead of 5) • Context: easier to do advanced scoring: all info is in 1 scorer. • Payloads *are* possible in Elasticsearch, but more difficult to use
  • 35. Search • Difficult • Sensitive equilibrium • Impossible to serve them all
  • 37. Suits • Reasons for people to wear a suit might include: hiding the fact that you cannot trust them; hiding their incompetence; etc.
  • 38. Combining fields • To prevent double counting, a dismax is advised. • The fact that a term occurs in both the title and the abstract doesn’t make it roughly twice as important. But it does make it somewhat more important
  • 39. Combining fields • Intuitive reaction: query terms in each other’s neighborhood are more important… • Example: search for a book: chamber secrets rowling • Expected top result: Harry Potter and the Chamber of Secrets/J.K. Rowling
  • 40. Combining fields • Result 1: "_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling" • Result 2: "_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom." • More important if in the same field?
  • 41. Combining fields • But: we get an excerpt book that contains the requested terms (all terms were present in the abstract field) • Phrases behave even worse
  • 42. Combining fields • Suppose: we have 2 fields: F1 and F2; 2 query terms: qt1 and qt2 • Now we have choices how to combine…
  • 43. Combining fields • (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): this will prefer records where both terms are found in the same field • (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): this behaves more as if there were no fields
  • 44. Combining fields • (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2) • Result 1: "_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling" • Result 2: "_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
  • 45. Combining fields • (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2) • Result 1: "_score": 2.1447253, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom." • Result 2: "_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
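The two ways of combining fields can be compared on made-up numbers. In the sketch below the per-field term scores are hypothetical (title and author matches are assumed to score slightly higher than abstract matches because those fields are shorter), and `|` is read as a plain sum; only the resulting order is meant to mirror the two result lists above, not the actual scores.

```python
def dismax(scores, tie_breaker=0.0):
    """Score = max + (sum - max) * tieBreaker."""
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def field_centric(doc, tie_breaker=0.0):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): sum per field, dismax over fields."""
    return dismax([sum(terms.values()) for terms in doc.values()], tie_breaker)

def term_centric(doc, tie_breaker=0.0):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): dismax per term, sum over terms."""
    all_terms = {t for terms in doc.values() for t in terms}
    return sum(
        dismax([terms.get(t, 0.0) for terms in doc.values()], tie_breaker)
        for t in all_terms
    )

# Hypothetical per-field scores for the query: chamber secrets rowling
excerpt_book = {"title": {}, "author": {},
                "abstract": {"chamber": 1.0, "secrets": 1.0, "rowling": 1.0}}
harry_potter = {"title": {"chamber": 1.2, "secrets": 1.2},
                "author": {"rowling": 1.2}, "abstract": {}}

for name, doc in [("excerpt book", excerpt_book), ("harry potter", harry_potter)]:
    print(name, round(field_centric(doc), 2), round(term_centric(doc), 2))
```

Under these assumptions the field-centric form ranks the excerpt book first, because all three terms sit in one field, while the term-centric form ranks the actual book first.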
  • 46. Combining fields • Of course: way more possibilities. See the multi-match query for examples. Most but not all possibilities can be done by hand (blending)
  • 47. Combining fields • Different strategy: combine all fields as if they were one field; do some re-scoring afterwards. Example: o Search ‘rowling’ anywhere, score 1 o Search ‘potter’ anywhere, score 1 o Combine with additional queries to do a finishing touch
  • 48. Explain • Always use explain (in debug mode) • Did I already tell you to always use explain? • Create a new application by first making explain part of your infrastructure • At least expose the scores in debug mode.
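One way to make explain part of the infrastructure is to let a debug flag switch it on in every search request. The sketch below only builds an Elasticsearch request body with a `multi_match` query and the `explain` option; the helper name, the debug flag and the field names (taken from the example documents above) are assumptions, and no client or cluster is involved.

```python
import json

def build_search_body(text, fields, debug=False, tie_breaker=0.1):
    """Build a search request body; in debug mode, ask for score explanations."""
    body = {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",      # dismax-style combination over fields
                "tie_breaker": tie_breaker,
            }
        }
    }
    if debug:
        body["explain"] = True  # each hit then carries a score explanation
    return body

body = build_search_body("chamber secrets rowling",
                         ["title", "author", "abstract"], debug=True)
print(json.dumps(body, indent=2))
```

Keeping the flag in one helper means that every query the application sends can be inspected the same way.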
  • 49. Suits: beware the logic rules… • Cannot be reversed: • The fact that I am not wearing a suit does not imply that: I am trustworthy; I am competent
  • 50. You Know, for Bits… Peter @ bitmanager.nl