The document discusses improving natural language processing (NLP) for information retrieval. It notes the challenge of connecting users to relevant information and outlines some approaches to improve search, including focusing on fundamentals like query segmentation, using various tools like facets to provide additional options, and starting simply to address discrete use cases before expanding capabilities. The goal is to better understand users through an iterative process of learning from their experiences.
A BETTER MATCH MEANS BETTER CARE®
Kyruus, Inc. CONFIDENTIAL. DO NOT DISTRIBUTE
The Search for NLP
Standing up “QuickLP” for PoC
Me
I’m an engineer because I’m a curious person who likes products & problem-solving.
- Interpersonal rhetoric
- HCI
- Healthcare IT
- Solr
- Data “intuition”
Kyruus Search
A better match means better care.
The Kyruus Search & API team exists to connect humans to relevant care by
connecting them to relevant data.
Problem
Information retrieval 101
[Diagram: the User, 😬 (you), and many pieces of Information]
The Space | Statistical relevance
https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558
The Space | NLP
https://hackernoon.com/various-optimisation-techniques-and-their-impact-on-generation-of-word-embeddings-3480bd7ed54f
Reality check
Your users don’t care t
[Diagram: the User, 😬 (you), and many pieces of Information]
Door No. 3, Johnny!
Approach: Spray & pray
The work: Tune, tweak, and “test” all sorts of configurations, settings, analyzers, etc.
The hope: That you make enough permutations to catch most people, and that the parts you can’t cover just don’t show up (head in the sand)
The reality:
- You’re leaving some users out in the cold
- You’re spending valuable engineering resources trying to fix it in a way that will never last, simply building a house of cards that will fall as soon as something changes

Approach: Host Sesame Street
The work: Spend lots of time to source a great ML/AI/NLP candidate, spend lots of money to secure the best candidate, and spend a lot of time trying to get your organization suddenly ready for the work they will do (e.g. analytics, logging, tracking, monitoring, et al.)
The hope: That you’ll spend enough money to buy yourself a silver bullet, that this person will save the day
The reality:
- There are no silver bullets
- Having the right person is only part of the equation; the organization must be at a point of maturation to support them and their work long-term
- You just bet all your chips on red, and the house always wins

Approach: Crawl, walk, run
The work: Find areas of opportunity and exploit them creatively with the tools on hand
The hope: That you solve discrete use cases, one at a time, while learning deeper opportunities and greater nuances in the user’s experience
The reality:
- You really do solve painful user experiences
- You, your team, and your organization are given the requisite time to grow & mature into a new competency, at a fraction of the cost, while delivering on user value throughout the whole process
Resources
Ted Sullivan @ Lucidworks
Giovanni Fernandez-Kincade
Berlin Buzzwords
Haystack Conference
Activate Conference
SparkNLP Slack
Relevant Search -- book & Slack
This isn’t a master class on exactly what to do. It’s intended to stir your creativity, prod at some things you hadn’t thought about before, and get you pointed in the right direction.
Bad news: you are the reason they can’t get to it
Good news: you are the only way they will be able to get to it
Their life is in your hands
Word embeddings:
GloVe
Word2Vec
Bag of words
FastText
Polysemy (one word, several senses), e.g. “got”:
I got the invite to do this talk
I got anxious
Hope you can say afterwards, “I got it”
The group grew: Bert, Ernie, Big Bird, etc.
It’s getting a bit out of hand
As much as your users likely love Sesame Street…
they don’t care about how bleeding edge your solution is
They’ll be grouchier than Oscar when your solution doesn’t work. They’ll be happy as Elmo when it does—regardless of how.
Cute, lovable puppet characters notwithstanding
There is likely a very fat initial part of your tail wherein you can get outsized gains and improvements. You won’t solve all the issues, but you’ll solve 80% of them with only 20% of the work or investment. Or perhaps you’ll solve the problems that equate to 80% of the user value, company bottom-line, etc.
The point is this: focus on the wins, not the hows; the value, not the tech.
You can probably spend 1-2 days tops and comb through some logs to find areas of opportunity for your application.
If you can’t see it clearly then how could you spec it clearly?
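As a rough illustration of what combing through logs for the fat head might look like, here is a minimal Python sketch. It assumes you can already get raw query strings out of your logs as a list; the sample queries and the `head_coverage` helper are hypothetical, not part of any existing tool.

```python
from collections import Counter

def head_coverage(queries, top_n):
    """Return the top_n most common normalized queries and the share of traffic they cover."""
    # Normalize case and whitespace so trivially different queries are counted together.
    normalized = [" ".join(q.lower().split()) for q in queries if q.strip()]
    counts = Counter(normalized)
    top = counts.most_common(top_n)
    covered = sum(count for _, count in top)
    share = covered / len(normalized) if normalized else 0.0
    return top, share

# Hypothetical log lines; in practice these would be read from your search logs.
sample_log = [
    "cardiologist", "Cardiologist", "pediatrician near me", "cardiologist 46220",
    "pediatrician near me", "cardiologist", "knee pain", "dr smith",
]

top, share = head_coverage(sample_log, top_n=2)
print(top)
print(f"top 2 queries cover {share:.0%} of traffic")
```

If a handful of queries covers a large share of traffic, those are likely the discrete use cases worth solving first.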
E.g. Dismax
Optimize what you have before you build new, costly tech that needs to be optimized
“Pediatric cardiologist 46220” will have a better chance of being ranked properly for relevance once we structure our data appropriately and shift to the term-centric approach found in a dismax query
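A minimal sketch of what such a request could look like, using Solr's Extended DisMax parameters (`defType`, `qf`, `pf`, `mm`). The field names and boost values are assumptions for illustration only; your schema will differ.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query):
    """Build Solr request parameters for an (e)dismax search over several fields."""
    return {
        "defType": "edismax",
        "q": user_query,
        # qf: each query term is tried against each listed field.
        # Field names and boosts here are hypothetical.
        "qf": "specialty^5 name^3 zip^2 keywords",
        # pf: extra boost when the query terms appear together as a phrase.
        "pf": "specialty^10",
        # mm: require every term to match in at least one field.
        "mm": "100%",
    }

params = build_edismax_params("pediatric cardiologist 46220")
print("/select?" + urlencode(params))
```

With this shape, “pediatric” and “cardiologist” can match the specialty field while “46220” matches the zip field, without the user having to say which term belongs where.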
Suggested reading: Relevant Search by Doug Turnbull and John Berryman
It’s not cheating to use your app layer, other technologies, etc.
Redis + Zip
Maybe instead of just ranking higher on a zip match you want to filter on it. Use a regex to pull the zip out of the query and turn it into a facet filter.
Or maybe you don’t want to filter out other zips but have concentric rings of sorting done based on your user’s submitted zip code.
Adding this layer is a trivial amount of work for your engineers, a trivial impact on your infrastructure (e.g. Redis storage), and a trivial amount of added latency on each request, but it’s a non-trivial upgrade to your user’s overall experience.
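Here is a rough sketch of that app-layer step. A plain dict stands in for the Redis lookup so the example is self-contained, and the spatial field name, centroid values, and radius are all assumptions. The emitted parameters use Solr's `geofilt` filter and `geodist()` sort function.

```python
import re

# Stand-in for a Redis hash of zip -> (lat, lon); the values are illustrative.
ZIP_CENTROIDS = {"46220": (39.87, -86.11)}

ZIP_PATTERN = re.compile(r"\b(\d{5})\b")

def split_zip(raw_query, lookup=ZIP_CENTROIDS, hard_filter=True):
    """Pull a known 5-digit zip out of the query.

    Returns (remaining query text, extra Solr parameters).
    """
    match = ZIP_PATTERN.search(raw_query)
    if match is None or match.group(1) not in lookup:
        # No recognizable zip: pass the query through untouched.
        return " ".join(raw_query.split()), {}
    lat, lon = lookup[match.group(1)]
    remaining = raw_query[:match.start()] + " " + raw_query[match.end():]
    extra = {
        "sfield": "location",      # hypothetical spatial field in the schema
        "pt": f"{lat},{lon}",
        "sort": "geodist() asc",   # nearest results first ("concentric rings")
    }
    if hard_filter:
        extra["fq"] = "{!geofilt}"  # drop anything outside the radius
        extra["d"] = "40"           # radius in kilometers, illustrative
    return " ".join(remaining.split()), extra

text, extra = split_zip("pediatric cardiologist 46220")
print(text, extra)
```

Passing `hard_filter=False` keeps results from other zips but still sorts by distance, which corresponds to the concentric-rings option.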
Facets are features. They’re facts about your data, simple truths that help you navigate it.
This is why users use them when your precision isn’t good and your recall is really high: facets are the features they wish they’d given you or that you’d discerned.
This being the case, you could be very aggressive and stuff keywords, highly-sought-after terms and phrases, into a special junk drawer for your documents and have those fields boosted in your dismax query.
If nothing else, faceting can help you see the shape of your data a bit better, its density and distribution, which in turn helps you build the “cheapLP” solution needed to get users to the data.
Note, of course, that facets can be one of the first stops for your new ML/NLP engineer to get relevant feature data for their models
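To make the "shape of your data" idea concrete, here is a small sketch that computes facet-style counts over a list of documents in plain Python. The documents and field names are made up; in a real system the search engine's own faceting would return these counts.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents carry each value of a field (single or multi-valued)."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts.most_common()

# Hypothetical provider documents.
docs = [
    {"specialty": "cardiology", "languages": ["english", "spanish"]},
    {"specialty": "cardiology", "languages": ["english"]},
    {"specialty": "pediatrics"},
]

print(facet_counts(docs, "specialty"))
print(facet_counts(docs, "languages"))
```

Counts like these show which values are dense and which are sparse, and the same field/value pairs are natural candidates for model features later.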
Ted Sullivan from Lucidworks
Wikipedia: “heart attackers” isn’t a phrase there, “heart attack” is. Leverage that for phrase recognition.
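One cheap way to act on this is a greedy, longest-match segmenter over a list of known phrases. The sketch below uses a tiny hard-coded set as a stand-in for a phrase list mined from a source such as Wikipedia titles; the set contents and the `segment` helper are hypothetical.

```python
# Stand-in for a phrase list mined from an external source such as Wikipedia titles.
KNOWN_PHRASES = {("heart", "attack"), ("urgent", "care")}

def segment(query, phrases=KNOWN_PHRASES):
    """Greedily group query tokens into the longest known phrases, left to right."""
    tokens = query.lower().split()
    longest = max((len(p) for p in phrases), default=1)
    segments, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two tokens.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                segments.append(" ".join(candidate))
                i += size
                break
        else:
            # No known phrase starts here: keep the single token.
            segments.append(tokens[i])
            i += 1
    return segments

print(segment("heart attack symptoms"))
print(segment("heart attackers"))
```

Recognized phrases can then be sent to the search engine as quoted phrases, while the remaining tokens stay as individual terms.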
Don’t build a huge algorithm to know what “near me” means -- there are plenty of fat head areas of opportunity
Pointwise mutual information (PMI) is an information-theoretic measure of association: how much more often two words occur together than chance would predict, e.g. “heart” and “attack” vs. “heart attack”
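A minimal sketch of pointwise mutual information for adjacent word pairs, estimated from raw counts with no smoothing. The toy corpus is invented for illustration; on real data you would want far more text and probably a minimum-count threshold.

```python
import math
from collections import Counter

def pmi(corpus, w1, w2):
    """PMI (log base 2) of w1 immediately followed by w2, from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    if not bigrams[(w1, w2)]:
        # The pair never occurs together in this corpus.
        return float("-inf")
    p_xy = bigrams[(w1, w2)] / sum(bigrams.values())
    p_x = unigrams[w1] / sum(unigrams.values())
    p_y = unigrams[w2] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus, invented for illustration.
corpus = [
    ["heart", "attack", "symptoms"],
    ["heart", "attack", "treatment"],
    ["heart", "surgery", "recovery"],
    ["panic", "attack", "symptoms"],
]

print(pmi(corpus, "heart", "attack"))
print(pmi(corpus, "heart", "symptoms"))
```

A high score suggests the pair behaves like a single phrase and is a candidate for phrase recognition; a pair that never co-occurs gets negative infinity here.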
The reality is that there are a LOT of problems to solve
If you don’t focus on one at a time then you probably won’t get to any of them. Sesame Street is smart in that they take one letter and one number at a time and teach it to kids.
Boiling the ocean is frustrating. Solving real user problems is a lot of fun, especially if you do it cost-effectively and build momentum in your stream of value delivery.