2019.05.15
Reflected Intelligence:
AI, Search, and the Disruption of KM
DOD and KM Symposium 2019: What is the Future?
Trey Grainger
Chief Algorithms Officer
• Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research publications
• Advisor to Presearch, the decentralized search engine
• Advisor to several startups
• Open Source Apache Lucene / Solr contributor
About Me
The Search & AI Conference
COMPANY BEHIND
Who are we?
230 CUSTOMERS ACROSS THE
FORTUNE 1000
400+ EMPLOYEES
OFFICES IN
San Francisco, CA (HQ)
Raleigh-Durham, NC
Cambridge, UK
Bangalore, India
Hong Kong
Employ about 40% of the active committers on the Solr project
Contribute over 70% of Solr's open source codebase
DEVELOP & SUPPORT
Apache
Industry’s most powerful
Intelligent Search & Discovery Platform.
What is the Goal of
Knowledge Management?
Creating Sharing
Using Managing
Knowledge & Information
Why: Achieve Organizational Objectives
Creating Sharing
Using Managing
Knowledge & Information
What?:
Why?: Achieve Organizational Objectives
How?: Right Answers + Right People + Right Time
Search has become today’s de-facto user interface
for delivering knowledge & information
for
seeking knowledge & information
Search Appliance
Previous attempts have
failed.
• Proudly built with open-
source tech at its core:
Apache Solr & Apache Spark
• Personalizes work with
applied machine learning
• Proven on the biggest
corporate & government
information systems
Let the most respected
analysts in the world
speak on our behalf
Dassault Systèmes
Mindbreeze
Coveo
Microsoft
Attivio
Expert System
Smartlogic
Sinequa
IBM
IHS Markit
Funnelback
Micro Focus
COMPLETENESS OF VISION
ABILITY TO EXECUTE
CHALLENGERS LEADERS
NICHE PLAYERS VISIONARIES
Source: June 2018 Gartner Magic Quadrant report on Insight Engines.
© Gartner, Inc.
What do you mean by “Search”?
20 Years Ago:
Search was navigating a
Taxonomy of Relationships
10 Years Ago
Search was finding 10 Blue Links
Today’s Search Is:
•Domain-aware
•Assistive
•Contextual & Personalized (location, last search, profile)
•Conversational
•Multi-modal (Text, Voice, Images, Event/Pushed-based)
•Smart (AI-powered)
•Beyond links and information to
Answers and Action
Basic Keyword Search
(inverted index, tf-idf, bm25,
multilingual text analysis, query
formulation, etc.)
Query Intent
(query classification, semantic query parsing,
semantic knowledge graphs, concept
expansion, automatic query rewrites, clustering,
classification, personalization, question/answer
systems, virtual assistants)
Automated Relevancy Tuning
(Signals, AB Testing/multi-armed
bandits/back-testing, genetic
algorithms, Deep Learning,
Learning to Rank)
Self-learning
Taxonomies / Entity
Extraction
(entity recognition,
taxonomies, ontologies,
business rules,
synonyms, etc.)
Search Intelligence Spectrum
Key Query Intent Components:
• Apache Solr
• Solr Text Tagger
• Semantic Knowledge Graph
• Statistical Phrase Identifier
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion AI Phrase Identification Job
• Fusion Query Rules Engine
Through these tools, the engine self-learns
domain-specific semantic relationships
… and enables domain experts to easily accept or adjust the built-in AI:
completely deferring to the AI, trusting it above a certain confidence level,
or even manually approving every suggestion.
Fusion AI Jobs
Traditional
Keyword
Search
Recommendations
Semantic
Search
User Intent
Personalized
Search
Augmented
Search
Domain-aware
Matching
Understanding User Intent
What is
Reflected Intelligence?
Importance of Feedback Loops
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
Southern Data Science
Signal Boosting
User     Query   Results
Alonzo   pizza   doc10, doc22, doc12, …
Elena    soup    doc84, doc2, doc17, …
Ming     pizza   doc10, doc22, doc12, …
…        …       …

User     Action     Document
Alonzo   click      doc22
Elena    click      doc17
Ming     click      doc12
Alonzo   purchase   doc22
Ming     click      doc22
Ming     purchase   doc22
Elena    click      doc2
…        …          …

Query   Document   Signal Boost
pizza   doc22      54,321
pizza   doc12      987
soup    doc17      1,234
soup    doc2       2,345
…       …

pizza ⌕
query: pizza
boost: doc22^54321
boost: doc12^987
ƒ(x) = Σ(click * click_weight * time_decay) +
Σ(purchase * purchase_weight * time_decay)
+ other_factors
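The boost formula above can be sketched in a few lines of Python. This is a minimal illustration, not Fusion's implementation: the weight values, the 30-day half-life, and the function names are all assumptions, since the slide gives only the general shape of ƒ(x) and omits other_factors.

```python
from collections import defaultdict

# Hypothetical weights and half-life; the slide does not give actual values.
SIGNAL_WEIGHTS = {"click": 1.0, "purchase": 10.0}
HALF_LIFE_DAYS = 30.0


def time_decay(age_days, half_life=HALF_LIFE_DAYS):
    """Exponential decay: a signal loses half its value every half_life days."""
    return 0.5 ** (age_days / half_life)


def signal_boosts(signals):
    """Sum weight * time_decay per (query, document) pair.

    signals: iterable of (query, action, doc_id, age_days) tuples.
    """
    boosts = defaultdict(float)
    for query, action, doc_id, age_days in signals:
        boosts[(query, doc_id)] += SIGNAL_WEIGHTS.get(action, 0.0) * time_decay(age_days)
    return dict(boosts)


def boost_clause(query, boosts):
    """Render doc^boost terms for one query, strongest boost first."""
    ranked = sorted(
        ((doc, w) for (q, doc), w in boosts.items() if q == query),
        key=lambda pair: -pair[1],
    )
    return [f"{doc}^{w:g}" for doc, w in ranked]


# The action log from the slide, joined to each user's query, all with age 0 days.
signals = [
    ("pizza", "click", "doc22", 0),
    ("soup", "click", "doc17", 0),
    ("pizza", "click", "doc12", 0),
    ("pizza", "purchase", "doc22", 0),
    ("pizza", "click", "doc22", 0),
    ("pizza", "purchase", "doc22", 0),
    ("soup", "click", "doc2", 0),
]
boosts = signal_boosts(signals)
pizza_boosts = boost_clause("pizza", boosts)
```

With these assumed weights, doc22 accumulates a far larger boost for "pizza" than doc12, which mirrors the ordering on the slide even though the absolute numbers differ.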
• 200%+ increase in
click-through rates
• 91% lower TCO
• 50,000 fewer support
tickets
• Increased customer
satisfaction
Learning to Rank
User     Query   Re
Alonzo   pizza   do, do, do
Elena    soup    do, do, do
Ming     pizza   do, do, do
…        …       …
User     Action     Document
Alonzo   click      doc22
Elena    click      doc17
Ming     click      doc12
Alonzo   purchase   doc22
Ming     click      doc22
Ming     purchase   doc22
Elena    click      doc2
…        …          …

Feature                 Weight
title_match_any_terms   15.25
is_known_category       10
popularity              9.5
content_age             9.2
…                       …

pizza ⌕
Initial Results:
1) doc1
2) doc2
3) doc3
Build Ranking Classifier
(from Implicit Relevance Judgements)
Final Results:
1) doc3
2) doc1
3) doc2
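As a rough sketch of the re-ranking step only, the learned feature weights can be applied as a linear scoring function. The weights are taken from the table above; the per-document feature values are hypothetical, chosen so the outcome matches the slide, and training the model from implicit judgements is not shown.

```python
# Feature weights from the slide's table.
WEIGHTS = {
    "title_match_any_terms": 15.25,
    "is_known_category": 10.0,
    "popularity": 9.5,
    "content_age": 9.2,
}

# Hypothetical feature values in [0, 1] for the three initial results.
FEATURES = {
    "doc1": {"title_match_any_terms": 1, "is_known_category": 0, "popularity": 0.5, "content_age": 0.2},
    "doc2": {"title_match_any_terms": 0, "is_known_category": 1, "popularity": 0.3, "content_age": 0.1},
    "doc3": {"title_match_any_terms": 1, "is_known_category": 1, "popularity": 0.9, "content_age": 0.8},
}


def score(doc_features, weights=WEIGHTS):
    """Linear model: weighted sum of a document's feature values."""
    return sum(weights[name] * value for name, value in doc_features.items())


def rerank(initial_results, features=FEATURES):
    """Re-order the initial result list by descending model score."""
    return sorted(initial_results, key=lambda doc: score(features[doc]), reverse=True)


final_results = rerank(["doc1", "doc2", "doc3"])
```

A linear model is the simplest choice here; production learning-to-rank systems often use tree ensembles or neural models, but the re-ranking flow is the same.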
Collaborative Filtering (Recommendations)
User     Query   Results
Alonzo   pizza   doc10, doc22, doc12, …
Elena    soup    doc84, doc2, doc17, …
Ming     pizza   doc10, doc22, doc12, …
…        …       …

User     Action     Document
Alonzo   click      doc22
Elena    click      doc17
Ming     click      doc12
Alonzo   purchase   doc22
Ming     click      doc22
Ming     purchase   doc12
Elena    click      doc2
…        …          …

User     Item    Weight
Alonzo   doc22   1.0
Alonzo   doc12   0.4
…        …       …
Ming     doc12   0.9
Ming     doc22   0.6
…        …       …

pizza ⌕
Matrix Factorization
Recommendations for Alonzo:
• doc22: “Pepperoni Pizza”
• doc12: “Cheese Pizza”
…
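To illustrate the matrix factorization step, here is a minimal pure-Python sketch that learns latent user and item vectors by stochastic gradient descent. Only the four user/item weights visible in the table are used; the factor count, learning rate, epoch count, and function names are assumptions, and real recommenders typically use a library implementation such as ALS on Spark.

```python
import random


def factorize(ratings, n_factors=2, lr=0.05, epochs=5000, seed=0):
    """Learn latent vectors so that dot(P[user], Q[item]) approximates each weight.

    ratings: dict mapping (user, item) to an observed weight.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    P = {u: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(a * b for a, b in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * err * qi  # gradient step on the user factor
                Q[i][f] += lr * err * pu  # gradient step on the item factor
    return P, Q


def predict(P, Q, user, item):
    """Predicted affinity is the dot product of the latent vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


def recommend(P, Q, user):
    """All items ordered by predicted affinity for the user."""
    return sorted(Q, key=lambda item: predict(P, Q, user, item), reverse=True)


# User/item weights from the slide's table.
ratings = {
    ("Alonzo", "doc22"): 1.0,
    ("Alonzo", "doc12"): 0.4,
    ("Ming", "doc12"): 0.9,
    ("Ming", "doc22"): 0.6,
}
P, Q = factorize(ratings)
alonzo_recs = recommend(P, Q, "Alonzo")
```

With a fuller, sparser matrix the same dot product also yields scores for items a user has never interacted with, which is where the recommendations come from.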
Summary of Signal-based Ranking Models
• Signals Boosting: Ensures the most popular specific
content/answers for a query return first
• Learning to Rank: Learns a model of which
combinations of features generally matter most across
users, and ranks all content/answers using that model
• Collaborative Filtering: Learns which items are best to
recommend to a given user (or related to a given item)
based on behavior of other users who have previously
interacted with similar items
Today, many organizations run
A/B experiments to test hypotheses to
“limit” the unknown negative impact
to a subset of users
…and then make only the specific choices
that will achieve the desired outcomes
But what if we could peer into millions of
alternate futures…
In other words, imagine if we could
simulate how users would interact with
changes before ever having to
expose real users to those changes.
User     Query   Results
Alonzo   pizza   doc10, doc22, doc12, …
Elena    soup    doc84, doc2, doc17, …
Ming     pizza   doc10, doc22, doc12, …
…        …       …

User     Action     Document
Alonzo   click      doc22
Elena    click      doc17
Ming     click      doc10
Alonzo   purchase   doc22
Ming     click      doc22
Ming     purchase   doc22
Elena    click      doc2
…        …          …
Relevance Simulation (backtesting)
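One simple way to picture backtesting is to replay logged clicks against a candidate ranking and measure how many of them would have landed in its top results. The sketch below assumes that metric (click hit rate at k) and a hypothetical candidate ranking; the slide does not specify how the simulation is scored, and a real system would also need to correct for position bias in the logged clicks.

```python
def backtest(rankings, history, k=2):
    """Fraction of logged clicks whose document is in the ranking's top k.

    rankings: dict mapping query to an ordered list of doc ids.
    history: list of (query, clicked_doc) pairs from past sessions.
    """
    hits = sum(1 for query, doc in history if doc in rankings[query][:k])
    return hits / len(history)


# Click events from the slide's action log, joined to each user's query.
history = [
    ("pizza", "doc22"),
    ("soup", "doc17"),
    ("pizza", "doc10"),
    ("pizza", "doc22"),
    ("soup", "doc2"),
]

# Current production ordering, as shown in the Results column.
current = {"pizza": ["doc10", "doc22", "doc12"], "soup": ["doc84", "doc2", "doc17"]}
# Hypothetical candidate ordering to evaluate offline.
candidate = {"pizza": ["doc22", "doc10", "doc12"], "soup": ["doc17", "doc2", "doc84"]}

current_score = backtest(current, history)
candidate_score = backtest(candidate, history)
```

Here the candidate captures every logged click in its top two positions while the current ordering misses one, so the candidate would be the one promoted to a live A/B test.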
Example Use Cases
How do you tell this story?
Goal: Enable our users to independently turn data into information into knowledge
Digital Commerce
The Curation Challenge:
Gracefully combining human
and machine intelligence to
deliver relevance
Personalization
Regular Search Results:
Personalized Search Results:
User:
Digital Workplace
(“Enterprise Search”)
Facet,
Topic &
Cluster
Query Rule
Matching
Natural
Language
Machine
Learning
Boosted
Results
Signals
Content
Index
System Generated
Human Generated
Application Generated
Solution
Digital Workplace
Data
NLP: NER, Phrases, POS
Document Classification
Anomaly Detection
Clustering
Topic Detection
Connectors
ETL Pipelines
Search Engine &
Data Processing
SQL Engine
Rules Engine
Scheduling & Alerting
Query Pipelines
Query Intent Detector
Automatic Relevancy
Signals & Query Analytics
Recommenders
A/B Testing
Scalable Operations
Extensible
System Generated
Application Generated
Data
Modular Components
Stateless Architecture
User-focused Experience
Geospatial Mapping
Results Preview
Rapid Prototyping
Digital Workplace
Solution
Cloud Scalable CDCR Security
Human Generated
Connect users to
insights precisely at
their moment of need
any format, any platform
What is a Knowledge Graph?
(vs. Ontology vs. Taxonomy vs. Synonyms, etc.)
Overly Simplistic Definitions
Ontology: Defines relationships between types of things
[ animal eats food; human is animal ]
Knowledge Graph: Instantiation of an
Ontology (contains the things that are related)
[ john is human; john eats food ]
Taxonomy: Classifies things into Categories
[ john is Human; Human is Mammal; Mammal is Animal ]
Synonyms List: Provides substitute words that can be used to represent the
same or very similar things
[ human => homo sapiens, mankind; food => sustenance, meal ]
Alternative Labels: Substitute words with identical meanings
[ CTO => Chief Technology Officer; specialise => specialize ]
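These definitions map naturally onto small data structures. The sketch below encodes the bracketed examples as Python sets and dicts; the representation and the helper names are illustrative assumptions, not a standard format.

```python
# Ontology: relationships between types of things.
ontology = {("animal", "eats", "food"), ("human", "is", "animal")}

# Knowledge graph: the same kinds of relationships, instantiated for specific things.
knowledge_graph = {("john", "is", "human"), ("john", "eats", "food")}

# Taxonomy: each thing or category points to its parent category.
taxonomy = {"john": "Human", "Human": "Mammal", "Mammal": "Animal"}

# Synonyms list and alternative labels: term substitutions.
synonyms = {"human": ["homo sapiens", "mankind"], "food": ["sustenance", "meal"]}
alt_labels = {"CTO": "Chief Technology Officer", "specialise": "specialize"}


def ancestors(node, parents=taxonomy):
    """Walk up the taxonomy and collect every enclosing category."""
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain


def facts_about(subject, graph=knowledge_graph):
    """All (relation, object) pairs the knowledge graph holds for a subject."""
    return sorted((rel, obj) for subj, rel, obj in graph if subj == subject)


john_categories = ancestors("john")
john_facts = facts_about("john")
```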
In practice, there is significant overlap…
Synonyms
List
Taxonomy
Ontology
Knowledge Graph
Alt.
Labels
What kind of Knowledge Graph
can help us with the
kinds of problems we encounter
in Search use cases?
Knowledge
Graph
Challenges of building a traditional knowledge graph
Because current knowledge bases / ontology learning systems typically
require explicitly modeling nodes and edges into a graph ahead of time, this
unfortunately presents several limitations to the use of such a knowledge graph:
• Entities not modeled explicitly as nodes have no known relationships to any other entities.
• Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore
such a graph is not ideal for representing nuanced meanings of an entity when appearing
within different contexts, as is common within natural language.
• Substantial meaning is encoded in the linguistic representation of the domain that is lost
when the underlying textual representation is not preserved: phrases, interaction of concepts
through actions (i.e. verbs), positional ordering of entities and the phrases containing those
entities, variations in spelling and other representations of entities, the use of adjectives to
modify entities to represent more complex concepts, and aggregate frequencies of occurrence
for different representations of entities relative to other representations.
• It can be an arduous process to create robust ontologies, map a domain into a graph
representing those ontologies, and ensure the generated graph is compact, accurate,
comprehensive, and kept up to date.
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A
compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
“Unstructured data” is most often used in
reference to
“free text”
But… unstructured data is really
more like “hyper-structured”
data. It is a graph that contains
much more structure than
typical “structured data.”
Structured Data
Employees Table
id      name            company   start_date
lw100   Trey Grainger   1234      2016-02-01
dis2    Mickey Mouse    9123      1928-11-28
tsla1   Elon Musk       5678      2003-07-01
Companies Table
id     name         start_date
1234   Lucidworks   2016-02-01
5678   Tesla        1928-11-28
9123   Disney       2003-07-01
Callouts: Discrete Values, Continuous Values, Foreign Key
Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the 2019 DOD & Federal
KM Symposium. #KMSymposium is being held in
Baltimore May 14-16, 2019. Trey got his masters
from Georgia Tech.
Trey Grainger works for Lucidworks.
He is speaking at the DOD & Federal
KM Symposium 2019.
#KMSymposium
(DOD & Federal KM Symposium) is being
held in Baltimore May 14-16, 2019.
Trey got his masters degree from
Georgia Tech.
Trey’s Voicemail
Foreign Key?
Fuzzy Foreign Key? (Entity Resolution)
Fuzzier Foreign Key? (metadata, latent features)
Not so fast!
Giant Graph of Relationships...
Semantic Knowledge Graph
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A
compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge
Graph
Graph Traversal
Data Structure View: documents (doc 1 … doc 6) connect skill values (Java, Scala,
Hibernate, Oncology) to job_title values (Software Engineer, Data Scientist,
Java Developer, …). Each hop alternates an Inverted Index Lookup with a
Forward Index Lookup.
Graph View: nodes Java, Scala, Hibernate, Java Developer, Software Engineer, and
Data Scientist, connected by has_related_skill and has_related_job_title edges.
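The alternating lookups can be sketched with two dictionaries: an inverted index from a field value to the documents containing it, and a forward index from a document to its field values. The documents below are hypothetical, since the diagram does not show which skills and job titles belong to which document, and this is not how Solr stores either index internally.

```python
from collections import defaultdict

# Forward index: document id -> field -> values (hypothetical contents).
docs = {
    1: {"skill": ["Java"], "job_title": ["Software Engineer"]},
    2: {"skill": ["Java", "Hibernate"], "job_title": ["Java Developer"]},
    3: {"skill": ["Scala"], "job_title": ["Data Scientist"]},
    4: {"skill": ["Java", "Scala"], "job_title": ["Software Engineer"]},
}


def build_inverted_index(forward):
    """Inverted index: (field, value) -> set of document ids."""
    inverted = defaultdict(set)
    for doc_id, fields in forward.items():
        for field, values in fields.items():
            for value in values:
                inverted[(field, value)].add(doc_id)
    return inverted


def traverse(inverted, forward, source, target_field):
    """One hop: source node -> matching docs -> co-occurring target values."""
    counts = defaultdict(int)
    for doc_id in inverted.get(source, ()):  # inverted index lookup
        for value in forward[doc_id].get(target_field, []):  # forward index lookup
            counts[value] += 1
    return dict(counts)


inverted = build_inverted_index(docs)
related_titles = traverse(inverted, docs, ("skill", "Java"), "job_title")
related_skills = traverse(inverted, docs, ("job_title", "Software Engineer"), "skill")
```

Each returned value is a candidate edge (has_related_job_title or has_related_skill); the raw co-occurrence counts still need scoring before they are useful as edge weights.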
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
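The z-score above translates directly into code. This sketch computes only the raw z value; the relatedness figures in the sample response fall between -1 and 1, so some normalization is evidently applied afterwards, and that step is not shown on the slide or here. The example counts are made up.

```python
import math


def relatedness_z(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count against its background probability.

    count_fg: foreground documents containing the term
    total_docs_fg: total documents in the foreground set
    prob_bg: probability that a background document contains the term
    """
    expected = total_docs_fg * prob_bg
    return (count_fg - expected) / math.sqrt(expected * (1 - prob_bg))


# Hypothetical counts: the term is in 10% of all documents,
# but in 50 of the 100 documents matching the foreground query.
z_related = relatedness_z(count_fg=50, total_docs_fg=100, prob_bg=0.1)

# A term that appears exactly as often as the background predicts scores 0.
z_neutral = relatedness_z(count_fg=10, total_docs_fg=100, prob_bg=0.1)
```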
{ "type":"keywords", "values":[
  { "value":"hive", "relatedness":0.9773, "popularity":369 },
  { "value":"java", "relatedness":0.9236, "popularity":15653 },
  { "value":".net", "relatedness":0.5294, "popularity":17683 },
  { "value":"bee", "relatedness":0.0, "popularity":0 },
  { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
  { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
Foreground Query:
"Hadoop"
Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg
Content-based Recommendations (More Like This on Steroids)
http://localhost:8983/solr/job-postings/skg
Who’s in Love with Jean Grey?
NER automatically translates…
Barack Obama was the president of the United States of America. Before that, Obama was a
senator.
into…
<person id="barack_obama">Barack Obama</person> was the <role>president</role> of the
<country id="usa">United States of America</country>. Before that, <person
id="barack_obama">Obama</person> was a <role>senator</role>.
In the search engine, this would become:
text: Barack Obama was the president of the United States of America. Before that, Obama was a
senator.
person: Barack Obama
country: United States of America
role: [ president, senator ]
Named Entity Recognition (NER)
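The last step, turning tagged text into search engine fields, can be sketched with a regular expression. This assumes the tagger emits exactly the inline markup shown above (no nested tags, an optional id attribute) and that mentions sharing an id should collapse to the first surface form; the function name is illustrative.

```python
import re

# Matches <tag>text</tag> or <tag id="...">text</tag>; nested tags are not handled.
TAG = re.compile(r'<(\w+)(?:\s+id="([^"]*)")?>(.*?)</\1>')


def to_fields(tagged_text):
    """Convert NER-tagged text into a dict of search fields."""
    fields, seen = {}, set()
    for tag, entity_id, surface in TAG.findall(tagged_text):
        key = (tag, entity_id or surface)  # mentions with the same id are one entity
        if key not in seen:
            seen.add(key)
            fields.setdefault(tag, []).append(surface)
    # The plain text field is the original sentence with the markup removed.
    fields["text"] = TAG.sub(lambda m: m.group(3), tagged_text)
    return fields


tagged = (
    '<person id="barack_obama">Barack Obama</person> was the <role>president</role> of the '
    '<country id="usa">United States of America</country>. Before that, '
    '<person id="barack_obama">Obama</person> was a <role>senator</role>.'
)
fields = to_fields(tagged)
```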
Differentiating related terms
Misspellings: managr => manager
Synonyms: cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*: driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph
Knowledge Graph
Semantic Knowledge Graph: Discovering ambiguous phrases
1) Use a document
classification field (i.e.
category) as the first level of
a graph, and the related
terms as the second level to
which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your users’ intent, so the quality depends on how clean
and representative your documents are.
Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of
categories. Effectively a dynamically-materialized probabilistic graphical model.
Disambiguation by Category Example
Meaning 1: Restaurant => bbq, brisket, ribs, pork, …
Meaning 2: Outdoor Equipment => bbq, grill, charcoal, propane, …
Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
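The three selection rules can be sketched as a small scoring function: count how many context terms (from the user profile, search history, or the rest of the query) overlap each meaning's term vector, and fall back to the most common meaning on a tie. The prior values reuse the approximate 80%/20% split for "driver" from the earlier slide; attaching them to these two term vectors, and the function name, are assumptions.

```python
# Disambiguated meanings of "driver" with assumed prior likelihoods.
SENSES = {
    "driver": [
        {"terms": {"linux", "windows", "embedded"}, "prior": 0.2},
        {
            "terms": {"truck driver", "cdl driver", "delivery driver",
                      "class b driver", "cdl", "courier"},
            "prior": 0.8,
        },
    ]
}


def pick_sense(term, context_terms, senses=SENSES):
    """Return the index of the best meaning for term.

    Meanings are compared first by overlap with the context terms,
    then by prior likelihood when the overlap is tied (including no context).
    """
    context = set(context_terms)
    candidates = senses[term]
    return max(
        range(len(candidates)),
        key=lambda i: (len(candidates[i]["terms"] & context), candidates[i]["prior"]),
    )


from_profile = pick_sense("driver", ["c++", "linux"])  # prior searches
from_query = pick_sense("driver", ["courier"])         # other query terms
no_context = pick_sense("driver", [])                  # falls back to the prior
```

Index 0 is the software meaning and index 1 the trucking meaning, so a user who previously searched for "linux" gets the software sense while a query with no context defaults to trucking.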
Thought Exercise
What do you think of when I say the
word “Facebook”?
Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
"embrace"
What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
So what’s my end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
Example Query
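Once phrases have been identified and related terms looked up, assembling the expanded query is mostly string building. The sketch below assumes the parsing and expansion have already happened and simply hard-codes their output from the example; the function names are illustrative, and the geofilt clause for the location is left out.

```python
def quote(term):
    """Quote anything that is not a single alphanumeric token."""
    return term if term.isalnum() else f'"{term}"'


def expand(phrase, related, boost=10):
    """Boost the original phrase and OR it with its related terms."""
    terms = [f"{quote(phrase)}^{boost}"] + [quote(r) for r in related]
    return "(" + " OR ".join(terms) + ")"


def expand_query(parsed):
    """parsed: list of (phrase, related_terms) pairs from semantic parsing."""
    return " AND ".join(expand(phrase, related) for phrase, related in parsed)


# Phrases and expansions copied from the example query.
parsed = [
    ("machine learning", ["data scientist", "data mining", "artificial intelligence"]),
    ("research and development", ["r&d"]),
    ("software engineer", ["software developer"]),
    ("hadoop", ["big data", "hbase", "hive"]),
    ("java", ["j2ee"]),
]
expanded = expand_query(parsed)
```

Boosting the user's original phrase keeps exact matches ranked above documents that match only through an expansion term.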
Why this Semantic Nuance Matters
• Search is today’s de-facto User Experience for delivering
knowledge and information.
• Reflected Intelligence uses content + signals to constantly gain
intelligence about your domain, your content, and your users
through continuous feedback loops.
• Your content already IS a hyper-structured knowledge graph.
Smart search technology makes this graph usable so you don’t
have to build it all again yourself.
• The nuance of natural language really matters. Though all models
are wrong, make sure yours are “useful.”
• AI and Search represent an evolution in Knowledge Management.
They will disrupt some current practices, but ultimately serve as a
highly-complementary tool set to most practitioners.
Summary
Questions?
Trey Grainger
trey@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: 39grainger
Thank you!

AI, Search, and the Disruption of Knowledge Management

  • 1.
    2019.05.15 Reflected Intelligence: AI, Search,and the Disruption of KM DOD and KM Symposium 2019: What is the Future? Trey Grainger Chief Algorithms Officer
  • 2.
    Trey Grainger Chief AlgorithmsOfficer • Previously: SVP of Engineering @ Lucidworks; Director of Engineering @ CareerBuilder • Georgia Tech – MBA, Management of Technology • Furman University – BA, Computer Science, Business, & Philosophy • Stanford University – Information Retrieval & Web Search Other fun projects: • Co-author of Solr in Action, plus numerous research publications • Advisor to Presearch, the decentralized search engine • Advisor to several startups • Open Source Apache Lucene / Solr contributor About Me
  • 3.
    The Search &AI Conference COMPANY BEHIND Who are we? 230 CUSTOMERS ACROSS THE FORTUNE 1000 400+EMPLOYEES OFFICES IN San Francisco, CA (HQ) Raleigh-Durham, NC Cambridge, UK Bangalore, India Hong Kong Employ about 40% of the active committers on the Solr project 40% Contribute over 70% of Solr's open source codebase70% DEVELOP & SUPPORT Apache
  • 4.
    Industry’s most powerful IntelligentSearch & Discovery Platform.
  • 6.
    What is theGoal of Knowledge Management?
  • 8.
    Creating Sharing Using Managing Knowledge& Information Why: Achieve Organizational Objectives
  • 9.
    Creating Sharing Using Managing Why:Achieve Organizational Objectives Knowledge & Information
  • 10.
    Creating Sharing Using Managing Knowledge& Information What?: Why?: Achieve Organizational Objectives How?: Right Answers + Right People + Right Time
  • 11.
    Creating Sharing Using Managing Knowledge& Information What?: Why?: Achieve Organizational Objectives How?: Right Answers + Right People + Right Time
  • 12.
    Search has becometoday’s de-facto user interface for delivering knowledge & information for seeking knowledge & information
  • 15.
  • 16.
    • Proudly builtwith open- source tech at its core: Apache Solr & Apache Spark • Personalizes work with applied machine learning • Proven on the biggest corporate & government information systems
  • 18.
    Let the mostrespected analysts in the world speak on our behalf Dassault Systèmes Mindbreeze Coveo Microsoft Attivio Expert System Smartlogic Sinequa IBM IHS Markit Funnelback Micro Focus COMPLETENESS OF VISION ABILITYTOEXECUTE CHALLENGERS LEADERS NICHE PLAYERS VISIONARIES Source: June 2018 Gartner Magic Quadrant report on Insight Engines. © Gartner, Inc.
  • 19.
    What do youmean by "Search”?
  • 20.
    20 Years Ago: Searchwas navigating a Taxonomy of Relationships
  • 21.
    10 Years Ago Searchwas finding 10 Blue Links
  • 22.
    Today’s Search Is: •Domain-aware •Assistive •Contextual& Personalized (location, last search, profile) •Conversational •Multi-modal (Text, Voice, Images, Event/Pushed-based) •Smart (AI-powered) •Beyond links and information to Answers and Action
  • 26.
    Basic Keyword Search (invertedindex, tf-idf, bm25, multilingual text analysis, query formulation, etc.) Query Intent (query classification, semantic query parsing, semantic knowledge graphs, concept expansion, automatic query rewrites, clustering, classification, personalization, question/answer systems, virtual assistants) Automated Relevancy Tuning (Signals, AB Testing/multi-armed bandits/back-testing, genetic algorithms, Deep Learning, Learning to Rank) Self-learning Taxonomies / Entity Extraction (entity recognition, taxonomies, ontologies, business rules, synonyms, etc.) Search Intelligence Spectrum
  • 27.
    Key Query IntentComponents: • Apache Solr • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  • 28.
    Through these tools,the engine self-learns domain-specific semantic relationships
  • 29.
    … and enablesdomain experts to easily accept or adjust the built in AI… …completely deferring to the AI, or trusting it above a certain confidence level, or even manually approving every suggestion.
  • 30.
  • 32.
  • 33.
  • 34.
    Importance of FeedbackLoops User Searches User Sees Results User takes an action Users’ actions inform system improvements Southern Data Science
  • 35.
    Signal Boosting User Searches User Sees Results User takes an action Users’actions inform system improvements User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc12 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … Query Document Signal Boost pizza doc22 54,321 pizza doc12 987 soup doc17 1,234 soup doc2 2,345 … … pizza ⌕ query: pizza boost: doc22^54321 boost: doc12^987 ƒ(x) = Σ(click * click_weight * time_decay) + Σ(purchase * purchase_weight * time_decay) + other_factors
  • 36.
  • 38.
  • 40.
    • 200%+ increasein click-through rates • 91% lower TCO • 50,000 fewer support tickets • Increased customer satisfaction
  • 41.
    Learning to Rank User Searches User Sees Results User takesan action Users’ actions inform system improvements User Query Re Alonzo pizza do do do Elena soup do do do Ming pizza do do do … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc12 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … Feature Weight title_match_any_terms 15.25 is_known_category 10 popularity 9.5 content_age 9.2 … … pizza ⌕ Initial Results: 1) doc1 2) doc2 3) doc3 Build Ranking Classifier (from Implicit Relevance Judgements) Final Results: 1) doc3 2) doc1 3) doc2
  • 42.
    Collaborative Filtering (Recommendations) User Searches User Sees Results User takesan action Users’ actions inform system improvements User Query Results Alonzo pizza doc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc12 Alonzo purchase doc22 Ming click doc22 Ming purchase doc12 Elena click doc2 … … … User Item Weight Alonzo doc22 1.0 Alonzo doc12 0.4 … … … Ming doc12 0.9 Ming doc22 0.6 … … … pizza ⌕ Matrix Factorization Recommendations for Alonzo: • doc22: “Peperoni Pizza” • doc12: “Cheese Pizza” …
  • 43.
    Summary of Signal-basedRanking Models • Signals Boosting: Ensures most popular specific content/answers for a query returns first • Learning to Rank: Learns a model of which combinations of features generally matter most across users, and ranks all content/answers using that model • Collaborative Filtering: Learns which items are best to recommend to a given user (or related to a given item) based on behavior of other users who have previously interacted with similar items
  • 44.
    User Searches User Sees Results User takes an action Today, manyorganizations run A/B experiments to test hypotheses to “limit” the unknown negative impact to a subset of users
  • 45.
    …and then makeonly the specific choices that will achieve the desired outcomes But what if we could peer into millions of alternate futures…
  • 46.
    In other words,imagine if we could simulate user interactions to changes before ever having to expose real users to those changes?
  • 48.
    User Query Results Alonz o pizzadoc10, doc22, doc12, … Elena soup doc84, doc2, doc17, … Ming pizza doc10, doc22, doc12, … … … … User Action Document Alonzo click doc22 Elena click doc17 Ming click doc10 Alonzo purchase doc22 Ming click doc22 Ming purchase doc22 Elena click doc2 … … … Relevance Simulation (backtesting)
  • 51.
  • 54.
    How do youtell this story?
  • 55.
    Goal: Enable ourusers to independently turn data into information into knowledge
  • 64.
  • 71.
    The Curation Challenge: Gracefullycombining human and machine intelligence to deliver relevance
  • 79.
  • 82.
  • 83.
  • 85.
  • 86.
    NLP: NER, Phrases,POS Document Classification Anomaly Detection Clustering Topic Detection Connectors ETL Pipelines Search Engine & Data Processing SQL Engine Rules Engine Scheduling & Alerting Query Pipelines Query Intent Detector Automatic Relevancy Signals & Query Analytics Recommenders A/B Testing Scalable Operations Extensible System Generated Application Generated Data Modular Components Stateless Architecture User-focused Experience Geospatial Mapping Results Preview Rapid Prototyping Digital Workplace Solution CloudScalable CDCR Security Human Generated Connect users to insights precisely at their moment of need any format, any platform
  • 87.
    What is aKnowledge Graph? (vs. Ontology vs. Taxonomy vs. Synonyms, etc.)
  • 89.
    Overly Simplistic Definitions Ontology:Defines relationships between types of things [ animal eats food; human is animal ] Knowledge Graph: Instantiation of an Ontology (contains the things that are related) [ john is human; john eats food ] Taxonomy: Classifies things into Categories [ john is Human; Human is Mammal; Mammal is Animal ] Synonyms List: Provides substitute words that can be used to represent the same or very similar things [ human => homo sapien, mankind; food => sustenance, meal ] Alternative Labels: Substitute words with identical meanings [ CTO => Chief Technology Officer; specialise => specialize ] In practice, there is significant overlap… Synonyms List Taxonomy Ontology Knowledge Graph Alt. Labels
  • 91.
    What kind of Knowledge Graph can help us with the kinds of problems we encounter in Search use cases?
  • 92.
    Knowledge Graph
    Challenges of building a traditional knowledge graph
    Because current knowledge bases / ontology learning systems typically require explicitly modeling nodes and edges into a graph ahead of time, this unfortunately presents several limitations to the use of such a knowledge graph:
      • Entities not modeled explicitly as nodes have no known relationships to any other entities.
      • Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore such a graph is not ideal for representing nuanced meanings of an entity when appearing within different contexts, as is common within natural language.
      • Substantial meaning is encoded in the linguistic representation of the domain that is lost when the underlying textual representation is not preserved: phrases, interaction of concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing those entities, variations in spelling and other representations of entities, the use of adjectives to modify entities to represent more complex concepts, and aggregate frequencies of occurrence for different representations of entities relative to other representations.
      • It can be an arduous process to create robust ontologies, map a domain into a graph representing those ontologies, and ensure the generated graph is compact, accurate, comprehensive, and kept up to date.
    Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  • 94.
    most often used in reference to “free text”
  • 95.
    But… unstructured data is really more like “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  • 96.
    Structured Data

    Employees Table
    id      name           company  start_date
    lw100   Trey Grainger  1234     2016-02-01
    dis2    Mickey Mouse   9123     1928-11-28
    tsla1   Elon Musk      5678     2003-07-01

    Companies Table
    id      name        start_date
    1234    Lucidworks  2016-02-01
    5678    Tesla       1928-11-28
    9123    Disney      2003-07-01

    [Callout labels: Discrete Values, Continuous Values, Foreign Key]
  • 97.
    Unstructured Data Trey Grainger works at Lucidworks. He is speaking at the 2019 DOD & Federal KM Symposium. #KMSymposium is being held in Baltimore May 14-16, 2019. Trey got his master's from Georgia Tech.
  • 98.
    Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail Unstructured Data
  • 99.
    Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail Foreign Key?
  • 100.
    Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail Fuzzy Foreign Key? (Entity Resolution)
  • 101.
    Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features)
  • 102.
    Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features) Not so fast!
  • 105.
    Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the DOD & Federal KM Symposium 2019. #KMSymposium (DOD & Federal KM Symposium) is being held in Baltimore May 14-16, 2019. Trey got his master's degree from Georgia Tech. Trey’s Voicemail
  • 106.
  • 107.
    Knowledge Graph: Graph Traversal
    [Figure, Data Structure View: docs 1 to 6 connected to skill values (Java, Scala, Hibernate, Oncology) and job_title values (Software Engineer, Data Scientist, Java Developer, …) via Inverted Index Lookup and Forward Index Lookup steps]
    [Figure, Graph View: nodes Java, Java Developer, Hibernate, Scala, Software Engineer, Data Scientist, joined by has_related_skill and has_related_job_title edges]
    Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
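    The traversal this slide depicts (term to documents through an inverted index, then documents to related terms through a forward index) can be sketched with plain dictionaries. This is an illustrative toy, not the Solr implementation: the corpus, the document-to-value assignments, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus. Most field values echo labels on the slide, but
# which document holds which value is invented for illustration.
DOCS = {
    1: {"skill": ["java"], "job_title": ["software engineer"]},
    2: {"skill": ["java", "hibernate"], "job_title": ["java developer"]},
    3: {"skill": ["scala"], "job_title": ["data scientist"]},
    4: {"skill": ["java", "scala"], "job_title": ["software engineer"]},
    5: {"skill": ["oncology"], "job_title": ["registered nurse"]},
}


def build_inverted_index(docs):
    """Map (field, value) to the set of doc ids containing that value."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, values in fields.items():
            for value in values:
                index[(field, value)].add(doc_id)
    return index


def traverse(docs, index, field, value, target_field):
    """Follow one edge: term -> docs (inverted index) -> terms (forward index)."""
    counts = defaultdict(int)
    for doc_id in index.get((field, value), set()):
        # DOCS itself plays the role of the forward index here.
        for target in docs[doc_id].get(target_field, []):
            counts[target] += 1
    return dict(counts)


INDEX = build_inverted_index(DOCS)
related_titles = traverse(DOCS, INDEX, "skill", "java", "job_title")
related_skills = traverse(DOCS, INDEX, "job_title", "software engineer", "skill")
```

    The raw co-occurrence counts returned here are only the edges; weighting them against a background corpus is a separate step.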
  • 108.
    Scoring of Node Relationships (Edge Weights)
    Foreground vs. Background Analysis
    Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

    z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))

    Foreground Query: "Hadoop"

    { "type":"keywords",
      "values":[
        { "value":"hive", "relatedness":0.9773, "popularity":369 },
        { "value":"java", "relatedness":0.9236, "popularity":15653 },
        { "value":".net", "relatedness":0.5294, "popularity":17683 },
        { "value":"bee", "relatedness":0.0, "popularity":0 },
        { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
        { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }

    We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus)
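    The z formula on this slide translates directly into a few lines of code. The sketch below assumes the slide's three inputs are already known; the function name and the sample numbers are made up. The relatedness values in the slide's JSON fall between -1 and 1, so some normalization is presumably applied to z afterwards. That step is not shown on the slide and is not reproduced here.

```python
import math


def relatedness_z(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count against its background expectation.

    count_fg:      foreground documents containing the term, countFG(x)
    total_docs_fg: total documents in the foreground set, totalDocsFG
    prob_bg:       probability of the term in the background set, probBG(x)
    """
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1.0 - prob_bg)
    if variance <= 0.0:
        # Term never (or always) occurs in the background: z is undefined,
        # so this sketch reports no relationship.
        return 0.0
    return (count_fg - expected) / math.sqrt(variance)


# Made-up numbers: 1000 foreground docs for the query "Hadoop".
z_over = relatedness_z(count_fg=300, total_docs_fg=1000, prob_bg=0.01)
z_neutral = relatedness_z(count_fg=100, total_docs_fg=1000, prob_bg=0.10)
z_under = relatedness_z(count_fg=5, total_docs_fg=1000, prob_bg=0.10)
```

    A term that is over-represented in the foreground gets a large positive z, one that matches its background rate gets roughly zero, and an under-represented term goes negative.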
  • 109.
  • 110.
  • 111.
    Related term vector (for query concept expansion) http://localhost:8983/solr/stack-exchange-health/skg
  • 112.
    Content-based Recommendations (More Like This on Steroids) http://localhost:8983/solr/job-postings/skg
  • 113.
    Who’s in Love with Jean Grey?
  • 114.
    Named Entity Recognition (NER)
    NER automatically translates…
      Barack Obama was the president of the United States of America. Before that, Obama was a senator.
    into…
      <person id="barack_obama">Barack Obama</person> was the <role>president</role> of the <country id="usa">United States of America</country>. Before that, <person id="barack_obama">Obama</person> was a <role>senator</role>.
    In the search engine, this would become:
      text: Barack Obama was the president of the United States of America. Before that, Obama was a senator.
      person: Barack Obama
      country: United States of America
      role: [ president, senator ]
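    The last step on this slide, turning inline entity tags into search-engine fields, can be sketched as follows. The tagged text is the slide's own example; the regular expression, the function name, and the rule of keeping only the first surface form per entity id are assumptions made for illustration. The entity recognition itself is not shown.

```python
import re

# Tagged sentence copied from the slide.
TAGGED = (
    '<person id="barack_obama">Barack Obama</person> was the '
    '<role>president</role> of the '
    '<country id="usa">United States of America</country>. Before that, '
    '<person id="barack_obama">Obama</person> was a <role>senator</role>.'
)

# Matches <type id="...">surface</type>, where the id attribute is optional.
TAG_PATTERN = re.compile(r'<(\w+)(?:\s+id="([^"]*)")?>(.*?)</\1>')


def to_search_document(tagged):
    """Turn inline entity tags into a flat document with one field per entity type."""
    doc = {"text": TAG_PATTERN.sub(lambda m: m.group(3), tagged)}
    seen_ids = set()
    for entity_type, entity_id, surface in TAG_PATTERN.findall(tagged):
        if entity_id:
            # Mentions sharing an id are the same entity: keep the first form only.
            if (entity_type, entity_id) in seen_ids:
                continue
            seen_ids.add((entity_type, entity_id))
        values = doc.setdefault(entity_type, [])
        if surface not in values:
            values.append(surface)
    return doc


search_doc = to_search_document(TAGGED)
```

    The resulting dictionary has the same four fields as the slide: `text`, `person`, `country`, and `role`.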
  • 115.
    Differentiating related terms
    Misspellings: managr => manager
    Synonyms: cpa => certified public accountant; rn => registered nurse; r.n. => registered nurse
    Ambiguous Terms*: driver => driver (trucking) ~80% likelihood; driver => driver (software) ~20% likelihood
    Related Terms: r.n. => nursing, bsn; hadoop => mapreduce, hive, pig
    *differentiated based upon user and query context
  • 116.
    Use Case: Query Disambiguation
    Example: Related Keywords (representing multiple meanings)
    driver: truck driver, linux, windows, courier, embedded, cdl, delivery
    architect: autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
    …: …
    Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 118.
    A few methodologies: 1) Query Log Mining 2) Semantic Knowledge Graph
  • 119.
    Semantic Knowledge Graph: Discovering ambiguous phrases
    1) Use a document classification field (i.e. category) as the first level of a graph, and the related terms as the second level to which you traverse.
    2) This has the benefit that you don’t need query logs to mine, but the results will be representative of your data rather than your users’ intent, so the quality depends on how clean and representative your documents are.
    Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of categories.
    Effectively a dynamically materialized probabilistic graphical model
  • 120.
    Disambiguation by Category Example
    Meaning 1: Restaurant => bbq, brisket, ribs, pork, …
    Meaning 2: Outdoor Equipment => bbq, grill, charcoal, propane, …
  • 121.
    Disambiguated meanings (represented as term vectors)
    Example: Related Keywords (Disambiguated Meanings)
    architect
      1: enterprise architect, java architect, data architect, oracle, java, .net
      2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
    driver
      1: linux, windows, embedded
      2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
    designer
      1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
      2: graphic, web designer, design, web design, graphic design, graphic designer
      3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
    …: …
    Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 122.
    Using the disambiguated meanings
    In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?
    1. Any pre-existing knowledge about the user:
      • User is a software engineer
      • User has previously run searches for “c++” and “linux”
    2. Context within the query: User searched for windows AND driver vs. courier OR driver
    3. If all else fails (and there is no context), use the most commonly occurring meaning.
    driver
      1: linux, windows, embedded
      2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
    Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
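    The three fallbacks listed on this slide can be sketched as a small scoring function. The term vectors for "driver" come from the slide, and the 80/20 priors come from the earlier "Differentiating related terms" slide. The function name and the scoring rule (count overlapping context terms, break ties by prior) are hypothetical simplifications: the sketch pools user history and query context rather than ranking one above the other.

```python
# Senses of "driver" with their term vectors (from the slide) and approximate
# prior likelihoods (from the earlier "Ambiguous Terms" example).
SENSES = {
    "driver": [
        {"label": "software", "prior": 0.2,
         "terms": {"linux", "windows", "embedded"}},
        {"label": "trucking", "prior": 0.8,
         "terms": {"truck driver", "cdl driver", "delivery driver",
                   "class b driver", "cdl", "courier"}},
    ]
}


def pick_sense(term, query_terms=(), user_terms=()):
    """Pick the sense whose term vector best overlaps the available context.

    query_terms: other terms in the current query
    user_terms:  terms known about the user, e.g. from previous searches
    With no overlapping context, the sense with the highest prior wins.
    """
    context = {t.lower() for t in query_terms} | {t.lower() for t in user_terms}
    context.discard(term)

    def score(sense):
        # Tuples compare element by element: overlap first, then prior.
        return (len(sense["terms"] & context), sense["prior"])

    return max(SENSES[term], key=score)["label"]


by_query = pick_sense("driver", query_terms=["windows", "driver"])
by_user = pick_sense("driver", user_terms=["c++", "linux"])
by_default = pick_sense("driver")
```

    A production system would likely weight the evidence rather than count exact matches, but the order of fallbacks is the same.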
  • 123.
    Thought Exercise: What do you think of when I say the word “Facebook”?
  • 124.
    Every term or phrase is a context-dependent cluster of meaning with an ambiguous label
  • 125.
    What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  • 126.
    What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg "embrace"
  • 127.
    What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  • 128.
    So what’s my end goal here?
    User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java
    Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
    Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
    Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
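    The step from the semantically parsed query to the semantically expanded query can be sketched as string assembly over a lookup table. The sketch assumes the phrases have already been identified and that related terms come from a table; here that table is hard-coded with the alternatives shown on the slide, whereas in practice they would come from something like the knowledge graph. The function names are hypothetical, and the location clause with its geo filter is left out.

```python
# Related terms per phrase, copied from the slide's expanded query.
EXPANSIONS = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term):
    """Quote multi-word terms and terms containing '&'; leave single words bare."""
    return f'"{term}"' if " " in term or "&" in term else term


def expand(phrases, boost=10):
    """Build one OR-group per phrase (original phrase boosted) and AND the groups."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]
        alternatives += [quote(alt) for alt in EXPANSIONS.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


expanded = expand(["machine learning", "hadoop", "java"])
```

    Boosting the original phrase keeps exact matches ranked above documents that match only through an expansion.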
  • 129.
  • 130.
    Why this Semantic Nuance Matters
  • 131.
    Search Intelligence Spectrum
    Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)
    Query Intent (query classification, semantic query parsing, semantic knowledge graphs, concept expansion, automatic query rewrites, clustering, classification, personalization, question/answer systems, virtual assistants)
    Automated Relevancy Tuning (Signals, A/B Testing/multi-armed bandits/back-testing, genetic algorithms, Deep Learning, Learning to Rank)
    Self-learning
    Taxonomies / Entity Extraction (entity recognition, taxonomies, ontologies, business rules, synonyms, etc.)
  • 132.
    Summary
      • Search is today’s de-facto User Experience for delivering knowledge and information.
      • Reflected Intelligence uses content + signals to constantly gain intelligence about your domain, your content, and your users through continuous feedback loops.
      • Your content already IS a hyper-structured knowledge graph. Smart search technology makes this graph usable so you don’t have to build it all again yourself.
      • The nuance of natural language really matters. Though all models are wrong, make sure yours are “useful.”
      • AI and Search represent an evolution in Knowledge Management. They will disrupt some current practices, but ultimately serve as a highly complementary tool set to most practitioners.
  • 133.
  • 134.