What if instead of a query returning documents, you could alternatively return other keywords most related to the query: i.e. given a search for "data science", return me back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr’s Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields) allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and will walk you through how to get up and running with an example dataset to explore the meaningful relationships hidden within your data.
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
The Apache Solr Semantic Knowledge Graph
1. The Apache Solr Semantic Knowledge Graph
Trey Grainger
SVP of Engineering @ Lucidworks
2. The Apache Solr Semantic Knowledge Graph
Trey Grainger
SVP of Engineering @ Lucidworks
3. • “A compact, auto-generated model for real-time traversal and
ranking of any relationship within a domain”
• A multi-dimensional term-to-term (vs. term-to-document) search
engine
• A tool which enables knowledge modeling and reasoning, natural
language processing, anomaly detection, data cleansing, semantic
search, analytics, data classification, root cause analysis, and
recommendations systems
• It’s kind of like Word2Vec, but better suited for interpreting the
nuanced intent of typical search queries
What is the Semantic Knowledge Graph?
5. Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
• Startup Investor / Advisor
About Me
6. Based in San Francisco, offices
and employees worldwide
Over 300 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary commercial
contributor to the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
40%
70%
7. Highly scalable
search engine and
NoSQL datastore that
gives you instant
access to all your
data.
Surface the insights
that matter most with
the power of machine
learning and artificial
intelligence
Create bespoke data and
discovery applications with
modular UI components for
web and mobile.
The Lucidworks platform provides all of the components needed to design, develop
and deploy smart search experiences for Enterprise and Consumer search
applications
Lucidworks Fusion product suite
Fusion Cloud
combines the power of
Fusion with the
simplicity you’d expect
in a SaaS-based
application.
10. A Graph
DSAA 2016
Montreal
Quebec Canada
Semantic
Knowledge
Graph Paper
Trey
Grainger
Mohammed
Korayem
Andries
Smith
Khalifeh
AlJadda
in_country
Node / Vertex
Edge
11. Term Documents
a doc1 [2x]
brown doc3 [1x] , doc5 [1x]
cat doc4 [1x]
cow doc2 [1x] , doc5 [1x]
… ...
once doc1 [1x], doc5 [1x]
over doc2 [1x], doc3 [1x]
the doc2 [2x], doc3 [2x],
doc4[2x], doc5 [1x]
… …
Document Content Field
doc1 once upon a time, in a land far,
far away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped over
the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo”
once.
… …
What you SEND to Lucene/Solr:
How the content is INDEXED into
Lucene/Solr (conceptually):
The inverted index
13. BM25 (Okapi “Best Match” 25th Iteration)
Score(q, d) =
∑ idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl )
t in q
Where:
t = term; d = document; q = query; i = index
tf(t in d) = numTermOccurrencesInDocument ½
idf(t) = 1 + log (numDocs / (docFreq + 1))
|d| = ∑ 1
t in d
avgdl = = ( ∑ |d| ) / ( ∑ 1 ) )
d in i d in i
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
15. Knowledge
Graph
Challenges of building a traditional knowledge graph
Because current knowledge bases / ontology learning systems typically
requires explicitly modeling nodes and edges into a graph ahead of time, this
unfortunately presents several limitations to the use of such a knowledge graph:
• Entities not modeled explicitly as nodes have no known relationships to any other
entities.
• Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore
such a graph is not ideal for representing nuanced meanings of an entity when appearing
within different contexts, as is common within natural language.
• Substantial meaning is encoded in the linguistic representation of the domain that is
lost when the underlying textual representation is not preserved: phrases, interaction of
concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing
those entities, variations in spelling and other representations of entities, the use of adjectives
to modify entities to represent more complex concepts, and aggregate frequencies of
occurrence for different representations of entities relative to other representations.
• It can be an arduous process to create robust ontologies, map a domain into a graph
representing those ontologies, and ensure the generated graph is compact, accurate,
comprehensive, and kept up to date.
16. 01
Knowledge
Graph
Semantic Data Encoded into Free Text Content
e en eng engi engineer engineers
engineer engineersNode Type: Term
software
engineer
software
engineers
electrical
engineering
engineer
engineering software
…
…
…
Node Type:
Character Sequence
Node Type:
Term Sequence
Node Type:
Document
id: 1
text: looking for a software
engineerwith degree in
computer science or
electrical engineering
id: 2
text: apply to be a software
engineer and work with
other great software
engineers
id: 3
text: start a great careerin
electrical engineering
…
…
17. Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple
levels of relationships between items in a domain.
Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage the core relationship scoring
engine to score any list of entities against any other list
Full domain support
Keywords, categories, tags, based upon any field on your
documents. Graph is build automatically from the
content representing your domain.
Intersections, overlaps, & relationship
scoring, many levels deep
Users can either provide a list of items to score, or else
have the system dynamically discover the most related
items (or both).
Knowledge
Graph
19. id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
field term postings list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …
20. Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Set-theory View
Graph View
How the Graph Traversal Works
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
Data Structure View
Java
Scala Hibernate
docs
1, 2, 6
docs
3, 4
Oncology
doc 5
21. Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Multi-level Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_job_title
has_related_job_title
25. Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness": -0.3802 "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
+
-
Foreground Query:
"Hadoop"
Knowledge
Graph
26. Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Multi-level Graph Traversal with Scores
software engineer*
(materialized node)
Java
C#
.NET
.NET
Developer
Java
Developer
Hibernate
ScalaVB.NET
Software
Engineer
Data
Scientist
Skill
Nodes
has_related_skillStarting
Node
Skill
Nodes
has_related_skill Job Title
Nodes
has_related_job_title
0.90
0.88 0.93
0.93
0.34
0.74
0.91
0.89
0.74
0.89
0.780.72
0.48
0.93
0.76
0.83
0.80
0.64
0.61
0.780.55
28. Knowledge
Graph
Use Case: Document Summarization /
Content-based Recommendations
Experiment: Pass in raw text
(extracting phrases as needed), and
rank their similarity to the documents
using the SKG.
Additionally, can traverse the graph
to “related” entities/keyword phrases
NOT found in the original document
Applications: Content-based and
multi-modal recommendations
(no cold-start problem), data cleansing
prior to clustering or other ML methods,
semantic search / similarity scoring
32. Use Case: Query Disambiguation
Example Related Keywords (representing multiple meanings)
driver truck driver, linux, windows, courier, embedded, cdl,
delivery
architect autocad drafter, designer, enterprise architect, java
architect, designer, architectural designer, data architect,
oracle, java, architectural drafter, autocad, drafter, cad,
engineer
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
34. Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each
search in the search logs
(i.e. by the job title
classifications of the jobs to
which they applied)
3) Segment the search term => related search terms list by classification,
to return a separate related terms list per classification
2) Create a probabilistic graphical model of those classifications mapped
to each keyword phrase.
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
35. Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use
a document classification
field (i.e. category) as the first
level of your graph, and the
related terms as the second
level to which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your user’s intent, so the quality depends on how clean
and representative your documents are.
Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of
categories. Effectively an dynamically-materialized probabilistic graphical model
36. Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect 1: enterprise architect, java architect, data architect, oracle, java, .net
2: architectural designer, architectural drafter, autocad, autocad drafter, designer,
drafter, cad, engineer
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic,
photoshop, video
2: graphic, web designer, design, web design, graphic design, graphic designer
3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe,
structural designer, revit
… …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
37. Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
38. Knowledge
Graph
Use Case: Query Expansion
Experiment: Take an initial query, and expand keyword
phrases to include the most related entities to that query
Example:
39. The Semantic Search Problem
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d") AND
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
44. Per-field options:
{field_name}.fallback string
An alternative (usually free-text) field to query in case foreground returns less than min_popularity. Example: {field_name}.fallback=content
{field_name}.facet-field string
A suffix to find a copy-field version of the value in {field_name} to use for the node during traversal. Example: {field_name}.facet-field=id-name
{field_name}.key string
A parallel key / id field to use to represent a particular value. Typically used to disambiguate multiple entities (each with a unique id) sharing the same textual name
Example:{field.v1.key}=id
57. Graph Query Parser
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
58. • Term Frequency: “How well a term describes a document?”
– Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3
59. Based in San Francisco, offices
and employees worldwide
Over 300 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary sponsor of
the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
40%
70%
63. Knowledge Graph – Potential Use Cases
Cross-walk between Types
• Have an ID field, but want to enable free text search
on the most associated entity with that ID?
• Have a “state” (geo) search box, but want to accept
any free-text location and map it to the right state?
• Have an old classification taxonomy and want to
know how the values from the old system now map
into the new values?
Build User Profiles from Search Logs
• If someone searches for “Java”, and then “JQuery”,
and then “CSS”, and then “JSP”, what do those have
in common?
• What if they search for “Java”, and then “C++”, and
then “Assembly”?
Discover Relationships Between Anything
• If I want to become a data scientist and know
Python, what libraries should I learn?
• If my last job was mid-level software engineer and
my current job is Engineering Lead, what are my
most likely next roles?
Traverse arbitrarily deep, Sort on anything
• Build an instant co-occurrence matrix, sort the top
values by their relatedness, and then add in any
number of additional dimensions (RAM permitting).
Data Cleansing
• Have dirty taxonomies and need to figure out which
items don’t belong?
• Need to understand the conceptual cohesion of a
document (vs spammy or off-topic content)?
Knowledge
Graph
64. Knowledge
Graph
Future Work
• Semantic Search (more experiments)
• Search Engine Relevancy Algorithms
• Trending Topics
• Recommendation Systems
• Root Cause Analysis
• Abuse Detection
65. Knowledge
Graph
Conclusion
Applications:
The Semantic Knowledge Graph has numerous applications, including
automatically building ontologies, identification of trending topics over time,
predictive analytics on timeseries data, root-cause analysis surfacing concepts
related to failure scenarios from free text, data cleansing, document
summarization, semantic search interpretation and expansion of queries,
recommendation systems, and numerous other forms of anomaly detection.
Main contribution of this paper:
The introduction (and open sourcing) of the the Semantic Knowledge Graph, a
novel and compact new graph model
that can dynamically materialize and score the relationships between any arbitrary
combination of entities represented within a corpus of documents.
67. Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple
levels of relationships between items in a domain.
Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage the core relationship scoring
engine to score any list of entities against any other list
Full domain support
Keywords, categories, tags, based upon any field on your
documents. Graph is build automatically from the
content representing your domain.
Intersections, overlaps, & relationship
scoring, many levels deep
Users can either provide a list of items to score, or else
have the system dynamically discover the most related
items (or both).
Knowledge
Graph