The Apache Solr Semantic Knowledge Graph

Trey Grainger
SVP of Engineering @ Lucidworks

What is the Semantic Knowledge Graph?
• “A compact, auto-generated model for real-time traversal and
ranking of any relationship within a domain”
• A multi-dimensional term-to-term (vs. term-to-document) search
engine
• A tool which enables knowledge modeling and reasoning, natural
language processing, anomaly detection, data cleansing, semantic
search, analytics, data classification, root cause analysis, and
recommendation systems
• It’s kind of like Word2Vec, but better suited for interpreting the
nuanced intent of typical search queries

Agenda
• Introduction
• Overview of Semantic Knowledge Graph
- Index Structure
- Graph Traversal Mechanics
- Edge Weights/Relevancy Scoring
• Use Cases
- Document Summarization / Content-based Recommendations
- Predictive Analytics
- Data Cleansing
- Document Classification & Enrichment
- Query Disambiguation
- Semantic Search / Concept Expansion
• Installation
• Live Demos
• Q&A

About Me
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
• Startup Investor / Advisor

Based in San Francisco, offices
and employees worldwide
Over 300 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary commercial
contributor to the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase

Highly scalable
search engine and
NoSQL datastore that
gives you instant
access to all your
data.
Surface the insights
that matter most with
the power of machine
learning and artificial
intelligence
Create bespoke data and
discovery applications with
modular UI components for
web and mobile.
The Lucidworks platform provides all of the components needed to design, develop
and deploy smart search experiences for Enterprise and Consumer search
applications
Lucidworks Fusion product suite
Fusion Cloud
combines the power of
Fusion with the
simplicity you’d expect
in a SaaS-based
application.

Fusion powers search for the brightest companies in the
world.

A Graph
[Figure: example graph. Nodes (vertices): DSAA 2016; Montreal; Quebec; Canada; Semantic Knowledge Graph Paper; Trey Grainger; Mohammed Korayem; Andries Smith; Khalifeh AlJadda. Edge label shown: in_country. Legend: Node / Vertex, Edge.]

The inverted index
What you SEND to Lucene/Solr:
Document  Content Field
doc1      once upon a time, in a land far, far away
doc2      the cow jumped over the moon.
doc3      the quick brown fox jumped over the lazy dog.
doc4      the cat in the hat
doc5      The brown cow said “moo” once.
…         …
How the content is INDEXED into Lucene/Solr (conceptually):
Term   Documents
a      doc1 [2x]
brown  doc3 [1x], doc5 [1x]
cat    doc4 [1x]
cow    doc2 [1x], doc5 [1x]
…      …
once   doc1 [1x], doc5 [1x]
over   doc2 [1x], doc3 [1x]
the    doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…      …
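The conceptual indexing step can be sketched in a few lines of Python. This is only an illustration: the `tokenize` function is a naive stand-in (an assumption, not Lucene's analysis chain), and `build_inverted_index` is a hypothetical helper name.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Naive analyzer (assumption): lowercase, keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}, i.e. a simple postings list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return dict(index)

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}

index = build_inverted_index(docs)
print(index["the"])    # postings for "the", with per-document counts
print(index["brown"])  # postings for "brown"
```

Looking up a term is then a single dictionary access, which is what makes term-to-document matching fast.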

Matching queries to documents
/solr/collection/select/?q=apache solr
Term    Documents
…       …
apache  doc1, doc3, doc4, doc5
…       …
hadoop  doc2, doc4, doc6
…       …
solr    doc1, doc3, doc4, doc7, doc8
…       …
[Figure: Venn diagram of the matching document sets. apache only: doc5; solr only: doc7, doc8; apache solr (both terms): doc1, doc3, doc4]
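Matching a multi-term query against the postings lists reduces to set operations. The sketch below uses the postings shown on the slide; `match_all` and `match_any` are hypothetical helper names for AND and OR semantics respectively.

```python
# Postings lists copied from the slide's example index.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}

def match_all(terms, postings):
    """Documents containing every query term (intersection of postings)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def match_any(terms, postings):
    """Documents containing at least one query term (union of postings)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

query = ["apache", "solr"]
print(sorted(match_all(query, postings)))  # docs in the overlap of both sets
print(sorted(match_any(query, postings)))  # docs in either set
```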

BM25 (Okapi “Best Match” 25th Iteration)
$$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d) \cdot (k + 1)}{\mathrm{tf}(t, d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
Where:
t = term; d = document; q = query; i = index
$\mathrm{tf}(t, d) = \sqrt{\text{numTermOccurrencesInDocument}}$
$\mathrm{idf}(t) = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right)$
$|d| = \sum_{t \in d} 1$
$\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}$
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
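As a rough illustration, the formula can be transcribed directly into Python using the definitions given on this slide (square-root tf and the idf shown above). This is a sketch of the slide's formula, not Lucene's actual BM25 implementation; `bm25_score` and the tiny corpus are hypothetical.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one tokenized document against a query, per the slide's definitions."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs  # average document length
    length_norm = 1 - b + b * len(doc_terms) / avgdl
    score = 0.0
    for term in query_terms:
        occurrences = doc_terms.count(term)
        if occurrences == 0:
            continue  # a term absent from the document contributes nothing
        tf = math.sqrt(occurrences)
        doc_freq = sum(1 for d in corpus if term in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        score += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return score

# Hypothetical tokenized corpus, for illustration only.
corpus = [
    ["the", "cow", "jumped", "over", "the", "moon"],
    ["the", "cat", "in", "the", "hat"],
    ["the", "brown", "cow", "said", "moo", "once"],
]
for doc in corpus:
    print(round(bm25_score(["cow"], doc, corpus), 4), doc)
```

Raising `b` penalizes documents longer than average more strongly, while raising `k` lets repeated occurrences of a term keep adding to the score for longer.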

Knowledge
Graph
Challenges of building a traditional knowledge graph
Because current knowledge bases / ontology learning systems typically
require explicitly modeling nodes and edges into a graph ahead of time, this
unfortunately presents several limitations to the use of such a knowledge graph:
• Entities not modeled explicitly as nodes have no known relationships to any other
entities.
• Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore
such a graph is not ideal for representing nuanced meanings of an entity when appearing
within different contexts, as is common within natural language.
• Substantial meaning is encoded in the linguistic representation of the domain that is
lost when the underlying textual representation is not preserved: phrases, interaction of
concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing
those entities, variations in spelling and other representations of entities, the use of adjectives
to modify entities to represent more complex concepts, and aggregate frequencies of
occurrence for different representations of entities relative to other representations.
• It can be an arduous process to create robust ontologies, map a domain into a graph
representing those ontologies, and ensure the generated graph is compact, accurate,
comprehensive, and kept up to date.

Semantic Data Encoded into Free Text Content
Node Type: Character Sequence
e, en, eng, engi, engineer, engineers, …
Node Type: Term
engineer, engineers, …
Node Type: Term Sequence
software engineer, software engineers, electrical engineering, engineer, engineering software, …
Node Type: Document
id: 1
text: looking for a software engineer with degree in computer science or electrical engineering
id: 2
text: apply to be a software engineer and work with other great software engineers
id: 3
text: start a great career in electrical engineering
…

Knowledge Graph API
Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple
levels of relationships between items in a domain.
Core similarity engine, exposed via API
Any product can leverage the core relationship scoring
engine to score any list of entities against any other list
Full domain support
Keywords, categories, tags, based upon any field on your
documents. Graph is built automatically from the
content representing your domain.
Intersections, overlaps, & relationship scoring, many levels deep
Users can either provide a list of items to score, or else
have the system dynamically discover the most related
items (or both).

Documents
id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy
id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate
Docs-Terms Forward Index
field      doc  term
desc       1    a, at, company, engineer, great, software
           2    a, at, doing, hard, hospital, nurse, registered, work
           3    a, doing, engineer, java, or, software, work
job_title  1    Software Engineer
…          …    …
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Terms-Docs Inverted Index
field      term            postings list
                           doc  pos
desc       a               1    4
                           2    1
                           3    1, 5
           at              1    3
                           2    4
           company         1    6
           doing           2    6
                           3    8
           engineer        1    2
                           3    3, 7
           great           1    5
           hard            2    7
           hospital        2    5
           java            3    6
           nurse           2    3
           or              3    4
           registered      2    2
           software        1    1
                           3    2
           work            2    8
                           3    9
job_title  java developer  3    1
…          …               …    …
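Both index structures can be derived from the same documents in one pass. The following is a minimal sketch assuming whitespace tokenization and 1-based positions; `build_indexes` is a hypothetical helper, not part of Solr.

```python
from collections import defaultdict

# The three example job postings (desc field only, for brevity).
docs = {
    1: {"desc": "software engineer at a great company"},
    2: {"desc": "a registered nurse at hospital doing hard work"},
    3: {"desc": "a software engineer or a java engineer doing work"},
}

def build_indexes(docs, field):
    """Return (forward, inverted) indexes for one field.

    forward:  doc_id -> sorted unique terms          (docs-terms)
    inverted: term   -> {doc_id: [positions, ...]}   (terms-docs, 1-based)
    """
    forward = {}
    inverted = defaultdict(dict)
    for doc_id, fields in docs.items():
        tokens = fields[field].lower().split()
        forward[doc_id] = sorted(set(tokens))
        for position, term in enumerate(tokens, start=1):
            inverted[term].setdefault(doc_id, []).append(position)
    return forward, dict(inverted)

forward, inverted = build_indexes(docs, "desc")
print(forward[1])            # terms in doc 1
print(inverted["engineer"])  # docs and positions for "engineer"
```

The inverted index answers "which documents contain this term?", while the forward index answers "which terms does this document contain?". The graph traversal relies on having both.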

How the Graph Traversal Works
[Figure: three views of the same traversal: Set-theory View, Graph View, and Data Structure View. Nodes: skill: Java, skill: Scala, skill: Hibernate, skill: Oncology; documents doc 1 through doc 6. Data Structure View labels: Java; Scala, Hibernate; docs 1, 2, 6; docs 3, 4; Oncology; doc 5.]

Multi-level Traversal
[Figure: Data Structure View and Graph View. Data Structure View: skill nodes (Java, Scala, Hibernate, Oncology), documents doc 1 through doc 6, and job_title nodes (Software Engineer, Data Scientist, Java Developer, …), connected by alternating Inverted Index Lookup and Forward Index Lookup steps. Graph View: nodes Java, Java Developer, Hibernate, Scala, Software Engineer, Data Scientist; edge label: has_related_job_title.]
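The alternating lookups can be sketched as follows: an inverted index lookup turns a node into a document set, and a forward index lookup turns that document set into candidate nodes at the next level. The toy documents and the `traverse` helper below are hypothetical, chosen only to make the mechanics visible.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: six documents with skill and job_title fields.
docs = {
    1: {"skill": ["java"], "job_title": ["software engineer"]},
    2: {"skill": ["java"], "job_title": ["java developer"]},
    3: {"skill": ["java", "scala", "hibernate"], "job_title": ["java developer"]},
    4: {"skill": ["java", "scala", "hibernate"], "job_title": ["data scientist"]},
    5: {"skill": ["oncology"], "job_title": ["registered nurse"]},
    6: {"skill": ["java"], "job_title": ["software engineer"]},
}

# Inverted index: (field, value) -> set of doc ids.
inverted = defaultdict(set)
for doc_id, fields in docs.items():
    for field, values in fields.items():
        for value in values:
            inverted[(field, value)].add(doc_id)

def traverse(doc_set, target_field):
    """Forward index lookup: count target_field values across a document set."""
    counts = Counter()
    for doc_id in doc_set:
        counts.update(docs[doc_id].get(target_field, []))
    return counts

# Level 1: start at skill:java (inverted lookup), gather related skills (forward lookup).
java_docs = inverted[("skill", "java")]
related_skills = traverse(java_docs, "skill")

# Level 2: narrow to java AND scala, then gather job titles from the shared documents.
java_scala_docs = java_docs & inverted[("skill", "scala")]
related_titles = traverse(java_scala_docs, "job_title")

print(related_skills.most_common())
print(related_titles.most_common())
```

Each edge here is simply the set of documents two nodes share, so no edge has to be stored ahead of time.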

Materialization of new nodes through shared documents
[Figure: term nodes engineer, engineers, and Software; materialized nodes engineer* and software engineer*; documents doc 1 through doc 6; edge label: links_to.]

From the academic paper:
Structure:
Single-level Traversal / Scoring:
Multi-level Traversal / Scoring:

Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
$$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$$
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
Foreground Query: "Hadoop"
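The z-score above translates directly into code. The sketch below is a plain transcription of the formula; the function name, the zero-variance guard, and the example counts are assumptions for illustration, and it does not reproduce how the plugin normalizes z into the relatedness values shown in the JSON.

```python
import math

def z_score(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count versus its background probability.

    count_fg:      foreground docs containing the term
    total_docs_fg: total docs matching the foreground query
    prob_bg:       fraction of background docs containing the term
    """
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1 - prob_bg)
    if variance == 0:
        return 0.0  # assumption: treat degenerate cases as "no signal"
    return (count_fg - expected) / math.sqrt(variance)

# Hypothetical numbers: 1,000 foreground docs; the term appears in 1% of the background.
print(z_score(50, 1000, 0.01))  # over-represented in the foreground: positive
print(z_score(10, 1000, 0.01))  # appears exactly as often as expected: about zero
print(z_score(1, 1000, 0.01))   # under-represented in the foreground: negative
```

A term that is just as common in the foreground as in the background scores near zero, which is why such terms are effectively ignored.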

Multi-level Graph Traversal with Scores
[Figure: weighted graph in four columns: Starting Node, Skill Nodes, Skill Nodes, Job Title Nodes. Starting node: software engineer* (materialized node). Skill nodes: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes: .NET Developer, Java Developer, Software Engineer, Data Scientist. Edge label: has_related_skill. Each edge carries a relatedness weight between 0.34 and 0.93.]

Use Case: Document Summarization /
Content-based Recommendations
Experiment: Pass in raw text
(extracting phrases as needed), and
rank their similarity to the documents
using the SKG.
Additionally, can traverse the graph
to “related” entities/keyword phrases
NOT found in the original document
Applications: Content-based and
multi-modal recommendations
(no cold-start problem), data cleansing
prior to clustering or other ML methods,
semantic search / similarity scoring

Use Case: Predictive Analytics

Use Case: Data Cleansing
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness": 0.9765, "popularity":369 },
{ "value":"spark", "relatedness": 0.9634, "popularity":15653 },
{ "value":".net", "relatedness": 0.5417, "popularity":17683 },
{ "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
{ "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
{ "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
Foreground Query: "Hadoop"
Experiment: Data analyst
manually annotated 500
pairs of terms found together
in real query logs as
“relevant” or “not relevant”
Results: SKG removed 78%
of the terms while maintaining
a 95% accuracy at removing
the correct noisy pairs from
the input data.
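One way to apply scores like these for cleansing is to drop terms whose relatedness to the foreground falls below a cutoff. The sketch below uses the values from the response above; the `remove_noise` helper and the 0.25 threshold are assumptions for illustration and are not taken from the experiment.

```python
# Scored values copied from the slide's example response (foreground: "Hadoop").
scored_values = [
    {"value": "hive", "relatedness": 0.9765, "popularity": 369},
    {"value": "spark", "relatedness": 0.9634, "popularity": 15653},
    {"value": ".net", "relatedness": 0.5417, "popularity": 17683},
    {"value": "bogus_word", "relatedness": 0.0, "popularity": 0},
    {"value": "teaching", "relatedness": -0.1510, "popularity": 9923},
    {"value": "CPR", "relatedness": -0.4012, "popularity": 27089},
]

def remove_noise(values, min_relatedness=0.25):
    """Split terms into (kept, removed) using a hypothetical relatedness cutoff."""
    kept = [v["value"] for v in values if v["relatedness"] >= min_relatedness]
    removed = [v["value"] for v in values if v["relatedness"] < min_relatedness]
    return kept, removed

kept, removed = remove_noise(scored_values)
print("kept:", kept)
print("removed:", removed)
```

In practice the cutoff would need tuning against labeled pairs such as the annotated set described in the experiment.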

Use Case: Document Classification & Enrichment

Use Case: Query Disambiguation
Example    Related Keywords (representing multiple meanings)
driver     truck driver, linux, windows, courier, embedded, cdl, delivery
architect  autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…          …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

A few methodologies:
1) Query Log Mining
2) Semantic Knowledge Graph

Query Log Mining: Discovering ambiguous phrases
1) Classify users who ran each search in the search logs
(i.e. by the job title classifications of the jobs to which they applied)
2) Create a probabilistic graphical model of those classifications mapped
to each keyword phrase.
3) Segment the search term => related search terms list by classification,
to return a separate related terms list per classification

Semantic Knowledge Graph: Discovering ambiguous phrases
1) Exact same concept, but use
a document classification
field (i.e. category) as the first
level of your graph, and the
related terms as the second
level to which you traverse.
2) Has the benefit that you don’t need query logs to mine, but it will be representative
of your data, as opposed to your users’ intent, so the quality depends on how clean
and representative your documents are.
Additional Benefit: Multi-dimensional disambiguation and dynamic materialization of
categories. Effectively a dynamically materialized probabilistic graphical model.

Disambiguated meanings (represented as term vectors)
Example    Related Keywords (Disambiguated Meanings)
architect  1: enterprise architect, java architect, data architect, oracle, java, .net
           2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver     1: linux, windows, embedded
           2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer   1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
           2: graphic, web designer, design, web design, graphic design, graphic designer
           3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…          …

Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
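The selection logic above can be sketched as a simple overlap count between each meaning's term vector and whatever context is available (query terms, prior searches, profile keywords). `pick_sense` is a hypothetical helper, and treating the first listed meaning as the most common one is an assumption made only for this example.

```python
# Disambiguated meanings for "driver", copied from the slide.
SENSES = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}

def pick_sense(term, context_terms, senses=SENSES, default_index=0):
    """Return the meaning sharing the most terms with the context.

    Falls back to senses[term][default_index] when nothing overlaps
    (assumption: that entry is the most commonly occurring meaning).
    """
    candidates = senses[term]
    context = set(context_terms)
    overlaps = [len(sense & context) for sense in candidates]
    best = max(overlaps)
    if best == 0:
        return candidates[default_index]
    return candidates[overlaps.index(best)]

print(pick_sense("driver", ["windows"]))       # context within the query
print(pick_sense("driver", ["c++", "linux"]))  # context from prior searches
print(pick_sense("driver", ["courier"]))
print(pick_sense("driver", []))                # no context: default meaning
```

A production version would likely weight terms by relatedness rather than counting raw overlaps.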

Use Case: Query Expansion
Experiment: Take an initial query, and expand keyword
phrases to include the most related entities to that query
Example:

The Semantic Search Problem
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
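Assembling an expanded query of this shape is mostly string building once the related terms are known. In the sketch below the related terms are hard-coded from the example; in practice they would come from the Semantic Knowledge Graph. `expand_query` and `quote_phrase` are hypothetical helpers, and escaping of special query characters is deliberately left out.

```python
def quote_phrase(term):
    """Wrap multi-word terms in double quotes so they are parsed as phrases."""
    return f'"{term}"' if " " in term else term

def expand_query(phrases, related, boost=10):
    """Build '(original^boost OR related ...) AND (...)' from parsed phrases."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote_phrase(phrase)}^{boost}"]
        alternatives += [quote_phrase(r) for r in related.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)

# Related terms taken from the slide's expanded query.
related = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "hadoop": ["big data", "hbase", "hive"],
}

print(expand_query(["machine learning", "hadoop"], related))
```

Boosting the original phrase keeps exact matches ranked above documents that match only through an expansion.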

#1: Pull, Build, Start Solr
cd ~/demo
git clone https://github.com/apache/lucene-solr.git
cd lucene-solr
git checkout releases/lucene-solr/6.5.1
cd solr
ant server
bin/solr start -c -Denable.runtime.lib=true
#2: Pull and Build Semantic Knowledge Graph
cd ~/demo
git clone https://github.com/treygrainger/semantic-knowledge-graph.git
cd semantic-knowledge-graph
git checkout solr_6.5.1
mvn package
#3: Install Plugin
curl -X POST -H 'Content-Type: application/octet-stream' \
  --data-binary @semantic-knowledge-graph-1.0-SNAPSHOT.jar \
  http://localhost:8983/solr/.system/blob/skg.jar
#4: Create Collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=demo-collection'

#5: Register Plugin with Collection
curl http://localhost:8983/solr/demo-collection/config \
  -H 'Content-type:application/json' -d \
  '{ "add-runtimelib": { "name":"skg.jar", "version":1 } }'
#6: Register Response Writer
'{ "add-queryresponsewriter": {
"name": "skg", "runtimeLib": true, "class":
"com.apache.solr.search.skg.responsewriter.KnowledgeGraphResponseWriter"} }'
#7: Register Request Handler
'{ "add-requesthandler" : { "name": "/skg",
"class":"com.apache.solr.search.skg.KnowledgeGraphHandler",
"defaults":{ "defType":"edismax", "wt":"json"},
"invariants":{"wt":"skg"}, "runtimeLib": true } }'

Per-field options:
{field_name}.fallback string
An alternative (usually free-text) field to query in case foreground returns less than min_popularity. Example: {field_name}.fallback=content
{field_name}.facet-field string
A suffix to find a copy-field version of the value in {field_name} to use for the node during traversal. Example: {field_name}.facet-field=id-name
{field_name}.key string
A parallel key / id field to use to represent a particular value. Typically used to disambiguate multiple entities (each with a unique id) sharing the same textual name
Example:{field.v1.key}=id

Who’s in Love with Jean Grey?

Build a Co-occurrence Matrix
http://localhost:8983/solr/job-postings/skg

Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg

Score keywords from within a document
http://localhost:8983/solr/job-postings/skg

Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: 39grainger

Populating the Graph

Distributed Graph Traversal Streaming Expressions
top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*)
      )
    ),
    walk="node->user_id_i",
    gather="movie_id_i", count(*)
  )
)
Covered in more detail in deep-dive SQL & Graph webinar:
https://lucidworks.com/webinar/solr-6-deep-dive-sql-graph/

Graph Query Parser
• Query-time, cycle-aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
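The example URLs above are shown unencoded for readability; when sent from a client, the local-params syntax (braces, spaces, brackets) needs URL encoding. A minimal sketch using only the Python standard library follows. `graph_query_url` is a hypothetical helper, and the collection and field names are the ones from the examples.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def graph_query_url(base_url, root_query, from_field, to_field, fl="id", **local_params):
    """Build an encoded URL for a {!graph} query.

    Extra keyword arguments (e.g. maxDepth=1) are appended as local params.
    Values that contain spaces must already be quoted by the caller.
    """
    params = f"from={from_field} to={to_field}"
    for name, value in local_params.items():
        params += f" {name}={value}"
    q = "{!graph " + params + "}" + root_query
    return base_url + "?" + urlencode({"fl": fl, "q": q})

url = graph_query_url(
    "http://localhost:8983/solr/my_graph/query",
    root_query="foo:[* TO 10]",
    from_field="in_edge",
    to_field="out_edge",
    maxDepth=1,
)
print(url)
# Decoding the query string recovers the original q parameter.
print(parse_qs(urlsplit(url).query)["q"][0])
```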

• Term Frequency: “How well a term describes a document?”
– Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3


Lucidworks Fusion Architecture


Knowledge Graph – Potential Use Cases
Cross-walk between Types
• Have an ID field, but want to enable free text search
on the most associated entity with that ID?
• Have a “state” (geo) search box, but want to accept
any free-text location and map it to the right state?
• Have an old classification taxonomy and want to
know how the values from the old system now map
into the new values?
Build User Profiles from Search Logs
• If someone searches for “Java”, and then “JQuery”,
and then “CSS”, and then “JSP”, what do those have
in common?
• What if they search for “Java”, and then “C++”, and
then “Assembly”?
Discover Relationships Between Anything
• If I want to become a data scientist and know
Python, what libraries should I learn?
• If my last job was mid-level software engineer and
my current job is Engineering Lead, what are my
most likely next roles?
Traverse arbitrarily deep, Sort on anything
• Build an instant co-occurrence matrix, sort the top
values by their relatedness, and then add in any
number of additional dimensions (RAM permitting).
Data Cleansing
• Have dirty taxonomies and need to figure out which
items don’t belong?
• Need to understand the conceptual cohesion of a
document (vs spammy or off-topic content)?

Future Work
• Semantic Search (more experiments)
• Search Engine Relevancy Algorithms
• Trending Topics
• Recommendation Systems
• Root Cause Analysis
• Abuse Detection

Conclusion
Applications:
The Semantic Knowledge Graph has numerous applications, including
automatically building ontologies, identification of trending topics over time,
predictive analytics on timeseries data, root-cause analysis surfacing concepts
related to failure scenarios from free text, data cleansing, document
summarization, semantic search interpretation and expansion of queries,
recommendation systems, and numerous other forms of anomaly detection.
Main contribution of this paper:
The introduction (and open sourcing) of the Semantic Knowledge Graph, a
novel and compact graph model
that can dynamically materialize and score the relationships between any arbitrary
combination of entities represented within a corpus of documents.
