This document discusses using knowledge graphs to ground large language models (LLMs) and improve the quality of their responses. It begins with an overview of generative AI and LLMs, noting their opportunities but also challenges like outdated knowledge and the inability to verify sources. The document then proposes using a knowledge graph like Neo4j to provide context and ground LLMs, describing how graphs can be enriched with algorithms, embeddings, and other data. Finally, it demonstrates how contextual searches and responses can be improved by retrieving relevant information from the knowledge graph to augment LLM responses.
LLMs are obviously the hot new topic, and they fall within the larger exciting field of Generative AI
LLMs focus on generating LANGUAGE specifically, whether that is natural language such as in summarizing texts or driving chat interactions, or more specialized language scenarios like code.
I’m pretty sure that most of us in this room have tested them out in some capacity - and it is really amazing how well they draft summaries or even write Python and Cypher snippets :-)
Last week I used ChatGPT to name this presentation - I gave it an outline and asked it to come up with a catchy title
there are so many ways these models can assist our work and lives
LLMs are probabilistic models that take a lot of data, and a lot of time and resources, to train
This results in some limitations - because training takes so much effort, you can’t keep constantly adding data, so some of the models are years out of date
Also this means that rather than giving you a factual answer, they will give you the most PROBABLE answer
That answer depends heavily on what the model has been exposed to - which opens the door to bias
The training data is also largely general knowledge, as opposed to the specific expertise of your organization, and the answers are not easily auditable or explainable
For example - if you are a financial analyst and ask a question like "which managers own a particular stock", it will give you back a list of people rather than understanding that a manager in this case is an institutional investment manager
So what do we do?
Not using LLMs will mean being left in the past, but how do we get LLMs to attend more math classes as Sudhir said, and be less creative when we need them to be factual?
Adding a Neo4j knowledge graph to your LLMs helps improve the relevance and explainability of answers by grounding the LLM and ensuring its responses are underpinned with facts.
This process is called Retrieval Augmented Generation. This approach combines the creative power of your generative model with the stored data of your knowledge graph for more accurate responses.
User asks a question
LLM directed to look for information in Neo4j
Response generated based on the trusted content that is curated by the organization
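To make that concrete, here is a minimal sketch of the retrieval step in Python using the official neo4j driver. The (:Manager)-[:OWNS]->(:Stock) model, credentials, and example question are all hypothetical; the augmented prompt would then be sent to whichever LLM you are using.

```python
# Minimal retrieval sketch, assuming a hypothetical (:Manager)-[:OWNS]->(:Stock) model
# and a local Neo4j instance; adapt labels and credentials to your own graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def retrieve_facts(ticker: str) -> str:
    # Pull curated facts from the graph instead of relying on the LLM's training data.
    query = """
    MATCH (m:Manager)-[o:OWNS]->(s:Stock {ticker: $ticker})
    RETURN m.name AS manager, o.shares AS shares
    ORDER BY o.shares DESC LIMIT 10
    """
    with driver.session() as session:
        rows = session.run(query, ticker=ticker)
        return "\n".join(f"{r['manager']} holds {r['shares']} shares" for r in rows)

question = "Which managers own NVDA?"
context = retrieve_facts("NVDA")
# The augmented prompt is what actually goes to the LLM, grounding its answer in graph facts.
prompt = f"Answer using only the facts below.\n\nFacts:\n{context}\n\nQuestion: {question}"
```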
Why specifically use Neo4j to ground your LLMs?
Graphs inherently capture CONTEXT
When you are looking for information about an individual or organization, just identifying them isn’t enough - you also want to know your complete relationship with that person or company, and how their interactions with you compare to those of their peers.
Graph data science allows you to enrich your data - you can infer new information about those entities based on their relationships -
like the risk associated with an entity, or identify other similar entities
So you have all this wonderful information - how do you get it to consumers?
When we combine graphs and LLMs - any user, regardless of graph skill, can access reliable, and relevant information
Let’s walk through these steps
I’ll go through this step quickly because I think you’ve had exposure to this before
Building a knowledge graph, we start with data in its native form, identifying all of the sources that have information related to your business problem. In most cases, this information is siloed across a wide variety of sources and data formats
We then ingest all of these sources into a graph, where your unstructured and structured data become connected.
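As a rough illustration of that ingest step, here a hypothetical Company record from a structured source is connected to the Document that mentions it; the labels and credentials are assumptions for the sketch.

```python
# Illustrative ingestion sketch (hypothetical labels): connect a structured record
# with an entity mentioned in an unstructured document, so both live in one graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run("""
        MERGE (c:Company {name: $name})
        MERGE (d:Document {id: $doc_id})
        MERGE (d)-[:MENTIONS]->(c)
    """, name="Acme Corp", doc_id="ticket-123")
```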
This allows us to start thinking about step 2, enrichment, where we can layer in semantics and use graph algorithms and queries to derive implicit relationships and capture even more context.
But just this first step gives us so much information - by making the natural relationships explicit, we can start answering questions that would otherwise take hours of preprocessing and data joining with tabular data. Things like
What is the overall fraud risk of an account across channels, including risk implications that are based on more distant relationships with flagged accounts.
Or tracing how a drug is made - its form, active ingredients, additional ingredients, and more - through to how the drug reacts with different genes, which would normally require pulling together information from many sources
And comparing patterns of user activity to see which actions most commonly lead to purchase behaviors
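For instance, the first question above might look roughly like the following Cypher, run through the same driver as in the earlier sketch; the Account and Identifier labels, the relationship types, and the flagged property are illustrative assumptions, not a prescribed model.

```python
# Hedged Cypher sketch of a cross-channel fraud-risk question; names are illustrative.
fraud_risk_query = """
MATCH (a:Account {id: $account_id})
OPTIONAL MATCH (a)-[:USES|TRANSACTS_WITH*1..4]-(f:Account {flagged: true})
OPTIONAL MATCH (a)-[:USES]->(id:Identifier)<-[:USES]-(f)
RETURN a.id AS account,
       count(DISTINCT f)  AS flagged_accounts_within_4_hops,
       count(DISTINCT id) AS identifiers_shared_with_flagged_accounts
"""
# Run it with the neo4j driver from the earlier sketch:
#   session.run(fraud_risk_query, account_id="acct-42")
```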
Once you have the building blocks, you can expand the graph to capture the heterogeneous entities in your business.
Life sciences - manufacturing a drug is not simple. What is a drug? Tylenol has active ingredients plus other ingredients that make up the pill; a lot of research has been done on how the drug reacts with a gene, how that gene relates to other genes and diseases, and how the drug is made. The graph captures all of these aspects of the business and completes the connection from genes to diseases to targets - which would take a TON of preprocessing without a graph, but in the graph is just traversing relationships.
Questions that would otherwise be very difficult become trivial, answered by looking at relationships alone.
Graph data science is also a science-driven approach, but in this world data scientists use relationships to answer questions. And the good news is it’s not EITHER/OR - it’s BOTH: we can leverage queries and data science together, since the relationships are already in your data.
A few different use cases and applications. With a few banks attending - at a very high level, a generalized model looks something like this: account holder, bank, PII such as SSN, phone number, address, and so on (transactions are not represented here). You can also see people sharing information. If you want to ask a question, you query this, and because of the way the data is stored in a graph database, you traverse - multiple hops.
If you want to see dependencies between drugs, you can easily be looking at more than 4 hops - better to have a graph than joins.
Retail is a top use case for graph databases - similarity between customers based on behavior, traversing to find similarity between customers and between products, correlations between products purchased together - powering marketing, recommendations, and search; adopted by most top retailers.
Looking at local pattern matches with Cypher - in finserv, given an account holder, how many flagged accounts are out there, via links to metadata such as bank accounts and mobile apps? If you know what the data model looks like - the diagram is made up of entities and the semantic relationships connecting them - going 4 hops out makes sense. It is more a semantic statement than a pure traversal statement.
Old —
So a lot of times when we start talking to prospects, they’re really excited about using graph data science, and we’re very excited that they want to use it, but we always want to start off as simple as possible and see what they can accomplish with just cypher. Sometimes just changing your data structure from relational to graph and realizing those relationships that are hidden in your data can answer your questions without requiring algorithms or ML.
That said, a lot of problems DO require machine learning, it’s just about using the right tools for the problem at hand.
For example, in finance and identifying fraud, you might want to evaluate fraud risk for a particular applicant. With relational data you might be able to see that this person shares an SSN with another account one hop, or one relationship, away - but you wouldn’t have visibility into how many flagged accounts are 4 hops out, and how many common identifiers are shared with those nodes. So just with Cypher queries you can get a lot of important information that might influence your decision to approve a request or flag an account as risky for further follow-up. And it can be helpful for this to be just Cypher, because you want those interactions to be very fast, at that millisecond response level.
Life sciences also has some really cool applications, traversing the graph to understand what connects genes to diseases to targets, and understanding those more distant relationships to help improve drug repurposing studies.
And then we have a lot of common marketing and recommendations applications with cypher queries. So if two customers buy the same product, can we compare their other purchases to what else they are likely to buy. Once we start moving away from just traversing relationships and trying to compare user patterns at the network level, that’s when we want to start bringing in graph data science.
And keep in mind, this is not an either or situation, you can use BOTH cypher queries AND graph machine learning and algorithms because we start with relationships already realized in the data.
What may seem like complex questions become simple queries, so we’re already seeing value - but there’s still more information hidden in the patterns of relationships, waiting to be pulled out.
We can tease out that enrichment in several ways
Creating derived relationships with queries and simple pattern matches
Algorithms to capture things within the larger context of the graph, like relative importance - or influence in a social network - or community information
And finally we can capture the broader context of a node by capturing its neighborhood and representing it as a fixed-length vector, or embedding.
Node embeddings capture the broader neighborhood around an entity as a vector
They allow us to identify similar entities based on their social network or activities, like in a customer journey
Katie and Phani are more similar than Emil
they will also become important when we start thinking about our final step of integrating with LLMs
All of these can directly surface information in the graph, or be used as input features to help enrich ML Pipelines - so for example in a fraud use case we may
create direct relationships between users with shared identifiers
generate influence scores as well as community statistics
and then capture their interactions with users multiple hops away as an embedding to predict fraud
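A sketch of what that enrichment might look like with the Neo4j Graph Data Science Python client (graphdatascience); the Account label, TRANSACTS_WITH relationship, and property names are assumptions, and exact parameters may vary by GDS version.

```python
# Enrichment sketch with the Neo4j Graph Data Science Python client
# (`pip install graphdatascience`); labels and property names are illustrative.
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Project the transaction network into an in-memory graph for the algorithms.
G, _ = gds.graph.project("accounts", ["Account"], ["TRANSACTS_WITH"])

gds.pageRank.write(G, writeProperty="influence")    # relative importance / influence
gds.louvain.write(G, writeProperty="community")     # community membership
gds.fastRP.write(G, embeddingDimension=128,
                 writeProperty="embedding")         # neighborhood captured as a vector

G.drop()  # release the in-memory projection
```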
The types of insights and enrichment we can derive from the use of graph algorithms are quite varied, because the underlying approaches themselves are varied and map onto traditional ML concepts
we have unsupervised and supervised approaches
Unsupervised - clustering and association based on similarity/distance in network
Centrality - how important a node is
Dimensionality reduction - capturing the local network in embeddings
Also supervised approaches ….
Choosing the right category is all about your use case
Centrality is about finding influencers or most important nodes in your graph, and understanding how they impact the network. For example key bridge points between subnetworks can highlight risk points or vulnerabilities in supply chains. Or Influencers in a social network
Within Pathfinding - shortest path can link drug targets to the most likely outcomes or side effects, or find optimal paths in routing scenarios within supply chain or energy sectors.
Community detection enables more targeted recommendations, customer segmentation, and entity resolution
Similarity algorithms are also key to associating similar nodes (e.g. Jaccard), and enable what-if analysis and disaster recovery scenarios
Embeddings - as I mentioned earlier - help us capture higher-dimensional signals and use them in predictive pipelines or to evaluate similarity - in this case, similarity from the perspective of network features
And finally link prediction can be used to enrich the graph and handle data quality challenges as well as find the next best recommendation or action for an individual
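Continuing the earlier enrichment sketch, one hedged way to surface similar entities is k-nearest-neighbours over the stored embeddings, writing SIMILAR_TO relationships back to the graph; again, every name here is illustrative and parameters may vary by GDS version.

```python
# Similarity sketch: kNN over the FastRP embeddings written earlier (hypothetical names).
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Project the accounts again, this time carrying the stored embedding property.
G, _ = gds.graph.project("accounts-knn",
                         {"Account": {"properties": ["embedding"]}},
                         ["TRANSACTS_WITH"])

gds.knn.write(G, nodeProperties=["embedding"], topK=5,
              writeRelationshipType="SIMILAR_TO", writeProperty="similarity")

G.drop()
```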
So far we have our graph, we’ve enriched it with algorithms, and we can use vectors to query it, so we’re ready for LLMs
Same is true of powering semantic search
Vector similarity is just the first step - with a particular search query you can find the most relevant mentions or documents in your data
As you layer in graph queries then you start to understand the more complete context, who authored that particular document, what job title do they have, who else is on their team, and who else contributed to or has access to that information?
With graph algorithms we can further inform the author’s influence in the organization or who has similar interests, and so on
These last two steps are critical in supplying context and helping to filter responses to the most relevant information.
They also play an important role in improving search where text and other document focused references are sparse.
Example: using search to find people who have the knowledge relevant to help solve a bug you’ve encountered
you may search for FAQ or customer support tickets using vector search, and look at the authors.
But then that will limit you to only people who have taken the time to document their knowledge, and those people may no longer even be in a relevant role or with the company.
With the graph, you can take another step out in context and ask - what skills did the author have, and who else has those skills and is in a relevant role currently?
Based on their influence - How trusted are their responses?
You can increase the breadth and the relevance of your search to make sure you find the right people to solve the problem.
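A hedged sketch of that "vector search first, then widen with the graph" pattern, assuming a Neo4j version with vector indexes (db.index.vector.queryNodes) and a hypothetical ticket/author/skill model; the index name, labels, and placeholder embedding are all assumptions.

```python
# Expert-finding sketch: vector search over support tickets, then graph hops to people.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# In practice this vector comes from the same embedding model used to index the tickets;
# the zero vector is just a runnable placeholder.
question_embedding = [0.0] * 1536

expert_search = """
CALL db.index.vector.queryNodes('ticket_embeddings', 5, $embedding)
YIELD node AS ticket, score
MATCH (ticket)<-[:AUTHORED]-(author:Person)-[:HAS_SKILL]->(skill:Skill)
MATCH (skill)<-[:HAS_SKILL]-(colleague:Person)
WHERE colleague.active = true
RETURN ticket.title AS ticket, score, author.name AS author,
       collect(DISTINCT colleague.name) AS others_with_same_skills
ORDER BY score DESC
"""

with driver.session() as session:
    for row in session.run(expert_search, embedding=question_embedding):
        print(row["ticket"], row["author"], row["others_with_same_skills"])
```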
This type of maturity in approach enables you to tackle those challenges we talked about at the beginning
By taking advantage of both semantic embeddings provided by LLMs, AND the rich, human readable data in a knowledge graph, you can reduce hallucinations and get domain specific responses.
Now, Let's dive a little deeper into the process. How exactly do we achieve this?
We start by connecting to Neo4j and getting the graph schema we want to understand.
We provide that as context to the LLM, along with a few example queries, which primes it with custom domain knowledge.
We can also optionally fine-tune the model with even more domain-specific examples
And then, given its knowledge about the database, it generates a query to interact with the database and summarizes the resulting data
We are now able to answer our questions using natural language.
This grounding is called retrieval augmented generation
(explain diagram)
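A minimal sketch of that flow with LangChain (which we use in the demo), assuming a version where Neo4jGraph and GraphCypherQAChain are available; the model name, credentials, and question are placeholders, and recent LangChain versions also require explicitly opting in via allow_dangerous_requests.

```python
# Text-to-Cypher sketch with LangChain; names and credentials are placeholders.
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI
from langchain.chains import GraphCypherQAChain

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
llm = ChatOpenAI(model="gpt-4o", temperature=0)  # low temperature: less creative, more factual

# The chain reads the graph schema, asks the LLM to write Cypher, runs the query,
# and summarizes the result in natural language.
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # required by newer LangChain versions
)
chain.invoke({"query": "Which business tasks are impacted by critical vulnerabilities?"})
```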
Knowledge Graphs enable linking of structured and unstructured data for most accurate & relevant results
Ultimately - whether you’re using GenAI or other ML - graphs play a key role in getting the full value out of your data
Graphs pull out the relationships that natively exist in your data
We can then gain more context and valuable information using graph algorithms
Interpret and communicate results with analysts using Bloom
Integrate with a wide variety of technologies including large language models and ai platforms to further extend the value of your data.
Excited to see how applications of graphs and genai develop, and come chat with me during the reception, i’d love to hear what you’re working on!
Let’s see this in practice
We’re going to use a business intelligence data model
In this use case it’s important to bring together information across the organization
connecting critical vulnerabilities, software applications, data centers, and other IT assets, so we can understand how each vulnerability - be it cybersecurity or physical - will impact your business processes
This is a simplified version, but under the covers it is much more than a few joins, it’s a network of information - each one of these icons has its own web of hierarchies and relationships
So this is what the actual data model looks like, where we have all of those different domains of data, people, locations, vulnerabilities, and are able to surface those connections to easily understand how a particular vulnerability may impact software applications which therefore impacts business tasks.
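The kind of impact query behind that demo might look roughly like this; the Vulnerability, Application, DataCenter, and BusinessTask labels and the relationship types are assumptions about the simplified model on the slide.

```python
# Hedged impact-analysis sketch over a hypothetical BI data model.
impact_query = """
MATCH (v:Vulnerability {severity: 'CRITICAL'})-[:AFFECTS]->(app:Application)
MATCH (app)-[:RUNS_IN]->(dc:DataCenter), (app)-[:SUPPORTS]->(task:BusinessTask)
RETURN v.name AS vulnerability, app.name AS application,
       dc.name AS data_center, collect(task.name) AS impacted_tasks
"""
# Run it with the neo4j driver as in the earlier sketches: session.run(impact_query)
```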
That’s what we’ll go through in the demo
A little bit on what’s happening under the hood before I jump into the demo
We’re using LangChain to orchestrate interactions between the database and the selected LLM
Few-shot prompting - just providing a few examples
Maruti showed you a bit of the behind the scenes details for what a prompt looks like
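A few-shot prompt for Cypher generation can be as simple as a template with a couple of hand-written question/query pairs; the examples below are hypothetical and would be adapted to your own schema.

```python
# Illustrative few-shot prompt template: the hand-written pairs steer the LLM toward
# the schema's labels and relationship types (examples are hypothetical).
FEW_SHOT_PROMPT = """You translate questions into Cypher for the schema below.

Example 1
Question: Which applications run in the Frankfurt data center?
Cypher: MATCH (a:Application)-[:RUNS_IN]->(d:DataCenter {{name: 'Frankfurt'}}) RETURN a.name

Example 2
Question: How many critical vulnerabilities affect each application?
Cypher: MATCH (v:Vulnerability {{severity: 'CRITICAL'}})-[:AFFECTS]->(a:Application)
        RETURN a.name, count(v)

Question: {question}
Cypher:"""

prompt = FEW_SHOT_PROMPT.format(question="Which business tasks depend on flagged software?")
```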
Before I leave you - I want to highlight where Neo4j sits more broadly in the Generative AI ecosystem.
LLMs can help capture knowledge and accelerate the data ingest process, particularly with unstructured data (we didn’t focus on this topic today, but thankfully Maruti highlighted it in his presentation - if you have questions, come find me)
Then, in the graph, we can enrich the factual data of your organization using graph queries, algorithms, and data science workflows
Finally - leverage LLMs to enable chatbots and semantic search to power applications - you can increase the accuracy and relevance of recommendations, facilitate sourcing of institutional knowledge in customer service and ticketing use cases - there are so many ways you can apply this technology