© 2022 Neo4j, Inc. All rights reserved.
A Fusion of Machine Learning and Graph Analysis for Free-Form Data Entry Clustering
Dr. Andrew Flinders
Joel Linford
Data Scientists
Northrop Grumman Corporation – Space Sector
Repair Narratives
Building Clusters
BERT Embeddings [1]
Constellations
Motivation
Problem
• Maintenance Records
• Need to identify patterns and structures present in free-form text
• Finding general topics can be challenging
Hypothesis
• We hypothesized that the combination of large language models (deep
learning), clustering techniques (shallow learning), and graph databases
(graph algorithms) could be used to map and retain these patterns.
Narratives – free form text with vital info
• 10 REPLACED ALL PISTONS
• 11 CLEANED HUBCAPS
• 12 COMPLETED DRIVE SHAFT CO
• 13 REPLACED WATER PUMP
• 14 SCHEDULED MAINTENANCE
• 15 CLEANED INJECTORS
• 16 CLEANED FLOOR MATS
• 17 PATCHED WIRING IN CAB
• 18 LABELED SEATING ASSIGNMENTS
• 19 NO FOUND ON SPARK PLUGS
This technique will work for any free-form text where there is reason to believe that there are patterns or trends. The entries above are examples of the text we are working with.
BERT Embeddings [1]
BERT is a language model that embeds text into semantically sensitive vectors (as opposed to a Bag of Words model, which is mostly semantically insensitive).
These vectors are extremely effective at allowing text to be used for machine learning.
BERT is a deep neural net (bringing deep learning and transfer learning into play).
How BERT was trained (and why it did not use Twitter)
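To make the Bag of Words contrast concrete, here is a minimal sketch in plain Python. The example narratives and the helper names (`bag_of_words`, `cosine`) are illustrative assumptions, not the authors' code. Two entries that describe a similar repair in different words share no tokens, so their bag-of-words vectors have zero cosine similarity, while two unrelated entries that share one word score higher. An embedding model such as BERT is intended to avoid this kind of mismatch.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count each lowercase whitespace-separated token in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical narratives: similar meaning, no words in common.
same_meaning = cosine(bag_of_words("REPLACED WATER PUMP"),
                      bag_of_words("SWAPPED OUT COOLANT IMPELLER"))

# Different repairs that happen to share the word CLEANED.
shared_word = cosine(bag_of_words("CLEANED INJECTORS"),
                     bag_of_words("CLEANED FLOOR MATS"))

print(same_meaning)           # 0.0
print(round(shared_word, 3))  # 0.408
```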
Using BERT [1] embeddings for clustering
Image from “Attention Is All You Need.” [1]
“We fixed the thing” –[BERT]-> [0.124, 0.432, 0.4523, ….. , 1.2432]
Clustering Algorithms
We tested several clustering algorithms. My favorite is the OPTICS [2] clustering algorithm as implemented by scikit-learn [3]. We also tested the DBSCAN [4] method and the KNN [5] method.
The OPTICS algorithm detects dense groupings in the data and designates those as cluster cores. It then allows each cluster to grow to a certain point and excludes outliers. This can be helpful for identifying unique entries.
Image from https://scikit-learn.org/stable/auto_examples/cluster/plot_optics.html#sphx-glr-auto-examples-cluster-plot-optics-py [3]
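The density idea described here (find dense cores, grow each cluster outward, leave outliers unassigned) is shared with DBSCAN [4], which is short enough to sketch in plain Python. Note that this is a minimal DBSCAN, not the OPTICS implementation in scikit-learn; the 2-D points stand in for high-dimensional embeddings, and the `eps` and `min_pts` values are illustrative assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for an outlier."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
        cluster_id += 1
    return labels


# Two dense groups and one isolated point (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The point labeled -1 plays the role of a unique entry: it has too few neighbors to join any cluster.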
Cluster 1
• WATER PUMP SEIZED REQS WATER PUMP REPLACE
• REPLACE WATER PUMP
• REPLACE WATER PUMP
• REPLACE WATER PUMP REQS CO
• REPLACE WATER PUMP
Cluster 2
• REPLACED THE WATER PUMP ALL CODES CLEARED
• REPLACED THE WATER PUMP ALL CODES CLEARED
• REPLACED WATER PUMP CODES CLEARED
• REPLACED WATER PUMP ALL CODES CLEARED
• REPLACED WATER PUMP ALL CODES CLEARED
Cluster 3
• TEAM CLEANED PLUGS ALL CODES CLEARED
• TEAM CLEANED PLUGS ALL CODES CLEARED
• TEAM CLEANED PLUGS ALL CODES CLEARED
• TEAM CLEANED PLUGS ALL CODES CLEARED
Cluster 4
• AIR CONDITIONER LEAKING REFRIGERANT REPLACED
• AIR CONDITIONER WAS LEAKING REFRIGERANT AIR CONDITIONER CHILLER ALL CODES CLEARED
• REPLACED AIR CONDITIONER FOR LOW REFRIGERANT
Similarity within Clusters
[Figure: one Cluster node (1) with six Narrative nodes attached by [in] edges. The narratives are variants of "Performed Corrosion Control", including a misspelling ("Corosion") and the abbreviation "CC". Key: Narrative, Cluster.]
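The Narrative-[in]->Cluster structure on this slide could be loaded into a graph database along the following lines. This is a sketch under stated assumptions: the node labels `Narrative` and `Cluster`, the relationship type `IN`, the property names, and the helper `build_rows` are all hypothetical, and the cluster ids are made up rather than taken from the authors' data.

```python
# Hypothetical (text, cluster_id) pairs; -1 marks an un-clustered entry.
narratives = [
    ("REPLACE WATER PUMP", 1),
    ("REPLACE WATER PUMP REQS CO", 1),
    ("TEAM CLEANED PLUGS ALL CODES CLEARED", 3),
    ("Sandwich found in pump", -1),
]


def build_rows(labelled):
    """Turn (text, cluster_id) pairs into parameter rows, skipping outliers."""
    return [{"text": text, "cluster": cluster}
            for text, cluster in labelled if cluster != -1]


# Assumed schema: (:Narrative {text})-[:IN]->(:Cluster {id}).
# MERGE reuses an existing Cluster node; CREATE keeps repeated
# narratives as separate nodes, one per data entry.
CYPHER = """
UNWIND $rows AS row
MERGE (c:Cluster {id: row.cluster})
CREATE (n:Narrative {text: row.text})
CREATE (n)-[:IN]->(c)
"""

rows = build_rows(narratives)
print(len(rows))  # 3
```

With the official Neo4j Python driver, `rows` would typically be passed as the `$rows` query parameter; connection details are omitted here.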
Dissimilarity for Un-clustered Entries
• Really unique problems
• Misspelled entries
• Data entered incorrectly
Examples of un-clustered narratives: "Perf MaintCOntrol", "Sandwich found in pump", "31542.1240", "CND"
Cluster Linking
1. Average embeddings saved on each cluster
2. Euclidean distance calculated between each cluster center
Key:
Narrative
Cluster
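The two cluster-linking steps can be sketched in plain Python as follows. The 3-dimensional vectors are hypothetical stand-ins for real embeddings, and the names `centroid`, `centers`, and `links` are illustrative rather than the authors' own.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical 3-dimensional embeddings grouped by cluster id.
clusters = {
    1: [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    2: [[1.0, 4.0, 0.0], [1.0, 4.0, 0.0]],
    3: [[1.0, 0.0, 3.0]],
}

# Step 1: average embedding saved on each cluster.
centers = {cid: centroid(vecs) for cid, vecs in clusters.items()}

# Step 2: Euclidean distance between each pair of cluster centers.
links = {(a, b): math.dist(centers[a], centers[b])
         for a, b in combinations(sorted(centers), 2)}

for pair, distance in links.items():
    print(pair, distance)
```

Each entry in `links` could then become a weighted edge between two cluster nodes in the graph.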
Current/Future Work: Graph Algorithms
• Centrality – what are the most important nodes?
• Pathfinding
• Similarity
• Community Detection
• What graph algorithms have you guys used and had
success with? We are looking to try some soon.
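As a small illustration of the centrality question, degree centrality (the fraction of other nodes a node is directly connected to) is about the simplest measure of node importance. The sketch below is plain Python on a made-up set of cluster links, not a graph database library call; the edge list is a hypothetical example.

```python
from collections import defaultdict

# Hypothetical undirected links between cluster nodes.
edges = [
    ("Engine", "Piston"),
    ("Engine", "Valve"),
    ("Engine", "Gasket"),
    ("Transmission", "Gasket"),
]


def degree_centrality(edge_list):
    """Map each node to (number of neighbors) / (number of other nodes)."""
    neighbors = defaultdict(set)
    for a, b in edge_list:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


scores = degree_centrality(edges)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('Engine', 0.75)
```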
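[Figure: pipeline diagram. Text -[input]-> Language Model -[then]-> Clustering -[then]-> Graph Building -[then]-> Graph Algos. Adjustable inputs: Fine tuning -[adjustable]-> Language Model; Clustering Algos -[adjustable]-> Clustering; Graph Design -[adjustable]-> Graph Building.]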
Current/Future Work: Fine-tuning BERT [1]
• Fine-tuning a language model should improve efficacy with our dataset (and is probably helpful in almost every application).
• Considering the training set for language models, it is probable that they will struggle with slang and modern connotations. (Someone should study this.)
• We have been looking into fine-tuning, and we think we have it working, but we have not tested it yet.
Current/Future Work: Summary Stats
• Clearly the clusters have meaning
◦ But summary statistics seem uninteresting
• This is likely due to over-simplification
◦ I.e., reducing 500+ dimensional vectors to one Euclidean distance loses too much information.
• Averaging seems to be… ok but not amazing.
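One way to see the information loss is a tiny worked example. The 3-dimensional vectors below are hypothetical stand-ins for 500+ dimensional embeddings. Two cluster centers can sit at exactly the same Euclidean distance from a reference center while pointing in unrelated directions, so the single distance value cannot tell them apart.

```python
import math

# Hypothetical cluster centers in a 3-dimensional embedding space.
reference = [0.0, 0.0, 0.0]
center_a = [3.0, 4.0, 0.0]
center_b = [0.0, 0.0, 5.0]

dist_a = math.dist(reference, center_a)
dist_b = math.dist(reference, center_b)
between = math.dist(center_a, center_b)

print(dist_a, dist_b)     # 5.0 5.0
print(round(between, 3))  # 7.071
```

Both centers look identical when summarized by their distance to the reference, yet they are farther from each other than either is from the reference.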
Patterns interlocking systems
[Figure: graph of cluster nodes Fluid Change, Valve, Gasket, Piston, Engine, Transmission, and Sched. Maint., connected by seven [subset_of] edges. Key: Broad Clustering, Narrow Clustering.]
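The slides do not say how the [subset_of] edges were derived. One possible reading, sketched below purely as an assumption, is to run the clustering twice (once with broad settings, once with narrow settings) and link a narrow cluster to a broad cluster when every narrative in the narrow one also falls inside the broad one. The narrative ids and cluster memberships are made up for illustration.

```python
# Hypothetical memberships: cluster name -> set of narrative ids.
narrow = {
    "Piston": {10, 42},
    "Valve": {7, 8},
    "Fluid Change": {3, 14, 21},
}
broad = {
    "Engine": {3, 7, 8, 10, 42},
    "Sched. Maint.": {3, 14, 21, 30},
}

# A narrow cluster is a subset_of a broad cluster when all of its
# narratives are also members of that broad cluster.
subset_edges = [
    (narrow_name, "subset_of", broad_name)
    for narrow_name, narrow_ids in narrow.items()
    for broad_name, broad_ids in broad.items()
    if narrow_ids <= broad_ids
]

for edge in subset_edges:
    print(edge)
```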
Thank you!

Designing IA for AI - Information Architecture Conference 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 

A Fusion of Machine Learning and Graph Analysis for Free-Form Data Entry Clustering

  • 1. A Fusion of Machine Learning and Graph Analysis for Free-Form Data Entry Clustering. Dr. Andrew Flinders and Joel Linford, Data Scientists, Northrop Grumman Corporation – Space Sector
  • 2. Repair Narratives; Building Clusters; BERT Embeddings [1]; Constellations
  • 3. Motivation
      ◦ Problem: Maintenance records. We need to identify the patterns and structures present in free-form text, and finding general topics can be challenging.
      ◦ Hypothesis: We hypothesized that the combination of large language models (deep learning), clustering techniques (shallow learning), and graph databases (graph algorithms) could be used to map and retain these patterns.
  • 4. Narratives: free-form text with vital info. This technique will work for any free-form text where there is a reason to believe that there are patterns or trends. Here are a couple of examples of the text we are working with:
      ◦ 10 REPLACED ALL PISTONS
      ◦ 11 CLEANED HUBCAPS
      ◦ 12 COMPLETED DRIVE SHAFT CO
      ◦ 13 REPLACED WATER PUMP
      ◦ 14 SCHEDULED MAINTENANCE
      ◦ 15 CLEANED INJECTORS
      ◦ 16 CLEANED FLOOR MATS
      ◦ 17 PATCHED WIRING IN CAB
      ◦ 18 LABELED SEATING ASSIGNMENTS
      ◦ 19 NO FOUND ON SPARK PLUGS
  • 5. BERT Embeddings [1]
      ◦ BERT is a language model that embeds text into semantically sensitive vectors (as opposed to a Bag of Words model, which is mostly semantically insensitive).
      ◦ These vectors are extremely effective at allowing text to be used for machine learning.
      ◦ BERT is a deep neural net (bringing deep learning and transfer learning into play).
      ◦ How BERT was trained (and why it did not use Twitter)
  • 6. Using BERT [1] embeddings for clustering. Image from "Attention Is All You Need." [1] Example: "We fixed the thing" -[BERT]-> [0.124, 0.432, 0.4523, ….. , 1.2432]
  • 7. Clustering Algorithms
      ◦ We tested several clustering algorithms. My favorite is the OPTICS [2] clustering algorithm as implemented by scikit-learn [3]. We also tested the DBSCAN [4] method and the KNN [5] method.
      ◦ The OPTICS algorithm detects dense groupings in the data and designates those as cluster cores. It then allows each cluster to grow to a certain point and excludes outliers. This can be helpful for identifying unique entries.
      ◦ Image from https://scikit-learn.org/stable/auto_examples/cluster/plot_optics.html#sphx-glr-auto-examples-cluster-plot-optics-py [3]
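As a rough illustration of the OPTICS step described on this slide, the sketch below clusters stand-in vectors with scikit-learn's `OPTICS`. The synthetic vectors (used here in place of real BERT embeddings), the `min_samples` value, and the DBSCAN-style cluster extraction with `eps=1.0` are all assumptions chosen for the example, not settings taken from the talk.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Stand-in data (assumption): two tight groups of 8-dimensional vectors that
# play the role of embeddings of similar narratives, plus one far-away vector
# that plays the role of a unique entry.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.05, size=(10, 8))
group_b = rng.normal(loc=5.0, scale=0.05, size=(10, 8))
unique_entry = np.full((1, 8), 20.0)
vectors = np.vstack([group_a, group_b, unique_entry])

# min_samples, cluster_method and eps are illustrative choices (assumptions).
clusterer = OPTICS(
    min_samples=3,
    metric="euclidean",
    cluster_method="dbscan",
    eps=1.0,
)
labels = clusterer.fit_predict(vectors)

# scikit-learn marks points that belong to no cluster with the label -1.
unclustered = np.flatnonzero(labels == -1)
print("labels:", labels.tolist())
print("unclustered rows:", unclustered.tolist())
```

Rows labelled -1 are the entries OPTICS left out of every cluster, which is the property the slide describes as helpful for spotting unique entries.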
  • 8. Example narratives by cluster:
      ◦ Cluster 1: WATER PUMP SEIZED REQS WATER PUMP REPLACE REPLACE WATER PUMP REPLACE WATER PUMP REPLACE WATER PUMP REQS CO REPLACE WATER PUMP
      ◦ Cluster 2: REPLACED THE WATER PUMP ALL CODES CLEARED; REPLACED THE WATER PUMP ALL CODES CLEARED; REPLACED WATER PUMP CODES CLEARED; REPLACED WATER PUMP ALL CODES CLEARED; REPLACED WATER PUMP ALL CODES CLEARED
      ◦ Cluster 3: TEAM CLEANED PLUGS ALL CODES CLEARED; TEAM CLEANED PLUGS ALL CODES CLEARED; TEAM CLEANED PLUGS ALL CODES CLEARED; TEAM CLEANED PLUGS ALL CODES CLEARED
      ◦ Cluster 4: AIR CONDITIONER LEAKING REFRIGERANT; REPLACED AIR CONDITIONER WAS LEAKING REFRIGERANT; AIR CONDITIONER CHILLER ALL CODES CLEARED; REPLACED AIR CONDITIONER FOR LOW REFRIGERANT
  • 9. Similarity within Clusters. Diagram: six narratives, each linked by an [in] edge to Cluster 1. Narrative text: Performed Corrosion Control Performed Corrosion Control Performed Corrosion Control Corrosion Control Performed Corosion Control Performed CC. Key: Narrative, Cluster.
  • 10. Dissimilarity for Un-clustered Entries
      ◦ Really unique problems
      ◦ Misspelled entries
      ◦ Data entered incorrectly
      ◦ Examples shown: "Perf MaintCOntrol", "Sandwich found in pump", "31542.1240", "CND"
  • 11. Cluster Linking
      ◦ 1. Average embeddings saved on each cluster
      ◦ 2. Euclidean distance calculated between each cluster center
      ◦ Key: Narrative, Cluster
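The two cluster-linking steps on this slide can be sketched in plain Python. The cluster names and the 3-dimensional vectors below are hypothetical stand-ins for real embeddings, and the talk does not show how the resulting distances were stored in the graph, so the `links` list is only one possible representation.

```python
import math
from itertools import combinations


def cluster_center(embeddings):
    """Average a cluster's embeddings dimension by dimension."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


# Hypothetical 3-dimensional embeddings grouped by cluster (assumption).
clusters = {
    "cluster_1": [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    "cluster_2": [[1.0, 4.0, 0.0], [1.0, 4.0, 0.0]],
    "cluster_3": [[1.0, 0.0, 3.0]],
}

# Step 1: one average embedding per cluster.
centers = {name: cluster_center(vectors) for name, vectors in clusters.items()}

# Step 2: Euclidean distance between every pair of cluster centers.
links = [
    (a, b, math.dist(centers[a], centers[b]))
    for a, b in combinations(sorted(centers), 2)
]

for a, b, distance in links:
    print(f"{a} -- {b}: {distance:.3f}")
```

Each tuple in `links` could become a weighted relationship between two cluster nodes, so that nearby clusters end up close together in the graph.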
  • 12. Current/Future Work: Graph Algorithms
      ◦ Centrality: what are the most important nodes?
      ◦ Pathfinding
      ◦ Similarity
      ◦ Community Detection
      ◦ What graph algorithms have you guys used and had success with? We are looking to try some soon.
  • 13. Pipeline diagram: Text -[input]-> Language Model -[then]-> Clustering -[then]-> Graph Building -[then]-> Graph Algos. Adjustable parts, each attached by an [adjustable] edge: Fine tuning, Clustering Algos, Graph Design.
  • 14. Current/Future Work: Fine tuning BERT [1]
      ◦ Fine tuning a language model will improve efficacy with our dataset (and is probably helpful in almost every application).
      ◦ Considering the training set for language models, it is probable that they will struggle with slang and modern connotations. (Someone should study this.)
      ◦ We have been looking into fine tuning, and we think we have it working, but we have not tested it yet.
  • 15. Current/Future Work: Summary Stats
      ◦ Clearly the clusters have meaning, but summary statistics seem uninteresting.
      ◦ This is likely due to over-simplification, i.e. reducing 500+-dimensional vectors to one Euclidean distance just lost too much information.
      ◦ Averaging seems to be… ok but not amazing.
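One way to see the information loss mentioned on this slide is a toy case where two very different relationships collapse to the same Euclidean distance. The 4-dimensional vectors below are made up for illustration and are not taken from the talk's data.

```python
import math

# Hypothetical cluster centers (assumption): a reference cluster and two
# neighbors that differ from it in very different ways.
reference = [0.0, 0.0, 0.0, 0.0]
neighbor_a = [2.0, 0.0, 0.0, 0.0]  # differs only in the first dimension
neighbor_b = [1.0, 1.0, 1.0, 1.0]  # differs slightly in every dimension

dist_a = math.dist(reference, neighbor_a)
dist_b = math.dist(reference, neighbor_b)

# The per-dimension differences keep what the single distance throws away.
diff_a = [y - x for x, y in zip(reference, neighbor_a)]
diff_b = [y - x for x, y in zip(reference, neighbor_b)]

print("distances:", dist_a, dist_b)
print("difference vectors:", diff_a, diff_b)
```

Both neighbors sit at the same distance from the reference, so an edge that stores only that distance cannot tell them apart, while the difference vectors still can.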
  • 16. Patterns interlocking systems. Diagram nodes: Fluid Change Valve Gasket Piston Engine Transmission Sched. Maint., connected by seven [subset_of] edges. Key: Broad Clustering, Narrow Clustering.
  • 18. Thank you!