Unleash Power of Neo4j GPT Harmonize Cancer Data

©2023 FI Consulting. All rights reserved. ficonsulting.com
Unleash the Power of Neo4j with GPT and Large Language Models: Harmonizing Concepts from Cancer Research Data
October 25, 2023

Speakers
§ Mark A. Jensen, PhD, Director of Data Science in the Center for Technical Operations Support at the Frederick National Laboratory for Cancer Research
§ Robert Chang, Modeling & Analytics Domain Leader at FI Consulting

Cancer Research Data Commons (CRDC)
[Diagram: Cancer Research Data Commons, with data areas labeled Clinical, Proteomics, Imaging, Genomics, Immuno-oncology, Animal Models, and Cancer Biomarkers.]
§ Cancer Data Aggregator: aggregate by patient, sample, study, disease, tissue, etc.
§ Cancer Data Hub: enable submission to one or multiple commons
In development:

CRDC is a Federation of Going Concerns
§ Each CRDC node has its own data systems, business processes, stakeholders, and users.
§ Each has its own purpose-built data model that enables data ingestion, query, and distribution.
§ Each has large, ongoing inflows and outflows of data today.
§ Some have large legacy datasets that contain inconsistencies and gaps.
§ A top-down, prescriptive approach to standardization is not feasible.
§ Retrospective data cleanup is highly manual and infrequent.
§ CRDC nodes have their own sets of data submitters:
  § Larger submitters (such as well-funded consortia) have data wranglers and engineers to automate data transformation and submission once their study is “onboarded”.
  § Smaller submitters, with more diverse data and fewer IT resources, are set to become a larger part of the submitter base because of new data sharing requirements for individual NIH grants.

Common Idiosyncrasies in Submitted Data

Example: "Species"

Typical Study Onboarding Workflow

Solution Overview
§ Our solution uses graph technology (Neo4j) and natural language processing (NLP) techniques.
§ NLP lets the solution go beyond simple string-based similarity, and the graph makes the solution fast, efficient, and scalable.

Data Preparation – GPT for Synonym Generation

Data Preparation – GPT for Parsing Long Text
§ Generate keywords from long text to facilitate querying.
§ GPT was used to generate keywords because it returned phrases, whereas other parsers returned only single words.
“We established a preclinical testing program that has created >300 genomically-characterized pediatric solid tumor patient-derived xenograft (PDX) models between the pediatric oncology programs at Memorial Sloan Kettering Cancer Center and University of California San Francisco. We propose to leverage this large portfolio of models across a diversity of diseases, along with the deep expertise of the team, to establish an NCI Pediatric In Vivo Testing Program (Ped-In Vivo-TP) Research Team focused on pediatric bone and soft tissue sarcomas, renal tumors, desmoplastic small round cell tumor (DSRCT) and other rare pediatric solid tumors.”
Keywords from the GPT-3.5 model: Preclinical testing program; Pediatric solid tumor; Patient-derived xenograft (PDX); NCI Pediatric In Vivo Testing Program; Rare pediatric solid tumors
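The slides do not show the prompt or the API call used for keyword generation. As a minimal sketch, the two model-independent steps (building a prompt and parsing a semicolon-separated reply like the keyword line above) might look as follows. The function names and the prompt wording are hypothetical, and the call to the model itself is deliberately left out.

```python
def build_keyword_prompt(text: str, max_keywords: int = 5) -> str:
    """Build a prompt asking a chat model for key phrases.

    The wording is an assumption; the original prompt is not shown in the slides.
    """
    return (
        f"Extract up to {max_keywords} key phrases from the text below. "
        "Return them on one line, separated by semicolons.\n\n"
        f"Text: {text}"
    )


def parse_keywords(response: str) -> list:
    """Split a semicolon-separated reply into phrases with normalized whitespace."""
    phrases = [" ".join(part.split()) for part in response.split(";")]
    return [phrase for phrase in phrases if phrase]


# The reply text is copied from the keyword line on the slide, including its line break.
reply = (
    "Preclinical testing program; Pediatric solid tumor; Patient-derived xenograft\n"
    "(PDX); NCI Pediatric In Vivo Testing Program; Rare pediatric solid tumors"
)
keywords = parse_keywords(reply)
```

Requesting a fixed separator keeps multi-word phrases intact when the reply is split, which matters here because phrases rather than single words were the reason for using GPT.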

Vectorizing Text Data
Embedding models:
§ BioBERT-Base
§ OpenAI
Evaluation methods:
§ Visualization of clusters
§ Dimensionality reduction by t-SNE
§ Cosine similarity of the top-5 most similar nodes

Cosine Similarity   BioBERT-Base  OpenAI
Mean                0.9327        0.9269
Standard Deviation  0.0364        0.0446

Calculate Text Similarity
After generating dense vectors, we calculate cosine similarity scores between each pair of vectors to find matches.
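A minimal sketch of this step, assuming the embeddings are available as plain lists of floats keyed by their text. The three-dimensional vectors are invented toy values, and the helper names are hypothetical; real embeddings have hundreds of dimensions.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(vectors, query, k=5):
    """Return the k (label, score) pairs most similar to `query`, best first."""
    scores = [
        (label, cosine_similarity(vectors[query], vec))
        for label, vec in vectors.items()
        if label != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration only; they are not real embedding output.
vectors = {
    "cisplatin": [0.90, 0.10, 0.20],
    "cisplastin": [0.88, 0.12, 0.21],
    "bronchoscopy": [0.10, 0.90, 0.30],
}
matches = top_matches(vectors, "cisplastin", k=2)
mean_score = statistics.mean(score for _, score in matches)
```

Comparing every pair this way grows quadratically with the number of vectors, so a large dataset would normally rely on an indexed vector search instead of a full pairwise scan.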

Adding Nodes to our Graph
§ Nodes from the data's structure:
  § Category
  § Header
  § Value
§ Synthetic nodes:
  § Synonyms of headers
  § Synonyms of values
  § Tokens

Adding Edges to our Graph
§ Edges from the data's structure
§ Edges from synonyms
§ Synthetic edges:
  § Fast vector similarity search
  § Cosine similarity
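The node and edge types listed on these two slides can be sketched as plain Python sets before anything is loaded into Neo4j. The labels (`Category`, `Header`, `Value`, `Synonym`), the relationship names (`HAS_HEADER`, `HAS_VALUE`, `HAS_SYNONYM`, `SIMILAR_TO`), the 0.95 threshold, and the sample rows are all assumptions made for illustration; the slides do not give the actual schema. Token nodes are omitted for brevity.

```python
def build_graph(rows, synonyms, similar_pairs, threshold=0.95):
    """Turn tabular data into node and edge sets.

    rows: (category, header, value) tuples taken from the data's structure.
    synonyms: dict mapping (label, term) to a list of synonym strings.
    similar_pairs: (value_a, value_b, cosine_similarity) tuples.
    """
    nodes, edges = set(), set()

    # Structural nodes and edges: Category -> Header -> Value.
    for category, header, value in rows:
        cat, hdr, val = ("Category", category), ("Header", header), ("Value", value)
        nodes.update([cat, hdr, val])
        edges.add((cat, "HAS_HEADER", hdr))
        edges.add((hdr, "HAS_VALUE", val))

    # Synthetic synonym nodes, linked to the header or value they describe.
    for (label, term), alternatives in synonyms.items():
        for alt in alternatives:
            syn = ("Synonym", alt)
            nodes.add(syn)
            edges.add(((label, term), "HAS_SYNONYM", syn))

    # Synthetic similarity edges for pairs at or above the threshold.
    for value_a, value_b, score in similar_pairs:
        if score >= threshold:
            edges.add((("Value", value_a), "SIMILAR_TO", ("Value", value_b)))

    return nodes, edges


# Hypothetical sample input; the similarity score is expressed as a fraction.
rows = [("demo", "treatment", "cisplastin"), ("demo", "treatment", "cisplatin")]
synonyms = {("Header", "treatment"): ["therapy"]}
similar_pairs = [("cisplastin", "cisplatin", 0.99221)]
nodes, edges = build_graph(rows, synonyms, similar_pairs)
```

Keeping nodes and edges as sets of tuples removes duplicates before loading, so the same header or value seen in many rows becomes a single node.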

Use Case #1: Correcting Typos
Value on Dataset                 Correct Value on Dataset          Cosine Similarity
subcarina                        subcarinal                        96.504%
irinitecan                       irinotecan                        98.506%
bronvhoscopy                     bronchoscopy                      99.262%
right paratrac                   right paratracheal                98.186%
liposarcoma, well diferentiated  liposarcoma, well differentiated  98.776%
supraclav                        supraclavicular                   99.574%
pleiomorphic liposarcoma         pleomorphic liposarcoma           99.917%
cisplastin                       cisplatin                         99.221%
hospitialization                 hospitalization                   98.369%

Value “subcarina” from the ACRIN dataset may be a typo and could match value “subcarinal” from the same dataset.
[Figure: graph view with a HEADER node labeled “Anatomic Site - other” and two VALUE nodes.]

Value “bronvhoscopy” from the ACRIN NSCLC dataset is a typo and could match value “bronchoscopy” from the same dataset.
[Figure: graph view with HEADER nodes labeled “procedures performed for diagnostic workup” and “sites of progression” and two VALUE nodes.]

Use Case #2: Mapping to Standards
Value on Dataset                Similar Value on Dataset   Cosine Similarity
lower third of esophagus        abdominal esophagus        99.994%
abdominal fibromatosis          desmoid fibromatosis       99.607%
acute monoblastic leukemia      acute monocytic leukemia   99.998%
androblastoma                   sertoli leydig cell tumor  99.721%
glial-neuronal neoplasm         glioma                     99.678%
benign fibrocyst                benign fibrosis            97.492%
hemangioendothelioma malignant  hemangiosarcoma            99.361%
pinealoma                       pineocytoma                99.957%
diagnosis                       dx                         99.885%
acute myelogenous leukemia      acute myeloid leukemias    99.347%
pemetrexed                      alimta                     99.414%

Value “lower third of esophagus” from the CCDI dataset is similar to value “abdominal esophagus” from the TCGA dataset.
[Figure: graph view with a HEADER (CCDI) node, a HEADER (TCGA) node, and two VALUE nodes.]

Value “acute monoblastic leukemia” from the CCDI dataset is similar to value “acute monocytic leukemia” from the same dataset.

Neo4j Queries for Node Classification
Random Walk Procedure:
§ Using paths created in the graph by various methods (cosine similarity, ontology, SME), the random walk can identify headers that are in close proximity.
§ Run the Neo4j random walk procedure using Cypher to generate random walk paths.

Neo4j Queries for Node Classification (cont.)
Using the Neo4j Python Driver, execute the GDS (Graph Data Science) procedure and count header occurrences in a node's random walks.
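The query and driver code from the slide are not preserved in this text. The sketch below shows one way the step could look, assuming an in-memory GDS graph has already been projected, that header nodes carry a `Header` label and a `name` property, and that the installed GDS version exposes the procedure as `gds.randomWalk.stream` (some older releases place it under a different namespace). The function names, parameter names, and default walk settings are hypothetical.

```python
from collections import Counter

# Cypher for the GDS random walk; the label and property names are assumptions.
RANDOM_WALK_QUERY = """
CALL gds.randomWalk.stream($graph_name, {
  sourceNodes: [$source_id],
  walkLength: $walk_length,
  walksPerNode: $walks_per_node
})
YIELD nodeIds
UNWIND nodeIds AS nodeId
WITH gds.util.asNode(nodeId) AS n
WHERE n:Header
RETURN n.name AS header
"""


def count_headers(header_names):
    """Count how often each header name appears across all walks."""
    return Counter(header_names)


def headers_near(uri, auth, graph_name, source_id, walk_length=10, walks_per_node=50):
    """Run random walks from one node and tally the headers they visit."""
    # Imported here so count_headers can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            result = session.run(
                RANDOM_WALK_QUERY,
                graph_name=graph_name,
                source_id=source_id,
                walk_length=walk_length,
                walks_per_node=walks_per_node,
            )
            # The Counter consumes the result while the session is still open.
            return count_headers(record["header"] for record in result)


# Offline illustration of the counting step with made-up header names.
example_counts = count_headers(["species", "organism", "species"])
```

The most frequently visited header, for example `example_counts.most_common(1)`, is then a natural candidate classification for the starting node.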

GPT User Interface for Graph Management

Thank you!
https://www.slideshare.net/neo4j/government-graphsummit-and-then-there-were-15-standards
