Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs

1
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth - @pgroth
Elsevier Labs
BigNet : WWW 2018
Thanks to Ron Daniel, Brad Allen & the Labs Team
Empowering Knowledge™

2
Outline
Goal: to tell you our current thinking and to get your feedback
• Why we’re interested
• What we’ve tried
• What we’re missing
• Webby Data
• 2 Sources of Semantics
• State of the art
• What’s missing
Warning: the back half is a (probably incomplete) literature review, so think of it as a set of pointers

5
EMMeT (Elsevier Merged Medical Taxonomy)
EMMeT is a multilingual, concept-based clinical ontology
• Multilingual: English, French, Spanish
• Concept-based: All terms, synonyms, translations, mappings are
related to a unique identifier (“IMUI”)
• Ontology: Provides semantic relationships between concepts
(symptoms of a disease, treatment procedures of a disease,
complications of a disease or a procedure, etc…)
EMMeT is a controlled reference terminology
• Based on Unified Medical Language System (UMLS), standard clinical
terminologies as well as Elsevier proprietary vocabularies and lists of
acronyms
• Explicitly mapped to international medical standards (SNOMED-CT,
ICD-9-CM, ICD-10-CM, LOINC, RxNorm, CVX, etc.) and Elsevier’s
vocabularies (Gold Standard, EMTREE, etc.)
EMMeT is current
• Continuously updated, and released every 12 weeks for automatic
indexing
• Updated daily and available via an API for manual tagging access
• Maintained by a team of medical terminology experts

6
Automated Tagging
Manual Tagging/
Data Structuring
Products and platforms using EMMeT
Clinical Solutions
ClinicalKey Global
ClinicalKey ANZ
ClinicalKey France
ClinicalKey Espanol
ClinicalKey Nursing
ClinicalKey German
ClinicalKey Nursing ANZ
ClinicalKey Brazil
Amirsys Decision Point
RP/STMJ
Health Advance
The Lancet
Cell
LexisNexis
MedMal Nav
LN Insight
Legend
In production
In Pilot
In Pipeline
Nursing Education
Mosby’s Dictionary
Clinical Solutions
PoC - Clinical Overviews
ClinicalKey HL7 API
Health Analytics
IDS FHIR API/Apps
Dorland’s Dictionary
Patient Engagement
Gold Standard CP
ERC
Content 2.0
Nursing Education
Sherpath
EMEALAAP
MedEnact
RP/STMJ (SCT)
Health Advance
The Lancet
Cell

7
EMMeT Clinical Knowledge Graph

8
Rankings of EMMeT’s ontological relationships
• Relationships are ranked according to a 5-tiered ranking model, for simplicity and accessibility:
• 10: best option;
• 9: second option, when the rank of 10 is not applicable;
• 8: given two concepts that are too general to be directly related to a specific disease;
• 7: used as an outlier;
• 6: default / non-validated.
Relationship | Ranking criteria | 10 | 9 | 8 | 7
has cause | - | most common | common | sometimes | rare
has clinical finding | - | most common | common | sometimes | rare
has_complication | severity (disease) | severe/death | high | moderate | low morbidity
has_complication | prevalence (disease) | strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs | rare occurrence
has_complication | severity (procedure) | critical/death | major | moderate | minor
has comorbidity | - | strongly associated | commonly associated | sometimes associated | rarely associated
has screening procedure | - | best choice | is done | sometimes done | rarely done
has risk factor | - | strongly associated | commonly associated | sometimes associated | rarely associated
has diagnostic procedure | - | best choice | commonly done | sometimes done | rarely done
has differential diagnosis | - | strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs/low prevalence | rare occurrence
has drug | - | best choice | 2nd line | 3rd line | rarely given
has contraindication drug | - | strongly avoid/black box | commonly avoid | sometimes avoid | rarely avoid
has treatment procedure | - | best choice | commonly done | sometimes done | rarely done
has prevention | - | best option | common option | sometimes advised | rarely advised
has physician specialty | - | specific specialty | general/specialty | broad | rare
has device | - | standard device | acceptable device | sometimes used | rarely used

9
From EMMeT to H-Graph
• Based on EMMeT
• Support more complex relations including patient context (Clinical Overview content + more)
• Flexible and extensible model to support links to content, model treatment strategies, numeric values, temporal data, etc. Age, sex, weight, … are very simple context.
“In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48 hours or is uncertain.” NICE Guideline Atrial Fibrillation: Management
• Continue to support existing indexing pipelines (e.g. ClinicalKey), and tagging use cases (e.g. Clinical Overviews)
From EMMeT… …To H-Graph

11
Universal schemas
• … are a specific technique from the Information Extraction and the Automatic Knowledge Base
Completion literature
• … are an unsupervised method to ‘learn’ by combining text extracts with existing knowledge base
assertions
• Applications:
• Extend a medical knowledge base
• scan incoming literature to suggest new additions to EMMeT and show the
underlying evidence to the taxonomy editor.
• scan literature backlog to find evidence for data already in EMMeT
• Literature Surveillance
• scan incoming literature to find existing facts even if expressed in very different ways
• find new concepts in the literature related to an existing EMMeT concept*. Let taxonomy
editor decide whether to add new concept and relation to EMMeT

12
Open Information Extraction
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which relies only on some knowledge of grammar
• One weird trick for open information extraction …
• ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
In addition, brain scans were performed to exclude
other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
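Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the ReVerb implementation: it assumes the tokens are already part-of-speech tagged (the one-letter tags below are hand-assigned for this example), and the names `REL`, `NP_BEFORE`, `NP_AFTER` and `extract_triples` are hypothetical. The noun-phrase pattern is deliberately crude, so the right-hand argument stops before "of dementia".

```python
import re

# Simplified one-letter tags (an assumption of this sketch, not ReVerb's tag set):
# N noun, J adjective, D determiner, V verb, P preposition/particle, O other.
REL = re.compile(r"V+(?:PV+)*P?")      # starts with a verb, ends with a verb or preposition
NP_BEFORE = re.compile(r"[DJ]*N+$")    # noun phrase ending right before the relation phrase
NP_AFTER = re.compile(r"[DJ]*N+")      # noun phrase starting right after the relation phrase

def extract_triples(tagged):
    """Return (argument, relation phrase, argument) triples from (word, tag) pairs."""
    words = [w for w, _ in tagged]
    tags = "".join(t for _, t in tagged)   # one character per token, so indices line up
    triples = []
    for rel in REL.finditer(tags):
        left = NP_BEFORE.search(tags[:rel.start()])
        right = NP_AFTER.match(tags, rel.end())
        if left and right:
            triples.append((
                " ".join(words[left.start():left.end()]),
                " ".join(words[rel.start():rel.end()]),
                " ".join(words[right.start():right.end()]),
            ))
    return triples

# The example sentence from the slide, with hand-assigned tags.
SENTENCE = [
    ("In", "P"), ("addition", "N"), (",", "O"), ("brain", "N"), ("scans", "N"),
    ("were", "V"), ("performed", "V"), ("to", "P"), ("exclude", "V"),
    ("other", "J"), ("causes", "N"), ("of", "P"), ("dementia", "N"), (".", "O"),
]

print(extract_triples(SENTENCE))
```

Working on a string of one-character tags keeps the regular-expression offsets aligned with token positions, which is why the tag set is reduced to single letters here.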

13
ReVerb output
After ReVerb pulls out noun phrases, match them up to EMMeT concepts
Discard rare concepts, relations, or relations that are not used with many different concepts
SD documents scanned: 14,000,000
Extracted ReVerb triples: 473,350,566
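The discarding step can be pictured as a simple frequency filter. The sketch below is an assumption about how such a filter might look, not the production pipeline: `filter_relations` and the `min_pairs` threshold are hypothetical, and the triples are made-up toy data.

```python
from collections import defaultdict

def filter_relations(triples, min_pairs=2):
    """Keep only triples whose relation phrase occurs with at least
    `min_pairs` distinct (subject, object) argument pairs."""
    pairs_by_relation = defaultdict(set)
    for subj, rel, obj in triples:
        pairs_by_relation[rel].add((subj, obj))
    keep = {rel for rel, pairs in pairs_by_relation.items() if len(pairs) >= min_pairs}
    return [t for t in triples if t[1] in keep]

# Toy triples, invented for illustration only.
TRIPLES = [
    ("brain scans", "were performed to exclude", "other causes"),
    ("blood tests", "were performed to exclude", "infection"),
    ("the patient", "was cheerfully discussing", "the weather"),
]

print(filter_relations(TRIPLES))
```

The same counting idea could be applied to concepts instead of relation phrases; the threshold would need tuning against the real corpus.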

14
Universal schemas - Initialization
• Method to combine ‘facts’ found by
machine reading with stronger
assertions from ontology.
• Build ExR matrix with entity-pairs
as rows and relations as columns.
• Relation columns can come from
EMMeT, or from ReVerb
extractions.
• Cells contain 1.0 if that pair of
entities is connected by that
relation.
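As a rough illustration of the initialization described above, the following sketch builds the entity-pair by relation matrix from a mixed list of triples. The entity and relation names are placeholders invented for the example, and `build_matrix` is a hypothetical helper; a real matrix of this size would need a sparse representation.

```python
def build_matrix(triples):
    """Return (rows, cols, matrix): entity pairs as rows, relations as columns,
    and 1.0 in each cell where the pair is connected by the relation."""
    rows = sorted({(s, o) for s, _, o in triples})
    cols = sorted({r for _, r, _ in triples})
    row_index = {pair: i for i, pair in enumerate(rows)}
    col_index = {rel: j for j, rel in enumerate(cols)}
    matrix = [[0.0] * len(cols) for _ in rows]
    for s, r, o in triples:
        matrix[row_index[(s, o)]][col_index[r]] = 1.0
    return rows, cols, matrix

# Placeholder data: one ontology-style relation and one surface-form relation.
TRIPLES = [
    ("disease_a", "has_drug", "drug_x"),          # stands in for an ontology assertion
    ("disease_a", "is treated with", "drug_x"),   # stands in for a text extraction
    ("disease_b", "is treated with", "drug_y"),
]

rows, cols, matrix = build_matrix(TRIPLES)
for pair, row in zip(rows, matrix):
    print(pair, row)
```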

15
Universal schemas - Prediction
• Factorize matrix to ExK and KxR,
then recombine.
• “Learns” the correlations between
text relations and EMMeT relations,
in the context of pairs of objects.
• Find new triples to go into EMMeT, e.g., (glaucoma, has_alternativeProcedure, biofeedback)
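The factorize-then-recombine idea can be sketched in plain Python as below. This is only an illustration under stated assumptions: it uses a logistic matrix factorization trained by stochastic gradient steps and treats every empty cell as a negative example, which is a simplification and not necessarily how the models in the universal schema literature are trained. The function names, the toy matrix and all hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def factorize(matrix, k=2, epochs=200, lr=0.1, seed=0):
    """Learn an E x K factor list for rows and an R x K factor list for columns."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_f = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]
    col_f = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_cols)]
    for _ in range(epochs):
        for i in range(n_rows):
            for j in range(n_cols):
                score = sum(a * b for a, b in zip(row_f[i], col_f[j]))
                err = matrix[i][j] - sigmoid(score)   # gradient of the log-likelihood
                for f in range(k):
                    r, c = row_f[i][f], col_f[j][f]
                    row_f[i][f] += lr * err * c
                    col_f[j][f] += lr * err * r
    return row_f, col_f

def recombine(row_f, col_f):
    """Multiply the factors back together to score every cell, observed or not."""
    return [[sigmoid(sum(a * b for a, b in zip(r, c))) for c in col_f] for r in row_f]

# Toy entity-pair x relation matrix, invented for illustration.
MATRIX = [
    [1.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
]

row_f, col_f = factorize(MATRIX)
for row in recombine(row_f, col_f):
    print([round(p, 2) for p in row])
```

Cells that were empty in the input but score highly after recombination are the candidate triples that would be passed to a human curator.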

16
[Figure: universal schema pipeline diagram. Stage labels: Open Information Extraction / Triple Extraction, Entity Resolution / Concept Resolution, Matrix Construction, Matrix Factorization, Matrix Completion, Curation. Artifact labels: Content, Taxonomy, Surface form relations, Structured relations, Universal schema, Factorization model, Predicted relations, Knowledge graph. Volumes shown: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
ONTOLOGY MAINTENANCE
• Pretty good F-measure, around 0.7
• Good enough with human in the loop
• But we want more!

17
Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation
methods." Semantic web 8.3 (2017): 489-508.
WHERE TO GO?

18
MORE THAN LINK PREDICTION
• Data has a deep hierarchy; link prediction flattens this
• Data has hooks into specific content
• Schemas are increasingly richly defined – not just a
single type
• N-ary relations

19
OUR KGS SHARE PROPERTIES WITH WEB KGS
Ringler, Daniel, and Heiko Paulheim. "One knowledge graph to rule them all? Analyzing
the differences between DBpedia, YAGO, Wikidata & co." Joint German/Austrian
Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2017.

20
The Web of Data
http://webdatacommons.org/structureddata/2017-12/stats/stats.html
http://lodlaundromat.org

21
Two sources of semantics
1. Dereferenceability
2. Rules

22
Dereferenceability
Looking up definitions: natural language and programmatic
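A programmatic lookup typically means requesting the identifier over HTTP and asking for a machine-readable format. The sketch below only builds such a request with the Python standard library. `build_dereference_request` is a hypothetical helper, the choice of `text/turtle` is an assumption, and whether a given server honours the `Accept` header is not guaranteed.

```python
import urllib.request

def build_dereference_request(iri, media_type="text/turtle"):
    """Build (but do not send) an HTTP request asking for an RDF
    serialization of the resource named by `iri`."""
    return urllib.request.Request(iri, headers={"Accept": media_type})

request = build_dereference_request("http://schema.org/dateModified")
print(request.full_url, request.get_header("Accept"))

# Sending it would look like this (requires network access):
# with urllib.request.urlopen(request) as response:
#     definition = response.read().decode("utf-8")
```

Requesting `text/html` instead would be the natural-language side of the same lookup, since many vocabularies serve a human-readable page at the same address.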

24
Pay attention to the underlying data
Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574

25
Embed more
Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types,
descriptions, and context. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (pp. 2681-2690).

26
Embed more
Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer:
Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the
Knowledge Capture Conference. ACM, 2017.

27
Social Semantics?
de Rooij, S., Beek, W., Bloem, P., van Harmelen, F., & Schlobach, S. (2016, October).
Are Names Meaningful? Quantifying Social Meaning on the Semantic Web.
In International Semantic Web Conference (pp. 184-199). Springer, Cham.
• Distributional semantics for
identifiers (NTN)
• But uses the global network
• Could we use the discussion
space as well?
NTN - Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013).
Reasoning with neural tensor networks for knowledge base
completion. In Advances in neural information processing
systems (pp. 926-934).

28
Rules
https://www.w3.org/TR/rdf11-mt/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
schema:dateModified a rdf:Property ;
    rdfs:label "dateModified" ;
    schema:domainIncludes schema:CreativeWork, schema:DataFeedItem ;
    schema:rangeIncludes schema:Date, schema:DateTime ;
    rdfs:comment "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." .
schema:datePublished a rdf:Property ;
    rdfs:label "datePublished" ;
    schema:domainIncludes schema:CreativeWork ;
    schema:rangeIncludes schema:Date ;
    rdfs:comment "Date of first broadcast/publication." .
schema:disambiguatingDescription a rdf:Property ;
    rdfs:label "disambiguatingDescription" ;
    schema:domainIncludes schema:Thing ;
    schema:rangeIncludes schema:Text ;
    rdfs:comment "A sub property of description. A short description of the item used to disambiguate from other, similar items. Information from other properties (in particular, name) may be necessary for the description to be useful for disambiguation." ;
    rdfs:subPropertyOf schema:description .
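To make the "rules" source of semantics concrete, here is a minimal sketch of one RDFS entailment rule from the RDF 1.1 Semantics document linked above: rdfs7, which propagates a statement from a property to its super-property. It assumes triples are plain Python tuples of prefixed names; `infer_subproperty`, `ex:item1` and its value are hypothetical.

```python
SUBPROP = "rdfs:subPropertyOf"

def infer_subproperty(triples):
    """Apply rdfs7 until nothing new is derived:
    (p subPropertyOf q) and (s p o) entail (s q o)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        sub_of = {(s, o) for s, p, o in facts if p == SUBPROP}
        for s, p, o in list(facts):
            for sub, sup in sub_of:
                if p == sub and (s, sup, o) not in facts:
                    facts.add((s, sup, o))
                    changed = True
    return facts

# The subproperty axiom comes from the vocabulary snippet; the instance triple is invented.
FACTS = [
    ("schema:disambiguatingDescription", SUBPROP, "schema:description"),
    ("ex:item1", "schema:disambiguatingDescription", "A short note"),
]

for triple in sorted(infer_subproperty(FACTS)):
    print(triple)
```

Note that `schema:domainIncludes` and `schema:rangeIncludes` are not `rdfs:domain` and `rdfs:range`, so the RDFS domain and range rules do not fire on them; that is why the sketch sticks to the subproperty rule.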

29
Injecting Background Knowledge as Constraints
Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation
extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 1119-1129)

30
Learning Rules
Yang, Fan, Zhilin Yang, and William W. Cohen. "Differentiable learning of logical
rules for knowledge base reasoning." Advances in Neural Information Processing
Systems. 2017.

31
Combining Both: supporting complex reasoning with subsymbolic representations
Rocktäschel, T., & Riedel, S. (2017). End-to-end
differentiable proving. In Advances in Neural Information
Processing Systems (pp. 3791-3803).

32
Future
Welbl, J., Stenetorp, P., & Riedel, S. (2017). Constructing Datasets for
Multi-hop Reading Comprehension Across Documents. arXiv preprint
arXiv:1710.06481.
• Scale
• The knowledge base == text?
• Multi-hop reasoning
• Is everything end-to-end differentiable?

33
Conclusion
• In practice: data is webby data
• Messy
• Interconnected
• Constraints and rules associated
• Semantic Web: semantics can come from multiple different sources
• Explicit & implicit
• Take advantage of those sources
• Knowledge graphs benefit from inference
• Your thoughts?
• Thanks & We’re hiring!
p.groth@elsevier.com | pgroth.com
labs.elsevier.com

35
INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g., mapping-based infobox extractor
• Infobox uses a hand-built ontology based on the 350 most commonly used English language infoboxes
• Integrates with Yago
• Yago relies on Wikipedia + Wordnet
• Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies
• Wordnet is built by psycholinguists

36
Units & Measurement Annotations
• Time
• Dosage
• Probability
• Percent
• Count
• Not handled yet
Find numbers followed by a unit name or abbreviation (perhaps with scale factor like k, m, G, …). Provide value
normalized to SI units. Also provide type of measurement (time, temperature, length, mass, dosage, etc.) based on
unit. Handling tolerances, ranges, probabilities, and counts adds complexity. Conjunctions not yet handled but very
important.
Current work – identify the property being measured (e.g. dosages of AA, indomethacin, HtE, leptin, etc.)
Additionally at 120 min following glucose administration, the 100 mg/kg 5g and 5e groups had significantly (P ⩽ 0.005) a greater drop in blood glucose than the 10 and 50 mg/kg groups.
In the mouse xenograft model of LLC cells in C57BL/6J mice, once daily administration of AA (50 and 100 mg/kg) inhibited tumor growth in a dose-dependent manner (Fig. 6A and C).
Groups of Swiss mice (n = 6) were treated (p.o.) with vehicle, indomethacin (10 mg/kg-Roche®) or HtE (50, 100 or 200 mg/kg) 1 h before administration of carrageenan at 2.5% (Sigma-Aldrich®) injected subcutaneously into the plantar region of the left hind paw and phosphate buffer saline (PBS) in right hind paw.
In the experiments designed to study the antidepressant-like effect of the repeated treatment (for 14 days) of EET, the immobility time in the TST and the locomotor activity in the open-field were assessed in independent groups of mice 24 h after the last daily administration of EET (10–100 mg/kg, p.o.).
Hoppers containing chow were removed from the cages 1 h before the administration of leptin [depending on studies, 5 mg/kg or 2.5 mg/kg, ip; mouse recombinant leptin obtained from Dr. A.F.
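The number-plus-unit matching described on this slide can be sketched with a regular expression and a small lookup table. This is an assumption-laden illustration rather than the annotator itself: the `UNITS` table, its conversion targets and `find_measurements` are hypothetical, only three units are covered, and the example sentence is shortened from the first one above. As the slide notes, conjunctions are not handled, so the "10" in "10 and 50 mg/kg" is missed.

```python
import re

# Hypothetical unit table: unit -> (measurement type, factor to SI, SI unit).
UNITS = {
    "mg/kg": ("dosage", 1e-6, "kg/kg"),
    "min": ("time", 60.0, "s"),
    "h": ("time", 3600.0, "s"),
}

# Longest units first so that "mg/kg" and "min" are tried before shorter ones.
_UNIT_ALTERNATIVES = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# A number not glued to a preceding word, then a known unit not followed by a letter.
PATTERN = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _UNIT_ALTERNATIVES + r")(?![A-Za-z])")

def find_measurements(text):
    """Return (value, unit, measurement type, SI value, SI unit) tuples."""
    found = []
    for match in PATTERN.finditer(text):
        value = float(match.group(1))
        unit = match.group(2)
        kind, factor, si_unit = UNITS[unit]
        found.append((value, unit, kind, value * factor, si_unit))
    return found

SENTENCE = ("Additionally at 120 min following glucose administration, the 100 mg/kg "
            "5g and 5e groups had a greater drop in blood glucose than the 10 and 50 mg/kg groups.")

for measurement in find_measurements(SENTENCE):
    print(measurement)
```

Group labels such as "5g" and "5e" are skipped only because "g" and "e" are absent from the toy unit table; a fuller table would need context to tell a group name from five grams.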
