Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth - @pgroth
BigNet : WWW 2018
Thanks to Ron Daniel, Brad Allen & the Labs Team
Goal: to tell you our current thinking and to get your feedback
• Why we’re interested
• What we’ve tried
• What we’re missing
• Webby Data
• 2 Sources of Semantics
• State of the art
• What’s missing
Warning: The back half is a literature review, and probably an incomplete one,
so treat it as a set of pointers
EMMeT (Elsevier Merged Medical Taxonomy)
EMMeT is a multilingual, concept-based clinical ontology
• Multilingual: English, French, Spanish
• Concept-based: All terms, synonyms, translations, mappings are
related to a unique identifier (“IMUI”)
• Ontology: Provides semantic relationships between concepts
(symptoms of a disease, treatment procedures of a disease,
complications of a disease or a procedure, etc…)
EMMeT is a controlled reference terminology
• Based on Unified Medical Language System (UMLS), standard clinical
terminologies as well as Elsevier proprietary vocabularies and lists of
• Explicitly mapped to international medical standards (SNOMED-CT,
ICD-9-CM, ICD-10-CM, LOINC, RXNorm, CVX, etc.) and Elsevier’s
vocabularies (Gold Standard, EMTREE, etc.)
EMMeT is current
• Continuously updated, and released every 12 weeks for automatic
• Updated daily and available via an API for manual tagging access
• Maintained by a team of medical terminology experts,
Products and platforms using EMMeT
ClinicalKey Nursing ANZ
Amirsys Decision Point
PoC - Clinical Overviews
ClinicalKey HL7 API
IDS FHIR API/Apps
Gold Standard CP
EMMeT Clinical Knowledge Graph
Rankings of EMMeT’s ontological relationships
• Relationships are ranked according to a 5-tiered ranking model, for simplicity and accessibility.
• 10: best option.
• 9: second option, used when a rank of 10 is not applicable.
• 8: used when the two concepts are too general to be directly related to a specific disease.
• 7: used as an outlier.
• 6: default / not validated.
Relationship | 10 | 9 | 8 | 7
has cause | most common | common | sometimes | rare
has clinical finding | most common | common | sometimes | rare
has_complication severity (disease) | severe/death | high | moderate | low morbidity
has_complication prevalence (disease) | Strong occurrence/high prevalence | Likely occurrence/commonly prevalent | Sometimes occurs | Rare occurrence
has_complication severity (procedure) | critical/death | major | moderate | minor
has comorbidity | strongly associated | Commonly associated | Sometimes associated | Rarely associated
has screening procedure | best choice | is done | sometimes done | rarely done
has risk factor | strongly associated | Commonly associated | Sometimes associated | Rarely associated
has diagnostic procedure | best choice | commonly done | sometimes done | rarely done
has differential diagnosis | Strong occurrence/high prevalence | Likely occurrence/commonly prevalent | Sometimes occurs/low prevalence | Rare occurrence
has drug | best choice | 2nd line | 3rd line | rarely given
has contraindication drug | Strongly avoid/black box | Commonly avoid | Sometimes avoid | Rarely avoid
has treatment procedure | best choice | commonly done | sometimes done | rarely done
has prevention | Best option | common option | sometimes advised | rarely advised
has physician specialty | specific specialty | general/specialty | broad | rare
has device | standard device | acceptable device | sometimes used | rarely used
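A ranked relationship can be pictured as an ordinary triple plus a rank from the model above. The following is a minimal sketch, not EMMeT's actual data model: the class name, the helper `at_least`, the rank values attached to the example triples, and the placeholder concept are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Rank labels paraphrased from the ranking model; 6 is the default tier.
RANK_MEANING = {
    10: "best option",
    9: "second option",
    8: "concepts too general for a specific disease",
    7: "outlier",
    6: "default / not validated",
}


@dataclass(frozen=True)
class RankedRelation:
    subject: str
    predicate: str
    obj: str
    rank: int = 6  # relations start out as default / not validated

    def __post_init__(self):
        if self.rank not in RANK_MEANING:
            raise ValueError(f"unknown rank {self.rank}")


def at_least(relations, min_rank):
    """Keep relations ranked min_rank or higher, strongest first."""
    kept = [r for r in relations if r.rank >= min_rank]
    return sorted(kept, key=lambda r: r.rank, reverse=True)


# The ranks below are made up for the example, not taken from EMMeT.
relations = [
    RankedRelation("breast cancer", "has_treatment", "radical mastectomy", 9),
    RankedRelation("breast cancer", "has diagnostic procedure", "breast biopsy", 10),
    RankedRelation("breast cancer", "has risk factor", "placeholder concept"),
]
strong = at_least(relations, 9)
```

Filtering on rank like this is one way a downstream product could show only the best-supported relations while the unvalidated ones stay in the graph.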
From EMMeT to H-Graph
• Based on EMMeT
• Support more complex relations including patient context (Clinical Overview content + more)
• Flexible and extensible model to support links to content, model treatment strategies, numeric values, temporal
data, etc. Age, sex, weight, … are very simple context.
In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate
or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48
hours or is uncertain. NICE Guideline Atrial Fibrillation: Management
• Continue to support existing indexing pipelines (e.g. ClinicalKey), and tagging use cases (e.g. Clinical Overviews)
From EMMeT… …To H-Graph
• … are a specific technique from the Information Extraction and the Automatic Knowledge Base
• … are an unsupervised method to ‘learn’ by combining text extracts with existing knowledge base
• Extend a medical knowledge base
• scan incoming literature to suggest new additions to EMMeT and show the
underlying evidence to the taxonomy editor.
• scan literature backlog to find evidence for data already in EMMeT
• Literature Surveillance
• scan incoming literature to find existing facts even if expressed in very different ways
• find new concepts in the literature related to an existing EMMeT concept*. Let taxonomy
editor decide whether to add new concept and relation to EMMeT
Open Information Extraction
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For broad knowledge base, use Open Information Extraction that only uses some knowledge of grammar
• One weird trick for open information extraction …
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
In addition, brain scans were performed to exclude
other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
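The three steps can be sketched in a few lines. This is a simplified illustration, not ReVerb itself: it assumes the tokens are already tagged with coarse one-letter part-of-speech codes (hand-assigned here), and the names `REL`, `NP`, `extract` and `keep_supported` are hypothetical. The regular expression is a rough stand-in for the syntactic constraint in Fader et al.

```python
import re
from collections import defaultdict

# Coarse one-letter POS codes (an assumption of this sketch):
# V verb, P preposition/particle/"to", N noun, A adjective,
# D determiner, R adverb, X anything else.
REL = re.compile(r"(?:V[NADR]*P|V)+")  # starts with a verb, ends with a verb or preposition
NP = re.compile(r"[DA]*N+")            # very rough noun phrase


def extract(tagged):
    """Steps 1 and 2: return (arg1, relation phrase, arg2) triples."""
    words = [w for w, _ in tagged]
    codes = "".join(c for _, c in tagged)
    triples = []
    for rel in REL.finditer(codes):
        left = list(NP.finditer(codes, 0, rel.start()))  # noun phrases before
        right = NP.search(codes, rel.end())              # first noun phrase after
        if left and right:
            nearest = left[-1]
            triples.append((
                " ".join(words[nearest.start():nearest.end()]),
                " ".join(words[rel.start():rel.end()]),
                " ".join(words[right.start():right.end()]),
            ))
    return triples


def keep_supported(triples, min_pairs=2):
    """Step 3: drop relation phrases seen with too few distinct argument pairs."""
    pairs = defaultdict(set)
    for arg1, rel, arg2 in triples:
        pairs[rel].add((arg1, arg2))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]


sentence = [
    ("In", "P"), ("addition", "N"), (",", "X"), ("brain", "N"), ("scans", "N"),
    ("were", "V"), ("performed", "V"), ("to", "P"), ("exclude", "V"),
    ("other", "A"), ("causes", "N"), ("of", "P"), ("dementia", "N"), (".", "X"),
]
triples = extract(sentence)
```

On the example sentence this sketch yields a single triple whose relation phrase is the long "were performed to exclude other causes of". With only one sentence as input, `keep_supported` then discards it, which is the kind of over-specific phrase step 3 is meant to filter. The real system's output on this sentence may differ.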
After ReVerb pulls out noun phrases, match them up to EMMeT concepts
Discard rare concepts, relations, or relations that are not used with many different concepts
Number of SD documents scanned: 14,000,000
Extracted ReVerb triples: 473,350,566
Universal schemas - Initialization
• Method to combine ‘facts’ found by
machine reading with stronger
assertions from ontology.
• Build ExR matrix with entity-pairs
as rows and relations as columns.
• Relation columns can come from
EMMeT, or from ReVerb
• Cells contain 1.0 if that pair of
entities is connected by that
Universal schemas - Prediction
• Factorize matrix to ExK and KxR,
• “Learns” the correlations between
text relations and EMMeT relations,
in the context of pairs of objects.
• Find new triples to go into EMMeT
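The initialization and prediction steps can be illustrated with a toy matrix. This is a minimal sketch in plain Python under several assumptions: the entity pairs and the two text relations are invented placeholders, unobserved cells are simply treated as negatives (universal-schema work typically samples negatives or uses a ranking loss instead), and `train` and `score` are hypothetical helper names rather than the code used in the paper.

```python
import math
import random

# Rows are entity pairs, columns are relations (placeholders, not real data).
rows = [("disease_1", "drug_1"), ("disease_2", "drug_2"), ("disease_3", "drug_3"),
        ("disease_4", "cause_4"), ("disease_5", "cause_5")]
cols = ["text:is treated with", "emmet:has drug", "text:is caused by"]

# Cells set to 1.0: the pair was seen with that relation.
positives = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (3, 2), (4, 2)}
held_out = (2, 1)  # candidate new EMMeT triple, hidden during training


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(n_rows, n_cols, positives, skip, k=2, epochs=500, lr=0.1, seed=0):
    """Factorize the E x R matrix into E x K and K x R by logistic SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_rows)]
    R = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_cols)]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if (i, j) not in skip]
    for _ in range(epochs):
        for i, j in cells:
            y = 1.0 if (i, j) in positives else 0.0
            err = y - sigmoid(sum(a * b for a, b in zip(P[i], R[j])))
            for d in range(k):
                p, r = P[i][d], R[j][d]
                P[i][d] += lr * err * r
                R[j][d] += lr * err * p
    return P, R


def score(P, R, i, j):
    """Predicted probability that pair i holds relation j."""
    return sigmoid(sum(a * b for a, b in zip(P[i], R[j])))


P, R = train(len(rows), len(cols), positives, skip={held_out})
candidate = score(P, R, *held_out)  # pair seen only with "is treated with"
control = score(P, R, 3, 1)         # pair seen only with "is caused by"
```

Because the text relation "is treated with" co-occurs with "has drug" for the first two pairs, the held-out cell should score higher than the control cell, which is the sense in which the factorization "learns" correlations between text relations and EMMeT relations.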
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
• Pretty good F-measure, around 0.7
• Good enough with human in the loop
• But we want more!
Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation
methods." Semantic web 8.3 (2017): 489-508.
WHERE TO GO?
MORE THAN LINK PREDICTION
• Data has deep hierarchy – link prediction flattens this
• Data has hooks into specific content
• Schemas are increasingly richly defined – not just a
• N-ary relations
OUR KGS SHARE PROPERTIES WITH WEB KGS
Ringler, Daniel, and Heiko Paulheim. "One knowledge graph to rule them all? Analyzing
the differences between DBpedia, YAGO, Wikidata & co." Joint German/Austrian
Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2017.
The Web of Data
Two sources of semantics
Looking definitions up – Natural Language and Programmatic
Pay attention to the underlying data
Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An
Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574
Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types,
descriptions, and context. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (pp. 2681-2690).
Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer:
Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the
Knowledge Capture Conference. ACM, 2017.
de Rooij, S., Beek, W., Bloem, P., van Harmelen, F., & Schlobach, S. (2016, October).
Are Names Meaningful? Quantifying Social Meaning on the Semantic Web.
In International Semantic Web Conference (pp. 184-199). Springer, Cham.
• Distributional semantics for
• But uses the global network
• Could we use the discussion
space as well?
NTN - Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013).
Reasoning with neural tensor networks for knowledge base
completion. In Advances in neural information processing
systems (pp. 926-934).
schema:dateModified a rdf:Property ;
rdfs:label "dateModified" ;
rdfs:comment "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." .
schema:datePublished a rdf:Property ;
rdfs:label "datePublished" ;
schema:domainIncludes schema:CreativeWork ;
schema:rangeIncludes schema:Date ;
rdfs:comment "Date of first broadcast/publication." .
schema:disambiguatingDescription a rdf:Property ;
rdfs:label "disambiguatingDescription" ;
schema:domainIncludes schema:Thing ;
schema:rangeIncludes schema:Text ;
rdfs:comment "A sub property of description. A short description of the item used to disambiguate from other, similar items. Information from other properties (in particular, name) may be necessary for the description to be useful for disambiguation." ;
rdfs:subPropertyOf schema:description .
Injecting Background Knowledge as Constraints
Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation
extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 1119-1129)
Yang, Fan, Zhilin Yang, and William W. Cohen. "Differentiable learning of logical
rules for knowledge base reasoning." Advances in Neural Information Processing
Combining Both – supporting complex reasoning with subsymbolic representations
Rocktäschel, T., & Riedel, S. (2017). End-to-end
differentiable proving. In Advances in Neural Information
Processing Systems (pp. 3791-3803).
Welbl, J., Stenetorp, P., & Riedel, S. (2017). Constructing Datasets for
Multi-hop Reading Comprehension Across Documents. arXiv preprint
• The knowledge base == text?
• Is everything end-to-end?
• In practice: data is webby data
• Constraints and rules associated
• Semantic Web: semantics can come from multiple different sources
• Explicit & implicit
• Take advantage of those sources
• Knowledge graphs benefit from inference
• Your thoughts?
• Thanks & We’re hiring!
email@example.com | pgroth.com
INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE ,
vol.28, no.5, pp.44,48, Sept.-Oct. 2013 doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g., mapping-based infobox extractor
• Infobox uses a hand-built ontology based on the 350
• Based on a commonly used English language
• Integrates with Yago
• Yago relies on Wikipedia + Wordnet
• Upper ontology from Wordnet and then a mapping to
Wikipedia categories based on frequencies
• Wordnet is built by psycholinguists
Units & Measurement Annotations
• Not handled yet
Find numbers followed by a unit name or abbreviation (perhaps with scale factor like k, m, G, …). Provide value
normalized to SI units. Also provide type of measurement (time, temperature, length, mass, dosage, etc.) based on
unit. Handling tolerances, ranges, probabilities, and counts adds complexity. Conjunctions not yet handled but very
Current work – identify the property being measured (e.g. dosages of AA, indomethacin, HtE, leptin, etc.)
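The basic annotation step (number, unit, SI normalization, measurement type) can be sketched with a regular expression and a small lookup table. This is an illustration under stated assumptions rather than the Labs implementation: the `UNITS` table, the `annotate` function and the choice of "kg/kg" as the normalized dosage unit are all hypothetical, and, as on the slide, conjunctions such as "50, 100 or 200 mg/kg" are not handled.

```python
import re

# unit -> (measurement type, factor to SI, SI unit); a tiny illustrative table.
# Scaled units (mg, kg, ...) are listed explicitly instead of parsing prefixes.
UNITS = {
    "mg/kg": ("dosage", 1e-6, "kg/kg"),
    "mg": ("mass", 1e-6, "kg"),
    "g": ("mass", 1e-3, "kg"),
    "kg": ("mass", 1.0, "kg"),
    "min": ("time", 60.0, "s"),
    "h": ("time", 3600.0, "s"),
}

# Longest unit names first so "mg/kg" wins over "mg".
_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)
MEASUREMENT = re.compile(
    r"(?<![\w.])(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>" + _alternatives + r")\b"
)


def annotate(text):
    """Return one record per 'number + unit' match, normalized to SI."""
    found = []
    for m in MEASUREMENT.finditer(text):
        kind, factor, si_unit = UNITS[m["unit"]]
        value = float(m["value"])
        found.append({
            "text": m.group(0),
            "value": value,
            "unit": m["unit"],
            "type": kind,
            "si_value": value * factor,
            "si_unit": si_unit,
        })
    return found


text = ("Groups of Swiss mice (n = 6) were treated (p.o.) with vehicle, "
        "indomethacin (10 mg/kg-Roche®) or HtE (50, 100 or 200 mg/kg) 1 h "
        "before administration of carrageenan")
found = annotate(text)
```

On this sentence the sketch finds "10 mg/kg", "200 mg/kg" and "1 h" but misses the 50 and 100 mg/kg doses, because the unit is only attached to the last number of the conjunction. Linking each value to the substance being dosed is the harder problem described under current work.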
Additionally at 120 min following glucose administration, the 100 mg/kg 5g and 5e groups had
significantly (P ⩽ 0.005) a greater drop in blood glucose than the 10 and 50 mg/kg groups.
In the mouse xenograft model of LLC cells in C57BL/6J mice, once daily administration of AA (50 and
100 mg/kg) inhibited tumor growth in a dose-dependent manner (Fig. 6A and C).
Groups of Swiss mice (n = 6) were treated (p.o.) with vehicle, indomethacin (10 mg/kg-Roche®) or HtE
(50, 100 or 200 mg/kg) 1 h before administration of carrageenan at 2.5% (Sigma-Aldrich®) injected
subcutaneously into the plantar region of the left hind paw and phosphate buffer saline (PBS) in
right hind paw.
In the experiments designed to study the antidepressant-like effect of the repeated treatment (for
14 days) of EET, the immobility time in the TST and the locomotor activity in the open-field were
assessed in independent groups of mice 24 h after the last daily administration of EET (10–100
Hoppers containing chow were removed from the cages 1 h before the administration of leptin
[depending on studies, 5 mg/kg or 2.5 mg/kg, ip; mouse recombinant leptin obtained from Dr. A.F.
100+ years of expert knowledge
On the left side we see one concept, breast cancer, and a number of pieces of information about it such as synonyms, parent and child concepts, etc. On the right we see some ontological relations from breast cancer to other concepts, such as
(breast cancer, has diagnostic procedure, breast biopsy).
One of the major differences between EMMeT and what is in UMLS is that we not only provide the basic 3-part relationship, such as (breast cancer, has_treatment, radical mastectomy), we also provide information about the ‘strength’ of that relation according to current medical evidence.
Excerpt from the National Institute for Health and Care Excellence (NICE) guideline: In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48 hours or is uncertain.
Using EMMeT, and some code and data we already had, he built a quick prototype and tested it. Performance (in terms of accuracy of predictions) was surprisingly high.
Unsupervised is very important because it means the construction of the rough underlying knowledge base is scalable and not limited by the availability of experts.
Raw predictions not good enough for fully automatic operation, but are plenty good enough to help taxonomy editors and other people do their job much faster.
Integrates lots of information
Predict entity types
Conc (concatenation), SVD and PCA are the combination methods
verify the null hypothesis that names are statistically independent from the two meaning proxies
SRL performance TRL 2
And inductive logic programming
Translate to natural language (sli)
One type of NLP annotation Labs is implementing is to mark up measurements – find the quantity, the unit, any tolerances, etc. We also normalize them to SI standards so measurements can be compared and searched. This is not novel research. However, we have not found prior work that attempts to detect the specific object and property being measured. We are using several domain-specific scenarios (mouse cancer, concrete additives, NLP algorithm accuracy, neuronal properties) to find ways that information is expressed. For mouse cancer, it is relatively easy to detect that a measurement is a dosage of a particular drug. But those patterns are of little use in the other scenarios.
This work has application to the H-Graph: dosages, ages, weights, etc. are all important properties for the patient context. Cohort size and probability are important for the quality-of-evidence measures.