Principles of Health Informatics: Representing medical knowledge
1. Lecture 11: Representing medical knowledge
Dr. Martin Chapman
Principles of Health Informatics (7MPE1000). https://martinchapman.co.uk/teaching
2. Lecture structure
1. Terminologies: terms, groups, hierarchies and composition
2. Clinical terminologies and coding
3. Terminologies as models
4. Natural language processing
3. Learning outcomes
1. Be able to define terminologies, and related concepts such as
composition
2. Understand the concept of a clinical code
3. Understand how the limitations of models pass to terminologies
4. Be able to list and define the steps in named entity recognition
To understand how terminologies and natural language processing
support interventions.
It all comes back to interventions...
5. Terms
Previously, we saw how a terminology (a language), and the terms it
contains, provides us with a set of labels that we can apply to
symbols in the world.
In turn this allowed us to represent the state of the world.
Term(s)
Propellers
Plane
(has)
State
6. Terms vs concepts
In certain cases, there may be a single term commonly used for a symbol.
For example, propellers are mostly just known as propellers.
However, in other cases, there may be multiple terms for a given symbol.
For example, a plane can be referred to as one of a number of different
things. This flexibility is to be expected, but, in a terminology, we must
anchor these terms to a single concept for consistency.
Concept
Propellers
Sometimes these two ideas will
be conflated (including by
me!), but they are distinct.
Plane
We may just choose
one of the terms for
our concept.
Term(s)
Plane, aeroplane,
airliner, aircraft
Propellers
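As a rough sketch of this anchoring, we might store each term against a single concept. The dictionary and the `concept_for` helper are illustrative names, not part of any real terminology.

```python
# Hypothetical mapping from terms to the single concept that anchors them
TERM_TO_CONCEPT = {
    "plane": "Plane",
    "aeroplane": "Plane",
    "airliner": "Plane",
    "aircraft": "Plane",
    "propellers": "Propellers",
}

def concept_for(term):
    """Return the concept a term is anchored to, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.lower())

print(concept_for("Airliner"))   # Plane
print(concept_for("aeroplane"))  # Plane
```

Whichever term appears in the text, we end up with the same concept.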
7. Aside: The Great British Bread Debate
For more evidence that
there can be many terms
for the same concept...
8. Groups
Group
Aviation
If we need to reference our terms and concepts collectively, we can
also group them.
For example, the concept of propellers and planes may be collected
together in an aviation group.
Concept
Propellers
Plane
Term(s)
Propellers
Plane, aeroplane,
airliner, aircraft
9. Problem: Too many terms
If we think about all the terms
we've met so far in relation to
a plane... there are lots
Vehicles
Air
Land
Plane
Seaplane
Sinks
Water landing
Sea
10. Recall: Search space
In Lecture 4, we saw how we can use a hierarchical structure to
organise a search space. We can use a similar structure to organise
and connect our terms:
Root node
Leaf node
e.g. Patient Encounter
e.g. Individual diagnosis
12. Hierarchy: Organisation
This structure is neat.
It permits concept-driven exploration, which is akin to the heuristic
search techniques we saw previously. If I am to look for a plane I
know I am to explore the descendants of air, for example.
Conversely, if we did not know what planes were, this organisation
would tell us that planes were a type of air vehicle (and not a type of
land vehicle) simply from the structure of the terms themselves.
In other words, our hierarchy helps to add meaning to our terms.
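To make this concrete, here is a minimal sketch that stores the hierarchy as child-to-parent links and reads meaning off the structure. The exact links (for example, Seaplane sitting under Plane) are an assumption about the diagram, and `ancestors` is an illustrative helper.

```python
# Hypothetical child -> parent links for the vehicle hierarchy
PARENT = {
    "Air": "Vehicles",
    "Land": "Vehicles",
    "Sea": "Vehicles",
    "Plane": "Air",
    "Seaplane": "Plane",
}

def ancestors(term):
    """Walk up the hierarchy from a term to the root, collecting each parent."""
    path = []
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path

print(ancestors("Plane"))     # ['Air', 'Vehicles']
print(ancestors("Seaplane"))  # ['Plane', 'Air', 'Vehicles']
```

Even without knowing what a plane is, the path tells us it is a type of air vehicle and not a type of land vehicle.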
13. Hierarchy: Connections
We can further add meaning to our terms by dictating that each link
(edge) in our hierarchy expresses a particular type of relationship
between the two terms on each end of the connection.
Is-A: Seaplane, Plane. Useful if we're interested in understanding the
type of a given term in our hierarchy.
Part-whole: Propeller, Plane. Useful if we're interested in understanding
dependencies between terms.
Causal: Sinks, Water landing. Useful if we're interested in understanding
what else is true if we observe a given term.
The type of link between the
nodes in our hierarchy needs to
be clearly defined to avoid
confusion.
If a terminology
mixes link types, we
call it multi-axial
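One possible way to record link types is to store each link as a triple. This is only a sketch: the triples, the direction of the causal link, and the helper names are assumptions made for illustration.

```python
# Each link is (source term, link type, target term)
LINKS = [
    ("Seaplane", "is-a", "Plane"),
    ("Propeller", "part-of", "Plane"),
    ("Water landing", "causes", "Sinks"),  # direction assumed for illustration
]

def links_of_type(link_type, links=LINKS):
    """Return the (source, target) pairs joined by the given type of link."""
    return [(source, target) for source, kind, target in links if kind == link_type]

def is_multi_axial(links=LINKS):
    """A terminology that mixes link types is multi-axial."""
    return len({kind for _, kind, _ in links}) > 1

print(links_of_type("is-a"))  # [('Seaplane', 'Plane')]
print(is_multi_axial())       # True
```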
14. Properties
We may not just have terms in
our terminology, but also
properties.
Properties provide more
information about the term.
When we organise our terms
into a hierarchy with links,
properties are inherited across
these links (i.e. pass from one
term to another).
Vehicles
Air Land
Plane
Seaplane
Sea
Wings = 1
Wings = 1
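Inheritance can be sketched by walking up the links and collecting properties on the way. The links, the property name `wings`, and the `inherited_properties` helper are illustrative assumptions.

```python
# Hypothetical child -> parent links, and properties attached directly to terms
PARENT = {"Air": "Vehicles", "Plane": "Air", "Seaplane": "Plane"}
PROPERTIES = {"Plane": {"wings": 1}}

def inherited_properties(term):
    """Collect a term's own properties plus those inherited from its ancestors."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    properties = {}
    # Apply the root first so that more specific terms can override
    for ancestor in reversed(chain):
        properties.update(PROPERTIES.get(ancestor, {}))
    return properties

print(inherited_properties("Seaplane"))  # {'wings': 1}
print(inherited_properties("Air"))       # {}
```

Seaplane never states its own wing count; it receives it from Plane.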
15. Problem: Too many terms (again)
As terminologies grow and more
terms are added, we once again
(even with our hierarchical structure)
end up with an unwieldy number
of terms and, in some cases, even
potential term overlap (in a way
that can't be reconciled using a
unifying concept).
Vehicles
Air Land
Plane
Seaplane
Sea
Toy plane
Model plane Biplane
Cargo plane
16. Composition
Another school of
thought is to, instead,
agree on a fixed set of
basic terms, and to
compose newer, more
complex terms from
these basic terms.
But we need rules for
how terms can be
composed...
Vehicles
Air Land
Plane
Sea
Seaplane
17. Recall: Ontology
We saw previously how ontologies provide us with such restrictions.
We can also introduce
restrictions using our "is-a"
relationships.
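A minimal sketch of composition under rules might look like the following. The rule table and the `compose` function are hypothetical; a real ontology would express such restrictions far more richly.

```python
# A fixed set of basic terms
BASIC_TERMS = {"Vehicles", "Air", "Land", "Sea", "Plane"}

# Hypothetical rule set: which basic terms may qualify which other basic terms
ALLOWED_QUALIFIERS = {"Plane": {"Sea"}}

def compose(base, qualifier):
    """Build a composed term from two basic terms, if the rules allow it."""
    if base not in BASIC_TERMS or qualifier not in BASIC_TERMS:
        raise ValueError("both parts must be basic terms")
    if qualifier not in ALLOWED_QUALIFIERS.get(base, set()):
        raise ValueError(f"{qualifier} may not qualify {base}")
    return (qualifier, base)

print(compose("Plane", "Sea"))  # ('Sea', 'Plane'), i.e. a seaplane
```

Here Seaplane is not stored as a term of its own; it is built on demand, and a combination the rules do not allow raises an error.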
18. Composition and cost
Constructing a terminology such that its terms can be composed (post-
coordinated), including enforcing rules via something like an ontology,
obviously has a higher initial cost.
But as terminologies grow, this cost is ultimately less than a pre-
coordinated terminology (where terms are simply added to the
terminology), owing to the fact that hierarchically related terms may need
to be updated, and error checking may be complex.
We saw this idea in Lecture 3, when we observed that task-oriented and
placeholder-oriented structures have a high initial effort associated with
them, but can ultimately be beneficial in the long run.
20. Clinical terminologies
Much like the aviation domain, in the clinical domain we need a
language that provides us with a consistent way to represent, for
example, the state of a patient.
Clinical terminologies provide us with this, and have all the same
properties of general terminologies that we have just seen.
The terms included typically relate to patient diagnosis and
procedures performed.
Note: While clinical terminologies are technically distinct from medical terminologies
(which have a broader remit), you may find the two used interchangeably.
Similarly, you may also see terminologies referred to as classification systems.
22. Clinical codes
We saw earlier that concepts allow us to remain consistent in the
face of varying terms.
Clinical terminologies take this idea further by assigning a unique
code to each concept, which is often used instead.
Group: Cardio-metabolic
Concept: Type 2 Diabetes
Term(s): Type 2 Diabetes, Diabetes, T2DM
Code: 44054006
Note: Again, you may see references to medical
codes (technically a superset), or indeed diagnosis
codes (technically a subset), but these may also be
used interchangeably.
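The relationship between terms, concepts, groups and codes can be sketched as a small lookup. The data structure and the `code_for` helper are illustrative; only the code, concept, group and terms themselves come from the example above.

```python
# One entry per concept, keyed by its unique code
CONCEPTS = {
    "44054006": {
        "concept": "Type 2 Diabetes",
        "group": "Cardio-metabolic",
        "terms": {"type 2 diabetes", "diabetes", "t2dm"},
    },
}

def code_for(term):
    """Return the code of the concept a term belongs to, or None if unknown."""
    for code, entry in CONCEPTS.items():
        if term.lower() in entry["terms"]:
            return code
    return None

print(code_for("T2DM"))      # 44054006
print(code_for("Diabetes"))  # 44054006
```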
24. Coding
Given what we've seen, the process of coding is thus labelling, for
example, the condition(s) a patient has.
Labels are chosen from the terminology (rather than at random) to
improve consistency, and are stored somewhere, usually a patient's
Electronic Health Record (EHR).
Good clinicians will code directly, but often coding involves
interpreting existing medical text from the EHR.
Once codes are derived they are useful for the future interpretation of a
patient's state, and for activities like auditing.
25. Coding errors
In Lecture 3, we saw the concept of false positives and false
negatives.
The same concepts apply to the coding process: codes may be incorrectly
assigned (false positives), or not assigned (false negatives), to patients.
Coding errors may occur when an EHR is incorrect or incomplete; or
when the coder themselves does not correctly interpret the primary
reason for an encounter, does not have the correct expertise or
makes an entry error.
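If we had a trusted set of codes to compare against, coding errors could be counted with simple set operations. This is a sketch only: `CODE-A` and `CODE-B` are made-up placeholders, and `coding_errors` is an illustrative helper.

```python
def coding_errors(assigned, actual):
    """Compare the codes assigned to a patient with the codes that truly apply."""
    return {
        "false_positives": assigned - actual,  # assigned, but should not be
        "false_negatives": actual - assigned,  # should be assigned, but missing
    }

assigned = {"44054006", "CODE-A"}  # what the coder recorded
actual = {"44054006", "CODE-B"}    # what should have been recorded

errors = coding_errors(assigned, actual)
print(errors["false_positives"])  # {'CODE-A'}
print(errors["false_negatives"])  # {'CODE-B'}
```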
26. Computers and coding
One possible way to address
these issues is by introducing
computers into the coding
process.
For example, a computer may
help search free text, it may
assist during the actual entry of
the original information
(restricting inputs, for example)
or it may run the whole process of
coding from free text itself
(more later).
28. Terminologies as models
Recall that a terminology is (or forms part of) a data model.
As such, terminologies have the limitations associated with
(information) models like these.
29. Terminology limitations: Simplification
Models are always a simplification of the phenomenon they abstract.
There is, therefore, no such thing as a single, universal terminology
for a given domain, which allows us to adequately label all the
symbols we encounter.
This is compounded by the fact that not everyone will apply the
same label to a given symbol (the symbol grounding problem).
30. Terminology limitations: Simplification
Another outcome of the simplification process is that the nuances of
concepts are often lost when represented in a terminology:
1. Concepts rarely have a pure definition: one would typically
consider someone over 70 years old as elderly, but "elderly" can
also technically include pregnant women over a certain age.
2. As a result of the above, the meaning of concepts is often
context-dependent. The term elderly in a maternal context will
mean something other than it might do typically.
31. Terminology limitations: Snapshot and Purposive
A snapshot: Models are always a snapshot of the things they
represent, and this snapshot becomes less relevant over time.
Therefore, terminologies cannot capture how concepts change over
time.
Purposive: Models are built for a particular purpose, so it
follows that a terminology exists for a particular purpose (e.g.
labelling certain types of vehicles), and cannot necessarily be used for
other purposes.
32. Multiple clinical terminologies
As there is no such thing as a universal terminology, there is not just
a single clinical terminology (like SNOMED CT, the Systematized
Nomenclature of Medicine Clinical Terms, as we've seen), but
instead multiple clinical terminologies.
Often, these terminologies will try to cover different aspects of the
domain. For example, ICD-10 (International Classification of
Diseases, 10th Revision) focuses more on disease, while CPT (Current
Procedural Terminology) focuses more on interventions taken.
However, in many cases the same concept will be covered in multiple
terminologies.
33. Terminology Mapping
As such, if one institution, which adopts a particular terminology,
wants to interpret the data from another institution, which adopts a
different terminology, there will be a need to map from one set of
codes to another.
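At its simplest, a mapping can be sketched as a lookup table from one set of codes to another. The single entry below (the SNOMED CT code for Type 2 Diabetes to the ICD-10 category E11) is illustrative and not taken from an official map; `CODE-X` is a made-up placeholder and `map_codes` is a hypothetical helper.

```python
# Illustrative one-entry map between two terminologies
SNOMED_TO_ICD10 = {"44054006": "E11"}

def map_codes(codes, mapping):
    """Translate codes where a mapping exists, and report those left unmapped."""
    mapped, unmapped = [], []
    for code in codes:
        if code in mapping:
            mapped.append(mapping[code])
        else:
            unmapped.append(code)
    return mapped, unmapped

mapped, unmapped = map_codes(["44054006", "CODE-X"], SNOMED_TO_ICD10)
print(mapped)    # ['E11']
print(unmapped)  # ['CODE-X']
```

The unmapped list matters: concepts that exist in one terminology may have no counterpart in the other.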
34. Composition and mapping
While feasible in practice, mapping can, due to the issues with
terminologies we've seen, be a difficult process.
However, if two terminologies have been created by composing terms
from a single terminology, this mapping is more straightforward, as
they have a common core.
Pre-coordinated Post-coordinated
36. Automated coding
We saw earlier that computers can help solve issues with the coding
process.
One way in which they can do this is to take over the coding process
entirely.
We can generalise this to a computer applying labels from any
terminology to any text in order to make that text computable.
Let's return to our plane terminology...
37. Named entity recognition
We call this process named entity recognition, a subfield of natural
language processing.
A seaplane is a powered fixed-wing
aircraft capable of taking off from
water. It can also land on water.
There is often an air of adventure
about those who board these vehicles.
How do we apply
labels from our
terminology to this
text?
38. Aside: Programming code
Over the next few slides, I shall reinforce some of the ideas I show
using programming code.
Programming code represents a series of steps to solve a problem.
If this is ultimately more confusing, you are welcome to skip these
slides.
39. Pre-pass: Stop word removal
Before we do anything else, we need to remove words we aren't
interested in (e.g. articles). We refer to these words as stop words.
A seaplane is a powered fixed-wing
aircraft capable of taking off from
water. It can also land on water.
There is often an air of adventure
about those who board seaplanes.
40. In code: Stop word removal
import nltk
from nltk.corpus import stopwords

# fetch the stop word list on first use
nltk.download('stopwords', quiet=True)

with open('text.txt', 'r') as file:
    text = file.read()

# remove full stops, lower-case, and split on any whitespace (including newlines)
text = [word.replace(".", "").lower() for word in text.split()]

# get a list of stop words
stop = stopwords.words('english')
text_no_stop = [word for word in text if word not in stop]
print(text_no_stop)
41. Pre-pass: Stop word removal
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'taking', 'water',
'also', 'land', 'water', 'often', 'air',
'adventure', 'board', 'seaplanes'
What we're
ultimately left with
after this process is
a "bag of words".
42. First pass: "Bag of words" (basic)
To start, we might simply find words in the text that are an exact
match for those from our terminology.
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'taking', 'water',
'also', 'land', 'water', 'often', 'air',
'adventure', 'board', 'seaplanes'
43. First pass: "Bag of words" (basic) - Problem
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'taking', 'water',
'also', 'land', 'water', 'often', 'air',
'adventure', 'board', 'seaplanes'
But we've immediately hit a snag: what about plurals?
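The exact-match pass can be sketched as a membership test against the terminology. The `TERMINOLOGY` set below is an assumed, lower-cased version of the vehicle hierarchy; the bag of words is the one shown above.

```python
# Assumed terms from our vehicle terminology, lower-cased for matching
TERMINOLOGY = {"vehicles", "air", "land", "sea", "plane", "seaplane"}

BAG = ['seaplane', 'powered', 'fixed-wing',
       'aircraft', 'capable', 'taking', 'water',
       'also', 'land', 'water', 'often', 'air',
       'adventure', 'board', 'seaplanes']

# Keep only the words that exactly match a term
matches = [word for word in BAG if word in TERMINOLOGY]
print(matches)  # ['seaplane', 'land', 'air']
```

The plural 'seaplanes' at the end of the bag is missed, because it is not an exact match for 'seaplane'.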
44. Morphology
We often find groups of words that all refer to the same idea, only with
slight grammatical differences.
For example fly, flies, flew, flown, flying all refer to the idea of
moving through the air.
These sets of words are known as lexemes, and we often choose a
single word, or a lemma, to represent the whole set when, for
example, listing words in a dictionary, e.g. fly.
Morphological analysis is, in part, the process of finding this
canonical (or accepted base) form of a word.
45. Stemming vs. Lemmatization
When conducting named entity recognition, it is often important not
to look at words directly, but to instead look at their lemmas.
This is because it is more likely to be the lemma that is listed in a
terminology.
There are two approaches to deriving lemmas. The first, stemming,
is simple, and involves removing suffixes from a word, while the
second, lemmatization, follows more complex processes to derive a
lemma.
46. In code: Stemming vs. Lemmatization
Due to its simplicity, the process of stemming can often cause
issues. As such lemmatization may be preferred.
# Porter is a particular algorithm for stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("seaplanes"))
# The WordNet lemmatizer looks words up in a dictionary
# (requires nltk.download('wordnet') on first use)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("seaplanes"))
Output:
seaplan (stemmer)
seaplane (lemmatizer)
47. Second pass: "Bag of words" (Stemming)
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'take', 'water',
'also', 'land', 'water', 'often', 'air',
'adventure', 'board', 'seaplane'
With our words reduced to their lemmas, we can now properly identify
both instances of the seaplane.
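The second pass can be sketched by normalising each word before matching. To keep the example self-contained, a tiny hand-written lemma table stands in for a real lemmatizer; both the table and the `TERMINOLOGY` set are assumptions.

```python
# Assumed terms from our vehicle terminology, lower-cased for matching
TERMINOLOGY = {"vehicles", "air", "land", "sea", "plane", "seaplane"}

# Hypothetical lemma table standing in for a real lemmatizer
LEMMAS = {"seaplanes": "seaplane", "taking": "take"}

BAG = ['seaplane', 'powered', 'fixed-wing',
       'aircraft', 'capable', 'taking', 'water',
       'also', 'land', 'water', 'often', 'air',
       'adventure', 'board', 'seaplanes']

# Replace each word with its lemma (if we know one), then match as before
normalised = [LEMMAS.get(word, word) for word in BAG]
matches = [word for word in normalised if word in TERMINOLOGY]
print(matches)  # ['seaplane', 'land', 'air', 'seaplane']
```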
48. Exercise: Uses of word-level processing
Remove stop words and create lemmas from the following text:
Patient has a pain in her left arm. This issue has
been present for several days. A treatment has been
prescribed accordingly.
Note: For stop words, there is no perfect answer here. Different people/systems have different
interpretations of what constitutes a stop word. And this isn't always based on grammar.
49. Exercise: Uses of word-level processing
Remove stop words and create lemmas from the following text:
Patient has a pain in her left arm. This issue has
been present for several days. A treatment has been
prescribed accordingly.
'patient', 'pain', 'left', 'arm', 'issue', 'present',
'several', 'day', 'treatment', 'prescribe', 'accordingly'
Note: For stop words, there is no perfect answer here. Different people/systems have different
interpretations of what constitutes a stop word. And this isn't always based on grammar.
50. Named entity recognition process
There are several different steps to the named entity recognition
process:
Label(s)
Word-
level
processing
Text
51. Second pass: "Bag of words" (Stemming) - problem
We might decide to label land in the text from our terminology.
However this would be incorrect: in the text, land refers to the action
of landing, whereas our terminology refers to a type of vehicle.
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'take', 'water',
'also', 'land', 'water', 'often', 'air',
'adventure', 'board', 'seaplane'
52. Part-of-speech (POS) tagging
Our bag of words approach doesn't allow us to appreciate that
words appear in sentences, and each has a different grammatical
role (e.g. verbs and nouns).
The process of determining the role each word plays in a sentence is
known as part-of-speech (POS) tagging.
It is important to understand whether a word is a noun or a verb, for
example, so we can correctly label entities from our terminology.
54. Aside: Markov models
Under the hood, POS tagging is often supported by something known as
a (Hidden) Markov Model.
Because there are multiple grammatical roles a word can have depending
on the sentence, this approach combines rules (e.g. nouns typically follow
adjectives) and frequency information (e.g. how often one type of word is
followed by another) to assign roles probabilistically.
Rules
Frequency
Markov models were also
looked at in the context of
decision support systems.
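To give a flavour of how this works, here is a toy Hidden Markov Model tagger using the Viterbi algorithm. It is a sketch only: there are just two tags, and every probability below is invented for illustration rather than estimated from real text. The function assumes a non-empty list of words.

```python
TAGS = ["NOUN", "VERB"]

# Invented probabilities: how likely each tag is to start a sentence,
# to follow another tag, and to produce a given word
START = {"NOUN": 0.8, "VERB": 0.2}
TRANSITION = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
              "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMISSION = {"NOUN": {"planes": 0.6, "land": 0.4},
            "VERB": {"planes": 0.1, "land": 0.9}}

def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps a tag to (best probability so far, previous tag)
    best = [{tag: (START[tag] * EMISSION[tag].get(words[0], 0.0), None)
             for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANSITION[p][tag])
            prob = best[-1][prev][0] * TRANSITION[prev][tag] * EMISSION[tag].get(word, 0.0)
            column[tag] = (prob, prev)
        best.append(column)
    # Start from the best final tag and follow the stored links backwards
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["planes", "land"]))  # ['NOUN', 'VERB']
print(viterbi(["land"]))            # ['NOUN']
```

The same word, land, receives a different tag depending on what comes before it.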
55. Third pass: POS tagging
Now we know not to label land in our text (it is a verb, whereas our
terminology uses it as a noun).
'seaplane', 'powered', 'fixed-wing',
'aircraft', 'capable', 'taking', 'water',
'also', 'land', 'water', 'seaplanes',
'divided', 'different', 'categories',
'based', 'technological',
'characteristics'
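The third pass can be sketched by requiring that both the word and its grammatical role match. The tags below are assigned by hand for illustration (a real system would use a POS tagger), and `TERMINOLOGY_POS` is an assumed table of the role each term plays in our terminology.

```python
# Role each term plays in our terminology (assumed: all are nouns)
TERMINOLOGY_POS = {"seaplane": "NOUN", "land": "NOUN", "air": "NOUN"}

# Hand-tagged candidate words from the text (hypothetical tagger output)
TAGGED = [("seaplane", "NOUN"), ("land", "VERB"),
          ("air", "NOUN"), ("seaplane", "NOUN")]

# Label a word only if its role in the text matches its role in the terminology
matches = [word for word, tag in TAGGED if TERMINOLOGY_POS.get(word) == tag]
print(matches)  # ['seaplane', 'air', 'seaplane']
```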
56. Exercise: Use of syntax analysis
Identify any potential sources of syntactic ambiguity in our text.
Patient has a pain in her left arm. This issue has
been present for several days. A treatment has been
prescribed accordingly.
57. Exercise: Use of syntax analysis
Identify any potential sources of syntactic ambiguity in our text.
Patient has a pain in her left arm. This issue has
been present for several days. A treatment has been
prescribed accordingly.
noun: left
the left-hand part, side, or direction.
"turn to the left"
adjective: left
on, towards, or relating to the side of a
human body or of a thing that is to the west
when the person or thing is facing north.
"her left eye"
58. Named entity recognition
There are several different steps to the named entity recognition
process:
Label(s)
Word-
level
processing
Analysis of
syntactic
structures
Text
59. Third pass: POS tagging - problem
The word "air" in our text matches a term in our terminology, and
has a matching grammatical form (noun), but means something
different in the text.
A seaplane is a powered fixed-wing
aircraft capable of taking off from
water. It can also land on water.
There is often an air of adventure
about those who board these vehicles.
60. Final pass: Ontology application
Even with tools like stemming and POS tagging, our named entity
recognition process likely isn't perfect.
It is at this point that the use of ontologies, which provide us with
more semantic context, can come into play to distinguish, for example,
the two different meanings of the word air.
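As a very rough stand-in for ontological reasoning, we can sketch a simple overlap heuristic (in the spirit of the Lesk algorithm): each meaning of a word lists some related words, and we pick the meaning that shares the most words with the surrounding text. The sense inventory below is invented for illustration.

```python
# Hypothetical sense inventory: each meaning of a word and some related words
SENSES = {
    "air": {
        "Air (vehicle category)": {"vehicle", "plane", "fly", "flight", "aircraft"},
        "air (impression or manner)": {"adventure", "mystery", "confidence"},
    },
}

def disambiguate(word, context):
    """Pick the meaning whose related words overlap most with the context."""
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & set(context)))

print(disambiguate("air", ["often", "air", "adventure", "board"]))
# air (impression or manner)
print(disambiguate("air", ["travel", "air", "plane"]))
# Air (vehicle category)
```

Only the second usage should receive a label from our terminology.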
61. Named entity recognition
There are several different steps to the named entity recognition
process:
Text
Word-
level
processing
Analysis of
syntactic
structures
Use of
ontological
knowledge
Label(s)
62. Back to coding...
Hopefully it's clear how this same procedure can be applied to
medical text.
We can automatically identify labels, and in turn this can tell us
something about the state of the patient being described in a
computable way.
Let's look at an example of this process...
64. It all comes back to interventions...
If we can automatically interpret (i.e. apply labels to) a patient's
EHR using natural language processing, which relies on a
terminology and all the concepts that come with it, then the
appearance of specific words can trigger alerts, and in turn inform a
clinician that an intervention is required, or administer an
intervention automatically.
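The final step, from labels to alerts, can be sketched as another lookup. The alert text and the `alerts_for` helper are hypothetical, and `CODE-X` is a made-up placeholder; real alerting rules would be clinically defined.

```python
# Hypothetical rules linking a code found in the EHR to an alert
ALERT_RULES = {
    "44054006": "Type 2 Diabetes recorded: clinician review may be required",
}

def alerts_for(labels):
    """Return the alerts triggered by the labels applied to a patient's record."""
    return [ALERT_RULES[label] for label in labels if label in ALERT_RULES]

print(alerts_for(["44054006", "CODE-X"]))
# ['Type 2 Diabetes recorded: clinician review may be required']
```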
65. Summary
Terminologies are languages that allow us to represent the state of
the world.
Terminologies in a clinical context allow us to attribute codes to
patients based upon things such as observed conditions, and record
these in their EHR.
There is no such thing as a universal clinical terminology, so multiple
terminologies exist, each covering different aspects of the domain.
Natural language processing operates in different stages, and levels
of complexity, to assign labels to text.