1. MyHarmony is an application developed by the Malaysian Ministry of Health and MIMOS Bhd to generate statistics and clinical quality indicators from free-text clinical documents through natural language processing and coding with SNOMED CT terms.
2. In a case study, MyHarmony was able to accurately generate national cardiovascular disease registry statistics and key performance indicators from over 16,000 anonymized hospital discharge summaries through mapping terms to SNOMED CT codes.
3. The study demonstrated MyHarmony's ability to aggregate related terms through the SNOMED CT hierarchy to provide a more comprehensive analysis compared to simple string matching. This allows automated generation of timely statistics to support health planning and quality improvement.
Generating Clinical Quality Indicators from Clinical Text
MyHarmony: Generating Statistics from Clinical Text for Monitoring Clinical Quality Indicators
Md. Khadzir Sheikh Ahmad1; Mohd Syazrin Mohd Sakri1; 'Ismat Mohd Sulaiman1; Syirahaniza Mohd Salleh1;
Dickson Lukose2; Omar Ismail1; Abd Aziz Latip3; Muhammad Aiman Mazlan3.
1 Health Informatics Centre, Planning Division, Ministry of Health, Malaysia
2 GCS Agile Pty. Ltd.
3 MIMOS Bhd.
Abstract:
The Ministry of Health developed MyHarmony with MIMOS Bhd. as part of the Malaysian Health Data
Warehouse (MyHDW) initiative. MyHDW aims to be a trusted source of truth of comprehensive health
data structured for analysis. MyHarmony supports that aim by enabling data and information to be
mined from unstructured forms such as text, images, and sound. MyHarmony's first deliverable is the
ability to mine clinical texts using Natural Language Processing (NLP) with SNOMED CT as its
knowledge base of clinical terms. MyHarmony is the engine that converts data and information into
computer-processable form by assigning SNOMED CT codes, which can then be analysed
statistically. MyHarmony is able to recognise and harmonise different terms that mean the same thing. It
also takes context into account for more accurate coding, such as negations (no, not known, unknown) and
conditionals (past history, symptoms of, previous). Using SNOMED CT, MyHarmony goes further by
applying a subsumption technique to produce more comprehensive statistical results. This study
presents a use case in which clinical text from anonymized hospital discharge summaries is used to
generate clinical indicators for health managers using MyHarmony. An added benefit to operational
(hospital) staff is the ability to produce such indicators in an efficient and timely manner by reducing
the workload of data collection and submission. MyHarmony could be a new and improved way to provide
important statistical measures for evidence-based health planning, leading to improved healthcare
services and health as a whole.
Keywords:
MyHarmony; MyHDW; Text mining; Quality Indicators; SNOMED CT
1. Introduction:
The Ministry of Health developed MyHarmony with MIMOS Bhd. as part of the Malaysian Health Data
Warehouse (MyHDW) initiative. MyHDW aims to be a trusted source of truth of comprehensive health
data structured for analysis. MyHarmony is the MyHDW application that aims to analyse
semi-structured and unstructured data. Unstructured data can take the form of free text, visual,
audio, and machine-generated data. Such data has no predetermined values and is not stored in an
organized manner that a conventional data warehouse can analyse, so other techniques need to be
applied. MyHarmony aims to address this gap as part of MyHDW.
There were three (3) major deliverables in the conceptual stage. The first was the development
and implementation of a health terminology standard, namely SNOMED CT, to serve as the knowledge
base for MyHarmony. The second was the harmonization of medical terminology to SNOMED CT
terms by way of mapping. The third was the development and implementation of MyHarmony
to show that the application can codify relevant terms in free text using Natural Language Processing
(NLP) techniques. The SNOMED CT codified data can then be analysed to generate information.
2. Methodology:
Development started in 2014 with the Cardiology Refset, which served as the terminology reference
for the MyHarmony engine during the harmonisation/mapping and codification process. Cardiology
Refset (version 1.0) was completed and released in 2014. It is a simple reference set [1] containing
about 600 terms related to the cardiology specialty, including signs and symptoms, diagnoses,
procedures, body structures, medical devices and medications. It was delivered in time to be tested
on the MyHarmony standalone system to generate National Cardiovascular Disease (NCVD) registry
statistics.
The draft Refset and method were presented at successive IHTSDO meetings and Expos in
September 2013, October 2013, and April 2014 to gain feedback from experts in the international
community. The finalized method was presented at the SNOMED CT Expo in October 2014 [2].
The Cardiology Refset was then expanded to include all cardiology-related terms, and Cardiology Refset
v1.1, containing more than 6,000 concepts, was completed in July 2016. First, more than 300,000
SNOMED CT concepts (Fully Specified Names) were extracted and reviewed by PIK by manual
inspection ("eyeballing"). About 12,000 concepts believed to be related to the cardiology specialty
were given to clinicians for review. The clinicians reduced the number of concepts to about 6,000.
Additionally, the Refset included local terms and common abbreviations, which were mapped to
existing concepts.
Next, the team used MyHarmony to generate the analysis. MyHarmony has four main functions:
1. Terminology Management – to allow the user to upload the SNOMED CT International Release
content into MyHarmony, and to upload a SNOMED CT Refset that references the SNOMED CT
International Release.
2. Data Management – to allow the user to upload the data that will be harmonized and codified. This
function also allows the user to view the content of the data.
3. Codification Management – to allow the user to codify the dataset according to the selected
SNOMED CT Refset and to view the codified data for validation purposes.
4. Query Management – to allow the user to explore the data by generating queries using Structured
Query Language (SQL).
The functions are arranged according to the work process. First, the SNOMED CT International
Release, the SNOMED CT Cardiology Refset, and the dataset need to be uploaded. Then, the dataset is
codified and saved. Using Query Management, the codified data can then be explored via data
profiling and query generation.
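
As a rough illustration of the final step, the sketch below runs an SQL query over a tiny in-memory table of codified records using Python's built-in sqlite3 module. The table name `codified_records`, its columns, and the sample rows are assumptions made up for this example; they are not MyHarmony's actual schema or data.

```python
import sqlite3


def count_by_gender(rows, concept_label):
    """Count distinct records per gender for one coded concept.

    `rows` are (record_id, gender, concept) tuples in a hypothetical layout.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE codified_records "
        "(record_id INTEGER, gender TEXT, concept TEXT)"
    )
    conn.executemany("INSERT INTO codified_records VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT gender, COUNT(DISTINCT record_id) "
        "FROM codified_records WHERE concept = ? "
        "GROUP BY gender ORDER BY gender",
        (concept_label,),
    ).fetchall()
    conn.close()
    return result


# Made-up sample rows, for illustration only.
sample_rows = [
    (1, "M", "Ischaemic heart disease"),
    (2, "F", "Ischaemic heart disease"),
    (3, "M", "Ischaemic heart disease"),
    (4, "M", "Heart failure"),
]
print(count_by_gender(sample_rows, "Ischaemic heart disease"))
# prints [('F', 1), ('M', 2)]
```

Counting distinct record IDs rather than rows matters here because one discharge summary may carry several codes.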
For the initial analysis, SNOMED CT International Release version 20160731 was used because the
Cardiology Reference Set was developed using this version of SNOMED CT.
The team received a database from a hospital with a cardiology service, consisting of 16,224
discharge summaries from 2017. The database was then uploaded into MyHarmony. Personally
identifiable information (patient names, IDs, and street addresses) was anonymised prior to
codification and analysis. The output is a codified dataset, which enables information processing and
analysis by machines. Using Query Management, the codified data was then explored via data
profiling and query generation.
3. Result:
The team ran several data profiling queries to ensure that MyHarmony captured the data correctly.
For example, the number of records by month should be the same for Raw data (MyHarmony without
SNOMED CT) and Harmonized data (MyHarmony with SNOMED CT).
Other examples of data profiling queries are the number of records by gender, by specialty, and by
ethnicity.
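
A profiling check of this kind can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary with hypothetical `record_id` and `month` keys; it is not the query MyHarmony itself runs.

```python
from collections import Counter


def records_per_month(records):
    """Count distinct record IDs per month (keys are hypothetical)."""
    seen = set()
    counts = Counter()
    for rec in records:
        key = (rec["record_id"], rec["month"])
        if key not in seen:  # a record may appear once per assigned code
            seen.add(key)
            counts[rec["month"]] += 1
    return counts


def profile_mismatches(raw, harmonized):
    """Return {month: (raw_count, harmonized_count)} where the counts differ."""
    raw_counts = records_per_month(raw)
    harm_counts = records_per_month(harmonized)
    months = sorted(set(raw_counts) | set(harm_counts))
    return {
        m: (raw_counts[m], harm_counts[m])
        for m in months
        if raw_counts[m] != harm_counts[m]
    }


# Made-up sample data: record 3 is missing from the harmonized set.
raw = [
    {"record_id": 1, "month": "2017-01"},
    {"record_id": 2, "month": "2017-01"},
    {"record_id": 3, "month": "2017-02"},
]
harmonized = [
    {"record_id": 1, "month": "2017-01"},
    {"record_id": 1, "month": "2017-01"},  # same record, second code
    {"record_id": 2, "month": "2017-01"},
]
print(profile_mismatches(raw, harmonized))
# prints {'2017-02': (1, 0)}
```

An empty result means the two datasets agree month by month, which is the outcome the profiling step looks for.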
Next, the team developed the queries required by the National Cardiovascular Disease (NCVD) registries
and compared the results with published registry reports. For example, querying the number of
Ischaemic Heart Disease (IHD) cases by gender shows a 1:4 female-to-male ratio, which is similar to
the ratio in the registry reports. Furthermore, the query shows that Harmonized data captures more
results than Raw data, because the SNOMED CT relationship structure captures all the subtypes of IHD
as well as its synonyms and variant spellings. The registry, however, captures only three (3)
diagnoses due to its structured format: ST Elevation Myocardial Infarction, Non-ST Elevation
Myocardial Infarction, and Unstable Angina. This comparison of trends and patterns allowed validation
by the clinicians and helped gain their buy-in for using MyHarmony.
The team also tried to generate more of the queries required by the NCVD registry, but was limited by
the documentation in the discharge summaries. Registry queries require more detailed information that
is often not documented in a discharge summary, such as smoking status and complications
of procedures.
After that, the team was tasked with generating the National Cardiology Key Performance Indicators
(KPIs). MyHarmony was able to generate 7 of the 8 KPIs (KPIs 2 to 8). The first KPI was excluded
because its data are available at the clinic and are not documented in inpatient discharge summaries.
A Health Information Framework (HIF) was developed for the 7 KPIs, detailing the inclusion and
exclusion criteria, the target, the formula, the terms used by MyHarmony, the query, and lastly a
section for additional notes.
Preliminary manual validation of the completeness and accuracy of the codified data showed 90%
precision and 70% recall. The content of the records is complex: it does not follow grammar rules and
contains a large number of short forms, abbreviations, acronyms and analogous terms (e.g., synd, ACS,
CCS IV, NYHA 2). One example record is “2VD with RCA culprit lesion - Ad hoc PCI DES to RCA and LAD”,
which is challenging to codify using approaches based on strict grammar. The revised version of
MyHarmony uses a different approach based on shallow parsing and the consideration of multiple
suitable combinations of words in a sentence. With further iterations and improvements in the mapping,
these challenges were overcome [3].
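
One way to picture the idea of trying several word combinations in a sentence is a greedy longest-match lookup over word n-grams. The sketch below shows only that general idea and is not the algorithm MyHarmony implements; the term table, its plain-English labels (which are not SNOMED CT descriptions or identifiers), and the function name `match_terms` are made up for illustration.

```python
import re

# Hypothetical lookup table: lowercase surface form -> illustrative label.
TERM_TABLE = {
    "2vd": "Two-vessel coronary disease",
    "rca": "Right coronary artery",
    "lad": "Left anterior descending artery",
    "pci": "Percutaneous coronary intervention",
    "des": "Drug-eluting stent",
    "culprit lesion": "Culprit lesion",
}


def match_terms(text, table=TERM_TABLE, max_words=3):
    """Greedy longest-match of word n-grams against a term table."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    matches = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in table:
                matches.append((candidate, table[candidate]))
                i += n
                break
        else:
            i += 1  # no term starts at this word; move on
    return matches


example = "2VD with RCA culprit lesion - Ad hoc PCI DES to RCA and LAD"
for surface, label in match_terms(example):
    print(surface, "->", label)
```

Because the lookup never depends on sentence grammar, telegraphic notes like the example above can still yield matches for each abbreviation.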
From the SNOMED CT codified database, the system was able to produce more accurate results during
analysis. This is because MyHarmony capitalises on the existing SNOMED CT relationship structure
between concepts. For example, when querying “Number of Ischaemic Heart Disease cases per year”,
MyHarmony searches for the code and term for “Ischaemic heart disease”, its synonyms and accepted
abbreviations, and all the subtypes of Ischaemic heart disease, such as all subtypes of “Myocardial
infarction” and “Angina”. MyHarmony aggregates these records, resulting in a more accurate analysis.
Usually, queries in which MyHarmony uses SNOMED CT's relationship structure return more
records. This is because MyHarmony aggregates data not just through string matching, but also
through the IS-A hierarchy in SNOMED CT. For example, querying “Ischemic heart disease” will
gather clinical records with synonymous terms like “Ischaemic heart disease” and “IHD”, as well as
clinical records containing any of the subtypes of “Ischemic heart disease”, such as “Myocardial Infarction”
and “Unstable angina”.
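
The effect of aggregating through the IS-A hierarchy can be sketched in a few lines. The hierarchy below is a simplified, hand-written illustration that uses concept names as keys; it is not an extract from SNOMED CT, and the record data and function names are hypothetical.

```python
from collections import defaultdict

# Simplified (child, parent) IS-A pairs, for illustration only.
IS_A = [
    ("Myocardial infarction", "Ischaemic heart disease"),
    ("Angina", "Ischaemic heart disease"),
    ("Unstable angina", "Angina"),
    ("ST elevation myocardial infarction", "Myocardial infarction"),
]


def descendants_or_self(concept, is_a_pairs):
    """Return the concept plus every subtype reachable through IS-A."""
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)
    found = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def count_records(records, concept, is_a_pairs):
    """Count records coded with the concept or any of its subtypes."""
    wanted = descendants_or_self(concept, is_a_pairs)
    return sum(1 for codes in records.values() if wanted & set(codes))


# Made-up records: record ID -> list of coded concepts.
records = {
    1: ["Unstable angina"],
    2: ["Heart failure"],
    3: ["ST elevation myocardial infarction", "Heart failure"],
    4: ["Ischaemic heart disease"],
}
exact = sum(1 for codes in records.values() if "Ischaemic heart disease" in codes)
print(exact, count_records(records, "Ischaemic heart disease", IS_A))
# prints 1 3
```

In this toy data an exact match finds one record, while the subsumption query finds three, which mirrors why the Harmonized counts tend to be higher.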
Context awareness, such as negation and past events, was also applied. For example, the terms “No chest
pain”, “No known history of diabetes mellitus”, and “Symptoms of heart failure” will not be coded as the
presenting condition. Additionally, a clinical condition that appears in the same sentence as phrases
like “Previous history of”, “Previous admission of”, or “Family history of” will not be coded as the
current condition for the record.
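
A very small cue-based classifier conveys the principle. The cue lists below are taken from the examples in the text, but the whole-sentence matching rule, the category names, and the function name are simplifying assumptions; MyHarmony's actual context handling is not described in enough detail to reproduce.

```python
import re

# Cue phrases drawn from the examples above; grouping is an assumption.
CUES = {
    "negated": ["no", "not known", "unknown"],
    "not_current": ["symptoms of", "past history", "previous", "family history of"],
}


def classify_mention(sentence, condition):
    """Label a condition mention as negated, not_current, current, or absent."""
    text = sentence.lower()
    if condition.lower() not in text:
        return "absent"
    for label, cues in CUES.items():
        for cue in cues:
            # Whole-word match, so "no" does not fire inside "known".
            if re.search(r"\b" + re.escape(cue) + r"\b", text):
                return label
    return "current"


print(classify_mention("No chest pain", "chest pain"))
print(classify_mention("Family history of ischaemic heart disease",
                       "ischaemic heart disease"))
print(classify_mention("Admitted with unstable angina", "unstable angina"))
```

Treating the whole sentence as the scope of a cue is crude: a sentence mentioning two conditions, only one of them negated, would be mislabelled, so a production system would need a narrower scope rule.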
4. Discussion and Conclusion:
After showcasing these abilities to the clinicians, the team agreed that MyHarmony was able to:
(i) Generate more information from free text by utilising the SNOMED CT structure, thus
reducing the effort needed to collect data in a structured manner, such as for registries
and indicator reports;
(ii) Generate new information by retrospectively running new queries on old
discharge summary records, thus reducing the effort and time needed to collect data
prospectively when new questions arise, such as for indicator reports that often
change on a yearly basis;
(iii) Deliver information in a timelier fashion, so that clinicians and health managers
can plan and take action without waiting for a report issued every 1 to 3 years;
(iv) Improve clinicians' documentation once they become aware of MyHarmony's abilities
through roadshows.
Generating indicators for monitoring and evaluation can be a burden even for healthcare facilities
equipped with an EHR. Conventionally, collecting data for indicators requires multiple data entries in
aggregated form, with manual submission to central agencies, and the results are published only on a
yearly basis. Introducing MyHarmony may reduce these burdens. Capturing data from the source in an
automated way, i.e. from free text documented by doctors, would reduce duplication of work and the
resources needed to transcribe the data onto manual forms. Having the data in granular form would
allow more dynamic analysis and help prevent dishonest reporting. The information required, whether
old or new, can be formulated and disseminated back to the clinicians and health managers in a
timelier fashion.
MyHarmony has the potential to expand further in its implementation and technology. However, there are
still challenges to be addressed. Currently, MyHarmony has been developed to mine free text for
cardiology via a back-end approach, and it uses a single version of SNOMED CT International. The team
is still researching the best approach to managing SNOMED CT versions and the associated codified
data, since version changes may make the resulting analysis inconsistent. The team is also seeking
international experience on this matter.
Other challenges include finding a more efficient and effective method for developing SNOMED CT
Refsets. The initial method of referencing the terms required by registries or indicators has been
established. However, the method for expanding the SNOMED CT Refsets to include relevant terms for a
specific clinical specialty or domain needs to be refined further. The eyeballing technique for
searching the entire SNOMED CT content has its strengths and weaknesses: although it is very thorough,
terms can still be missed and it is very time-consuming. Despite these challenges, the journey of
developing MyHarmony and the lessons learnt have allowed the team to refine the methods and processes
for expanding the use of MyHarmony to other clinical specialties.
Analysis of unstructured data is expected to complement analysis of structured data (such as censuses
and registries), with the additional benefit of reduced workload, to provide timelier, trusted, and
dynamic information.
References:
1. SNOMED CT Simple Reference Set:
https://confluence.ihtsdotools.org/display/DOCRFSPG/5.1.+Simple+Reference+Set
2. Mohd Sulaiman I, Sheikh Ahmad MK. SNOMED CT Cardiology Reference Set Development,
Malaysia. Proceedings of the SNOMED CT Implementation Showcase [Internet]. Amsterdam: IHTSDO;
2014. Retrieved from:
http://ihtsdo.org/fileadmin/user_upload/doc/showcase/show14/SnomedCtShowcase2014_Abstract_14058.pdf
3. Abdul Manaf NA, Mohamed K, Lukose D. Harmonizing EHR Databases with SNOMED CT. Proceedings of the
SNOMED CT Implementation Showcase 2014 [Internet]. Amsterdam: IHTSDO; 2014.
https://confluence.ihtsdotools.org/display/FT/SNOMED+CT+Implementation+Showcase+2014