3. Business Area
Support for medical insurance claims
Expert services for personal injury attorneys
Medical analysis
Accident reconstruction
Goal
Minimized use of human resources
Fast document processing
Cataloguing of documents from the user’s case
Medical records summary: focus on important information
4. All medical records are fully artificial, i.e. generated intentionally for this report
from open-source datasets and abstract medical information.
All personal information appearing in this presentation is fictitious.
No document relates to any real person, and any resemblance to real
persons is purely coincidental.
5. PAPER DOCUMENTS TAGGING AND INFO EXTRACTION
Tagged doc (png) → Tesseract OCR → e-text → NLP processing → Catalogue, Structured summary
If not accurate: GAN enhancement
6. CATALOGUER
Medical record
Medical bill
Medical 3
Vehicle 1
Vehicle 2
Vehicle 4
Vehicle 4
Legal 1
Legal 2
…
Legal 9
Tagged documents (train set) → Tesseract OCR → e-text → Text to numbers → matrix → Classifier
7. Cataloguer
• Unbalanced categories: enough samples (1000+) in the medical and
legal categories, but most other categories have only about
100 samples.
• Intrinsic divergence and contradictions within the documents
• Overlap between the sections of different categories
• Semantic homogeneity across different categories
• Alien (out-of-category) documents still get categorized
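The slides do not say how the category imbalance was handled. One common mitigation, shown here only as an illustration and not as the authors' method, is to weight each class inversely to its frequency when training the classifier. The function name and the sample counts below are assumptions chosen to mirror the 1000+ versus ~100 split described above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class as n_samples / (n_classes * class_count).

    Rare classes get a larger weight, so errors on them cost more
    during training. This is a generic heuristic, not the deck's method.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution resembling the one on the slide.
labels = ["medical"] * 1000 + ["legal"] * 1000 + ["vehicle"] * 100
weights = balanced_class_weights(labels)
```

With these counts a "vehicle" sample weighs ten times as much as a "medical" one, which offsets the tenfold difference in sample size.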
9. HAND-WRITTEN TEXT DETECTOR
Tagged documents (train set) → Neural Network → Handwritten or non-handwritten tag
Data Science exists not by accuracy alone
10.
11. “The hardest thing of all is to find a black cat
in a dark room, especially if there is no cat.”
Confucius
12. TEXT
MRI, CT…
MANUAL THERAPY /
CHIROPRACTIC REPORT
QUESTIONNAIRE
BLOOD TEST
POOR-QUALITY TEXT
Text extraction: not for the squeamish
13. TEXT EXTRACTION: FUZZY MATCHING
Patient’s information
- name
- weight and height
- DOB
- date of injury
- dates of medical records
Text of medical
record from OCR
Medical records summary
- diagnoses
- medical history
- treatments and procedures
- ICD-10, ICD-9
Doctor’s name
Magic potion of fuzzy matching, NLP, rule-based
approach, and common sense
14. Fuzzy matching concept
• Looks for words (phrases) in the text that fuzzy-match
the search word and returns the location of the
match, its score, and the
matching word (phrase) itself
• Adaptive threshold for words of different
length: 3- and 4-character words are
matched exactly. The longer the word
(phrase), the closer the threshold is to one.
• The text following the found word contains
the relevant information
• Used to search for patient’s info and sections in
medical records; a simplified version is used to
clean the extracted information.
Examples
DIAGNOSIS = DIAGNOSES, DIANOSIS,
DIAGNOSE but not DIAGNOSTIC
(mpression, 0.92, (0, 10), impression)
(impression, 0.92, (0, 10), impressio)
(surgical history, 0.92, (0, 10), sociai history)
Lessons Learned
• An inappropriate threshold results in
extraction of irrelevant pieces of
information
• Selection of the appropriate search
words is vital (re:, subject, years old, Mr.)
• Spelling errors and the quality of the original
document are often critical
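The fuzzy-matching concept on this slide can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation: the similarity measure (`difflib.SequenceMatcher`), the names `adaptive_threshold` and `fuzzy_find`, and the threshold formula are all assumptions. Only the two rules from the slide are kept: words of up to 4 characters must match exactly, and the threshold moves toward one as the search phrase gets longer.

```python
import re
from difflib import SequenceMatcher


def adaptive_threshold(phrase):
    """Hypothetical threshold rule: exact match for short words,
    then a threshold that grows with phrase length (capped at 0.95)."""
    n = len(phrase)
    if n <= 4:
        return 1.0
    return min(0.95, 0.80 + 0.007 * n)


def fuzzy_find(text, phrase):
    """Return (matched text, score, (start, end)) for every window of
    words in `text` that fuzzy-matches `phrase`."""
    phrase = phrase.lower()
    width = len(phrase.split())
    tokens = list(re.finditer(r"\S+", text))
    threshold = adaptive_threshold(phrase)
    hits = []
    for i in range(len(tokens) - width + 1):
        start = tokens[i].start()
        end = tokens[i + width - 1].end()
        # Compare case-insensitively and ignore surrounding punctuation.
        candidate = text[start:end].lower().strip(".,:;()")
        score = SequenceMatcher(None, phrase, candidate).ratio()
        if score >= threshold:
            hits.append((text[start:end], round(score, 2), (start, end)))
    return hits


sample = "DIAGNOSES: neck pain. DIANOSIS confirmed by DIAGNOSTIC imaging."
for hit in fuzzy_find(sample, "diagnosis"):
    print(hit)
```

The text that follows each returned span is where the relevant information would be read from. Note that a plain similarity threshold does not reproduce every example on the slide, which is one reason exclusion words are needed in practice.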
15. … and a heart saddened by the chidings of Bessie, the nurse, and humbled by
the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.
The said Eliza, John, and Georgiana were now …
('PERSON')
('PERSON') ('PERSON') ('PERSON')
('PRODUCT') ('PERSON') ('ORG')
'John Smith is a pleasant 40-year-old man'
('PERSON')
'john smith is a pleasant 40-year-old man'
('???')
Name extraction: where NER fails
SPACY
Stanford CoreNLP was additionally trained for name and medical record
extraction: it worked better, but was not always accurate
16. Custom name extraction
• Search words
• NER + exclusion techniques
• Frequency analysis
• Fuzzy match with names database
• Most frequent name and its frequency score
• The name with the highest sum of scores across many
documents is taken as the real name
• For doctor’s name extraction, Stanford NER + direct match
with search words proved to be efficient.
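The frequency-analysis step of the custom name extraction can be sketched as follows. This is an illustration under assumptions: candidate names per document are taken as already extracted (by search words, NER, or the names database), the frequency score is assumed to be the share of mentions within one document, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(name):
    """Lower-case and collapse whitespace so spelling variants in case
    and spacing count as the same name."""
    return " ".join(name.lower().split())


def top_name_per_document(names):
    """Return (most frequent name, its share of all mentions) or None."""
    counts = Counter(normalize(n) for n in names)
    if not counts:
        return None
    name, count = counts.most_common(1)[0]
    return name, count / sum(counts.values())


def pick_patient_name(documents):
    """Sum the per-document scores and return the best (name, total)."""
    totals = defaultdict(float)
    for names in documents:
        top = top_name_per_document(names)
        if top is not None:
            totals[top[0]] += top[1]
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


# Hypothetical candidate names extracted from three documents of one case.
case_documents = [
    ["John Smith", "John Smith", "Mary Jones"],
    ["JOHN SMITH", "John Smith", "John  Smith", "Mary Jones"],
    ["Mary Jones", "Mary Jones", "John Smith"],
]
best_name, best_score = pick_patient_name(case_documents)
```

A name that wins in only one document (here "mary jones") is outvoted by the name that wins across most documents of the case.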
[Chart: Frequency of extracted patients' names from case documents. X axis: Cases (1 to 7); Y axis: Name frequency (0 to 450); series: Correct name, Incorrect name. Data labels: 261, 124, 373, 129, 382, 390, 266]
17. Medical sections extraction
• Search words according to HL7 Clinical
Document Architecture, extended
• Take the section that follows the anchor word, then the next, and the next,
until … something happens
• No assumptions can be made about the location of the anchor
words
• In some cases, the only way to exclude
irrelevant info is to use exclusion words
(diagnostic studies, compression, etc.)
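The role of exclusion words in section extraction can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the anchor list, the 0.85 threshold, the function names, and in particular the stopping rule (a section ends at the next anchor or at the end of the text), which the slide leaves open. Anchors are searched anywhere in the text, not only at line starts.

```python
import re
from difflib import SequenceMatcher

ANCHORS = ("impression", "diagnosis", "medical history")  # hypothetical
EXCLUSIONS = ("compression", "diagnostic studies")  # from the slide
THRESHOLD = 0.85  # hypothetical


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def find_anchors(text):
    """Return (start, end, anchor) for every fuzzy anchor hit that does
    not begin with an exclusion word."""
    lowered = text.lower()
    tokens = list(re.finditer(r"[A-Za-z]+", text))
    hits = []
    for i, token in enumerate(tokens):
        if lowered.startswith(EXCLUSIONS, token.start()):
            continue  # e.g. "compression" would fuzzy-match "impression"
        for anchor in ANCHORS:
            window = tokens[i:i + len(anchor.split())]
            if len(window) < len(anchor.split()):
                continue
            candidate = " ".join(t.group().lower() for t in window)
            if similarity(anchor, candidate) >= THRESHOLD:
                hits.append((window[0].start(), window[-1].end(), anchor))
    return hits


def extract_sections(text):
    """Map each anchor to the text between it and the next anchor."""
    hits = sorted(find_anchors(text))
    boundaries = [h[0] for h in hits[1:]] + [len(text)]
    sections = {}
    for (start, end, anchor), stop in zip(hits, boundaries):
        sections.setdefault(anchor, []).append(text[end:stop].strip(" :\n\t-"))
    return sections


record = (
    "IMPRESSlON: Cervical strain.\n"
    "Compression fracture ruled out.\n"
    "DIAGNOSTIC STUDIES reviewed.\n"
    "DIAGNOSIS: Concussion."
)
sections = extract_sections(record)
```

Without the exclusion list, "Compression" would pass the threshold as a misspelled "impression" and split the first section in the wrong place.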
18. Medical sections refinement
Refinement principles
- split into simple sentences;
- fuzzy matching to filter phrases (according to
domain expert guidelines);
- exclusion of irrelevant statements.
20. ICD: critical for insurance company
International Classification of Diseases, a
system of codes with critical information about
epidemiology, managing health, and treating
conditions.
Insurance companies use ICD codes to classify
conditions and determine reimbursement.
Doctors mark diagnoses with ICD codes.
Insurance companies are strict about having a
structured document with diagnoses coded
appropriately.
ICD-9: ~14,000 codes
ICD-10: ~70,000 codes
22. How we extract ICD
ICD-9: 850.9 Concussion
ICD-9: 723.1 Cervicalgia
Doctors mark diagnoses with ICD codes (reluctantly, flexibly, as they feel that day).
Doctor writes “Cardiopulmonary disease”
I27.9 - Pulmonary heart disease, unspecified
23. How we extract ICD
Aim: to find a code and extract diagnosis formulation from ICD database.
• Find ICD using regex
• Relevance validation. Check that the string in the extracted text:
- is readable
- is not part of irrelevant medical sections
- contains numeric and alphanumeric symbols
- is not a weight, address, phone number, date, blood test, name
• Extract the diagnosis formulation from the ICD library
• Display the ICD code and the text in which it was found
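The regex-plus-validation approach to ICD extraction can be sketched as below. This is a simplified illustration, not the authors' code: the patterns only approximate the ICD-9 and ICD-10 code shapes, the three-entry `ICD_LIBRARY` stands in for a real code database and contains only the codes named in these slides, and the validation covers just two of the listed checks (not a weight, and present in the library).

```python
import re

# Stand-in for a real ICD database; entries taken from the slides.
ICD_LIBRARY = {
    "850.9": "Concussion",
    "723.1": "Cervicalgia",
    "I27.9": "Pulmonary heart disease, unspecified",
}

# Loose approximations of the code shapes (assumptions, not a standard).
ICD10_RE = r"[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?"
ICD9_RE = r"[0-9]{3}\.[0-9]{1,2}"
CANDIDATE_RE = re.compile(
    rf"(?<![\w./-])(?:{ICD10_RE}|{ICD9_RE})(?![\w/-]|\.\d)"
)
# Candidates followed by a unit are measurements, not codes.
UNIT_RE = re.compile(r"\s*(?:lbs?|kg|cm|mg)\b", re.IGNORECASE)


def extract_icd(text, context_chars=30):
    """Return (code, diagnosis formulation, surrounding text) tuples."""
    results = []
    for match in CANDIDATE_RE.finditer(text):
        code = match.group()
        if UNIT_RE.match(text, match.end()):
            continue
        description = ICD_LIBRARY.get(code)
        if description is None:
            continue  # not a known code: date fragment, lab value, etc.
        start = max(0, match.start() - context_chars)
        context = text[start:match.end() + context_chars]
        results.append((code, description, context))
    return results


note = (
    "Dx: 850.9 concussion; neck pain 723.1. Weight 180.5 lbs. "
    "Phone 555-123.4567. Cardiopulmonary disease I27.9"
)
found = extract_icd(note)
```

Looking each candidate up in the library doubles as a validation step: strings that merely look like codes are dropped when no formulation exists for them.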
Documents without specific format, pre-defined or standardized structure
High-quality pdfs and low-resolution scans and faxes
Tesseract OCR & Apache Lucene spellchecker
Apache cTAKES = clinical Text Analysis and Knowledge Extraction System
Plain text and lxml-structured text
Page-wise processing
Errors in poor-quality documents
OCR limitations cannot be overcome, and they determine the extraction efficiency
Documents assigned to categories:
obtained from customer (domain expert required)
Why not embeddings?
ML not by ML alone
A user can view the pages on which handwritten text is present
Analyze the document page by page, threshold = 0.95
Detects whether a handwritten paragraph covering about 30% of the page area is present on the page.
Signatures, ticks, and few-word passages are not detected as handwritten records
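The page-wise use of the 0.95 threshold can be sketched as a simple filter. The per-page probabilities are assumed to come from the handwritten-text detector; the function name and the choice of "greater than or equal" at the boundary are assumptions.

```python
HANDWRITTEN_THRESHOLD = 0.95  # value stated in the notes above


def handwritten_pages(page_probabilities, threshold=HANDWRITTEN_THRESHOLD):
    """Return 1-based numbers of pages whose handwritten probability
    reaches the threshold, so a user can jump straight to them."""
    return [
        page_number
        for page_number, probability in enumerate(page_probabilities, start=1)
        if probability >= threshold
    ]


# Hypothetical detector outputs for a four-page document.
pages = handwritten_pages([0.10, 0.97, 0.949, 0.95])
```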