Bag of tricks for documents tagging, information extraction & analysis

Bag of Tricks for Documents Tagging, Information Extraction & Analysis
Marianna Petrova, Data Scientist

Agenda
Business need
Paper documents tagging and info extraction
Cataloguer
Hand-written text detector
Fuzzy matching
Demo
Q&A

Business Area
Support for medical insurance claims
Expert services for personal injury attorneys
Medical analysis
Accident reconstruction
Goal
Minimized use of human resources
Fast document processing
Catalogue documents from user’s case
Medical records summary: focus on important
information

All medical records are fully artificial, i.e. generated intentionally for this report
from open-source datasets and abstract medical information.
All personal information appearing in this presentation is fictitious.
No document is related to any specific person, and any resemblance to real
persons is purely coincidental.

PAPER DOCUMENTS TAGGING AND INFO EXTRACTION
Pipeline: png → Tesseract OCR → e-text → NLP processing → Catalogue (tagged doc) + Structured summary
If not accurate: GAN enhancement

CATALOGUER
Medical record
Medical bill
Medical 3
Vehicle 1
Vehicle 2
Vehicle 4
Vehicle 4
Legal 1
Legal 2
…
Legal 9
Tagged documents (train set) → Tesseract OCR → e-text → Text to numbers → matrix → Classifier

Cataloguer
• Unbalanced categories: the medical and legal categories have enough
samples (1000+), but most other categories have only about 100 samples.
• Documents are intrinsically divergent and contradictory
• Sections of different categories overlap
• Different categories are semantically homogeneous
• Alien documents (belonging to none of the categories) still get categorized

Accuracy reported to the customer ~80%
Accuracy paradox
Legal 1 Legal 2
Arghhh!
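The accuracy paradox can be illustrated with a small worked example. This is only a sketch: the class sizes echo the 1000+ versus ~100 imbalance mentioned above, but the labels, the counts, and the "always predict the majority class" classifier are hypothetical and are not the project's data or model.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of documents whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """For each true category, the share of its documents that were found."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical case: 1000 medical documents and 100 vehicle documents.
y_true = ["medical"] * 1000 + ["vehicle"] * 100
# A useless classifier that always answers with the majority category.
y_pred = ["medical"] * 1100

print(round(accuracy(y_true, y_pred), 3))   # overall accuracy looks good
print(recall_per_class(y_true, y_pred))     # but no vehicle document is found
```

Overall accuracy is about 0.91 here although the minority category is never recognized, which is why per-category metrics are worth reporting next to a single accuracy figure.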

HAND-WRITTEN TEXT DETECTOR
Tagged documents (train set) → Neural Network → Handwritten or non-handwritten tag
Data Science exists not by accuracy alone

“The hardest thing of all is to find a black cat
in a dark room, especially if there is no cat.”
Confucius

TEXT
MRI, CT…
MANUAL THERAPY /
CHIROPRACTIC REPORT
QUESTIONNAIRE
BLOOD TEST
POOR-QUALITY TEXT
Text extraction: not for the squeamish

TEXT EXTRACTION: FUZZY MATCHING
Patient’s information
- name
- weight and height
- DOB
- date of injury,
- dates of medical records
Text of medical
record from OCR
Medical records summary
- diagnoses
- medical history
- treatments and procedures
- ICD-10, ICD-9
Doctor’s name
Magic potion of fuzzy matching, NLP, rule-based
approach, and common sense

Fuzzy matching concept
• Looks for words (phrases) in the text that fuzzy-match
the search word, and returns the location of the
match, the threshold, and the matching word
(phrase) itself
• Adaptive threshold for words of different
length. 3- and 4-character words are
matched directly. The longer the word
(phrase), the closer the threshold is to one.
• The text following the found word contains
relevant information
• Used for patient’s info and section search in
medical records; a simplified version is used to
clean up the extracted information.
Examples
DIAGNOSIS = DIAGNOSES, DIANOSIS,
DIAGNOSE but not DIAGNOSTIC
(mpression, 0.92, (0, 10), impression)
(impression, 0.92, (0, 10), impressio)
(surgical history, 0.92, (0, 10), sociai history)
Lessons Learned
• An inappropriate threshold results in
extraction of irrelevant pieces of
information
• Selection of the appropriate search
words is vital (re:, subject, years old, Mr.)
• Spelling errors and the quality of the initial
document are often critical
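The fuzzy matching concept above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: it assumes Python's standard `difflib.SequenceMatcher` as the similarity measure, and the `adaptive_threshold` schedule and the `fuzzy_find` helper are hypothetical names and values. The returned tuples loosely mirror the (search word, threshold, location, matching word) format of the examples.

```python
import re
from difflib import SequenceMatcher

def adaptive_threshold(term: str) -> float:
    """Hypothetical schedule: exact match for short words, stricter for longer ones."""
    n = len(term)
    if n <= 4:
        return 1.0  # 3- and 4-character words are matched directly
    return min(0.95, 0.80 + 0.01 * n)  # assumed values, tune on real data

def fuzzy_find(term: str, text: str):
    """Return (term, threshold, (start, end), match) for every fuzzy hit in text."""
    term_l = term.lower()
    k = len(term_l.split())               # compare against windows of k tokens
    tokens = list(re.finditer(r"\w+", text))
    threshold = adaptive_threshold(term_l)
    hits = []
    for i in range(len(tokens) - k + 1):
        start, end = tokens[i].start(), tokens[i + k - 1].end()
        candidate = text[start:end]
        score = SequenceMatcher(None, term_l, candidate.lower()).ratio()
        if score >= threshold:
            hits.append((term, threshold, (start, end), candidate))
    return hits

ocr_text = "IMPRESSIO: mild concussion. Sociai history: none."
for hit in fuzzy_find("impression", ocr_text):
    print(hit)
```

With this similarity measure the word-level behaviour will differ from the examples on the slide (for instance, which variants of DIAGNOSIS pass), so both the threshold schedule and the measure itself would need tuning against real OCR output.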

… and a heart saddened by the chidings of Bessie, the nurse, and humbled by
the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.
The said Eliza, John, and Georgiana were now …
('PERSON')
('PERSON') ('PERSON') ('PERSON')
('PRODUCT') ('PERSON') ('ORG')
'John Smith is a pleasant 40-year-old man'
('PERSON')
'john smith is a pleasant 40-year-old man'
('???')
Name extraction: where NER fails
spaCy
Stanford CoreNLP was additionally trained for names and medical records
extraction: it worked better, but was not always accurate

• Search words
• NER + exclusion techniques
• Frequency analysis
• Fuzzy match with names database
• Most frequent name and its frequency score
• Name with the highest sum of scores from many
documents is a real name
• For doctor’s name extraction, Stanford NER + direct match
with search words proved to be efficient.
Custom name extraction
[Chart: Frequency of extracted patients' names from case documents. X axis: Cases (1 to 7); Y axis: Name frequency (0 to 450); series: Correct name, Incorrect name. Value labels on the chart: 261, 124, 373, 129, 382, 390, 266.]
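The frequency-based voting described in the bullets can be sketched as follows. This is a simplified illustration under stated assumptions: the name candidates per document are taken as already produced by the search-word and NER steps, and the function names, the normalization (lowercasing), and the relative-frequency score are hypothetical choices rather than the project's actual scoring.

```python
from collections import Counter, defaultdict

def best_name_per_document(candidates):
    """Most frequent candidate in one document and its frequency score."""
    counts = Counter(name.strip().lower() for name in candidates if name.strip())
    if not counts:
        return None
    name, freq = counts.most_common(1)[0]
    return name, freq / sum(counts.values())

def case_patient_name(documents):
    """Sum the per-document scores; the name with the highest total wins."""
    totals = defaultdict(float)
    for candidates in documents:
        best = best_name_per_document(candidates)
        if best is not None:
            totals[best[0]] += best[1]
    return max(totals, key=totals.get) if totals else None

# Hypothetical candidates extracted from three documents of one case.
case_docs = [
    ["John Smith", "John Smith", "Jane Roe"],
    ["john smith"],
    ["Jane Roe", "Jane Roe", "John Smith"],
]
print(case_patient_name(case_docs))
```

Voting across many documents makes the result tolerant of a single document in which a doctor's or relative's name happens to dominate.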

Medical sections extraction
• Search words follow the HL7 Clinical
Document Architecture, extended
• Take the next section after the anchor word, then the
next, and the next, until … something happens
• No assumptions about the location of the anchor
words can be made
• In some cases, there is no way to exclude
irrelevant info other than using exclusion words
(diagnostic studies, compression, etc.)
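A rough sketch of anchor-based section extraction with exclusion words is shown below. It is an assumption-laden illustration: the `ANCHORS` and `EXCLUDED` lists are tiny hypothetical stand-ins for the extended HL7-based vocabulary, `difflib.SequenceMatcher` stands in for the fuzzy matcher, and a section simply runs until the next anchor is found.

```python
from difflib import SequenceMatcher

ANCHORS = ("impression", "diagnosis")      # hypothetical subset of search words
EXCLUDED = ("compression", "diagnostic")   # look-alikes that must not open a section

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def find_sections(text: str, threshold: float = 0.8):
    """Collect the words that follow each fuzzy-matched anchor word."""
    sections = {}
    current = None
    for word in text.split():
        token = word.strip(".,:;()").lower()
        anchor = next((a for a in ANCHORS if similar(a, token) >= threshold), None)
        if anchor is not None and token not in EXCLUDED:
            current = anchor                  # a new section starts here
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(word)    # keep reading until the next anchor
    return {name: " ".join(body) for name, body in sections.items()}

sample = ("Patient seen after compression injury. IMPRESSIO: mild concussion. "
          "Diagnostic studies pending. DIAGNOSES: cervicalgia.")
print(find_sections(sample))
```

Without the exclusion list, "compression" would open an impression section in this sketch, which is the kind of irrelevant hit the last bullet refers to.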

Medical sections refinement
Refinement principles
- split into simple sentences;
- fuzzy matching to filter phrases (according to
domain expert guidelines);
- exclusion of irrelevant statements.
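The three refinement principles can be sketched as a small filter. Everything specific here is hypothetical: the `KEEP` and `DROP` word lists stand in for the domain expert guidelines, the sentence splitter is a naive regular expression, and `difflib.SequenceMatcher` with a 0.85 cut-off stands in for the simplified fuzzy matcher.

```python
import re
from difflib import SequenceMatcher

KEEP = ("fracture", "concussion")      # hypothetical phrases of interest
DROP = ("denies", "family history")    # hypothetical exclusion phrases

def refine(section: str, threshold: float = 0.85):
    """Split into sentences, drop excluded ones, keep those with a fuzzy keyword hit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", section) if s.strip()]
    kept = []
    for sentence in sentences:
        low = sentence.lower()
        if any(phrase in low for phrase in DROP):
            continue  # exclusion of irrelevant statements
        words = re.findall(r"[a-z]+", low)
        if any(SequenceMatcher(None, k, w).ratio() >= threshold
               for k in KEEP for w in words):
            kept.append(sentence)
    return kept

raw = "Patient denies concussion. CT shows fracure of left wrist. Weather was sunny."
print(refine(raw))
```

The misspelled "fracure" still passes the filter, while the negated and the off-topic sentences are removed.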

ICD: critical for insurance company
International Classification of Diseases, a
system of codes with critical information about
epidemiology, managing health, and treating
conditions.
Insurance companies use ICD codes to classify
conditions and determine reimbursement.
Doctors mark diagnoses with ICD codes.
Insurance companies are strict about having a
structured document, with diagnoses coded
appropriately.
ICD-9: ~14,000 codes
ICD-10: ~70,000 codes

ICD-9: 850.9 Concussion
ICD-9: 723.1 Cervicalgia
How we extract ICD
Doctor writes “Cardiopulmonary disease”
Doctors mark diagnoses with ICD codes (reluctantly and flexibly, as they feel that day).
I27.9 - Pulmonary heart disease, unspecified

• Find ICD using regex
• Relevance validation. Check that the string in the extracted text:
- is readable
- is not part of irrelevant medical sections
- contains numeric and alphanumeric symbols
- is not a weight, address, phone number, date, blood test, or name
• Extract the diagnosis formulation from ICD library
• Display ICD and the text in which ICD was found
How we extract ICD
Aim: to find a code and extract diagnosis formulation from ICD database.
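The regex, validation, and lookup steps can be sketched as below. This is a simplified illustration, not the production rules: the pattern only approximates the shape of ICD-10 and decimal ICD-9 codes, `ICD_LIBRARY` is a three-entry stand-in for a real ICD database (using the codes shown on the slides), and the `EXCLUDE_WORDS` context check is a hypothetical, much reduced form of the relevance validation.

```python
import re

# Tiny stand-in for an ICD database; a real one holds tens of thousands of codes.
ICD_LIBRARY = {
    "850.9": "Concussion",
    "723.1": "Cervicalgia",
    "I27.9": "Pulmonary heart disease, unspecified",
}

# Approximate shapes: ICD-10 (letter, digit, alphanumeric, optional decimal part)
# or ICD-9 with a decimal part (three digits, dot, one or two digits).
CODE = re.compile(r"\b(?:[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?|\d{3}\.\d{1,2})\b")

# Hypothetical context words signalling weight, phone number, or date of birth.
EXCLUDE_WORDS = {"lbs", "kg", "phone", "dob"}

def extract_icd(text: str):
    """Return (code, diagnosis formulation, source line) for validated codes."""
    results = []
    for line in text.splitlines():
        if set(re.findall(r"[a-z]+", line.lower())) & EXCLUDE_WORDS:
            continue  # the number on this line is probably not a diagnosis code
        for match in CODE.finditer(line):
            code = match.group()
            if code in ICD_LIBRARY:  # keep only codes known to the library
                results.append((code, ICD_LIBRARY[code], line.strip()))
    return results

record = """Weight: 180.5 lbs
ICD-9: 850.9 Concussion
Dx: I27.9 cardiopulmonary disease
Phone 723.1"""
for item in extract_icd(record):
    print(item)
```

Returning the source line together with the code follows the last bullet: the reviewer sees both the ICD and the text in which it was found.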

6-page document
Artificial Medical Document Fuzzy Extraction Result

15-page document
Artificial Medical Document Fuzzy Extraction Result
ICD
