Bag of Tricks for Documents Tagging, Information Extraction & Analysis
Marianna Petrova, Data Scientist
Agenda
Business need
Paper documents tagging and info extraction
Cataloguer
Hand-written text detector
Fuzzy matching
Demo
Q&A
Business Area
Support in medical insurance claim
Expert services for personal injury attorneys
Medical analysis
Accidents reconstruction
Goal
Minimized use of human resources
Fast documents processing
Catalogue documents from user’s case
Medical records summary: focus on important information
All medical records are fully artificial, i.e. generated intentionally for this report
from open-source datasets and abstract medical information.
All personal information appearing in this presentation is fictitious.
No document is related to any concrete person, and any resemblance to real
persons is purely coincidental.
PAPER DOCUMENTS TAGGING AND INFO EXTRACTION
[Diagram] png → Tesseract OCR → e-text → NLP processing → Catalogue (tagged doc) and Structured summary; if not accurate: GAN enhancement
CATALOGUER
Medical record
Medical bill
Medical 3
Vehicle 1
Vehicle 2
Vehicle 4
Vehicle 4
Legal 1
Legal 2
…
Legal 9
[Diagram] Tagged documents (train set) → Tesseract OCR → e-text → Text to numbers → matrix → Classifier
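A minimal sketch of the "text to numbers" and classifier steps, assuming plain bag-of-words counts and a nearest-centroid rule with cosine similarity. The deck does not name the vectorizer or the classifier, so the functions `vectorize`, `train`, and `classify` and the sample texts are hypothetical stand-ins, not the method used in the project.

```python
import math
import re
from collections import Counter, defaultdict

def vectorize(text):
    """Text to numbers: lower-cased word counts (bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(tagged_docs):
    """Sum the word counts of all OCR'd training documents per tag."""
    centroids = defaultdict(Counter)
    for text, tag in tagged_docs:
        centroids[tag].update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Assign the tag whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda tag: cosine(vec, centroids[tag]))

# Hypothetical, artificial training texts standing in for OCR output.
train_set = [
    ("patient presents with neck pain, diagnosis cervical strain", "medical"),
    ("plaintiff filed a complaint with the court", "legal"),
    ("vehicle damage estimate for rear bumper repair", "vehicle"),
]
centroids = train(train_set)
predicted = classify("The court granted the plaintiff's motion", centroids)
print(predicted)
```

Note that this rule always returns some tag, even for a document that belongs to none of the categories.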
Cataloguer
• Unbalanced categories: enough samples (1000+) in the medical and
legal categories, but most other categories have only about
100 samples.
• Intrinsic divergence and contradictions within the documents
• Overlap between sections of different categories
• Semantic homogeneity across different categories
• Alien (out-of-scope) documents are still assigned a category
Accuracy reported to the customer ~80%
Accuracy paradox
Legal 1 Legal 2
Arghhh!
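The accuracy paradox can be illustrated with a small calculation. This is a sketch on made-up counts (the class sizes and the lazy classifier are assumptions, not the project's data): overall accuracy looks high while the small categories are never predicted correctly.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of documents whose predicted tag equals the true tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall per true tag: correct predictions / documents with that tag."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {tag: correct[tag] / totals[tag] for tag in totals}

# Hypothetical unbalanced test set: two large and two small categories.
y_true = (["Medical record"] * 1000 + ["Legal 1"] * 1000
          + ["Vehicle 1"] * 100 + ["Vehicle 2"] * 100)
# A lazy classifier: right on the large categories, and every
# small-category document is filed under "Legal 1".
y_pred = ["Medical record"] * 1000 + ["Legal 1"] * 1200

overall = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
balanced = sum(recall.values()) / len(recall)  # mean of per-class recalls
print(f"accuracy={overall:.3f}, balanced accuracy={balanced:.3f}")
# prints: accuracy=0.909, balanced accuracy=0.500
```

Per-class recall (or its mean, balanced accuracy) is one common way to report results on unbalanced categories alongside the headline accuracy.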
HAND-WRITTEN TEXT DETECTOR
[Diagram] Tagged documents (train set) → Neural Network → Handwritten or Non-handwritten tag
Data Science exists not by accuracy alone
“The hardest thing of all is to find a black cat
in a dark room, especially if there is no cat.”
Confucius
TEXT
MRI, CT…
MANUAL THERAPY /
CHIROPRACTIC REPORT
QUESTIONNAIRE
BLOOD TEST
POOR-QUALITY TEXT
Text extraction: not for the squeamish
TEXT EXTRACTION: FUZZY MATCHING
Patient’s information
- name
- weight and height
- DOB
- date of injury,
- dates of medical records
Text of medical
record from OCR
Medical records summary
- diagnoses
- medical history
- treatments and procedures
- ICD-10, ICD-9
Doctor’s name
Magic potion of fuzzy matching, NLP, rule-based
approach, and common sense
Fuzzy matching concept
• Looks for words (phrases) in the text that fuzzy-match
the search word and returns the location of the
match, the threshold, and the
matching word (phrase) itself
• Adaptive threshold for words of different
length: 3- and 4-character words are
matched directly; the longer the word
(phrase), the closer the threshold is to one.
• The text following the found word contains
the relevant information
• Used to search for patient’s info and sections in
medical records; a simplified version is used to
clean up the extracted information.
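The concept can be sketched with Python's standard `difflib`. This is an illustration under assumptions, not the project's implementation: the similarity measure (`SequenceMatcher.ratio`), the threshold schedule in `adaptive_threshold`, the token-window search, and the order of the returned tuple are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher

def similarity(candidate, search):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, candidate, search).ratio()

def adaptive_threshold(search):
    """Hypothetical schedule: exact match up to 4 characters,
    then a threshold that rises with length and is capped at 0.95."""
    n = len(search)
    if n <= 4:
        return 1.0
    return min(0.95, 0.80 + 0.01 * (n - 5))

def fuzzy_find(text, search):
    """Slide a window of as many tokens as the search phrase has over the
    text; return (matched phrase, score, (start, end), search phrase)."""
    search = search.lower()
    n_words = len(search.split())
    threshold = adaptive_threshold(search)
    tokens = list(re.finditer(r"\S+", text))
    results = []
    for i in range(len(tokens) - n_words + 1):
        start, end = tokens[i].start(), tokens[i + n_words - 1].end()
        # Ignore case and surrounding punctuation when comparing.
        candidate = text[start:end].lower().strip(".,:;()")
        score = similarity(candidate, search)
        if score >= threshold:
            results.append((candidate, round(score, 2), (start, end), search))
    return results

# Artificial OCR output with typical recognition errors.
ocr_text = "Impressio: mild cervical strain. Sociai history: non-smoker. DOB: 01/02/1980"
print(fuzzy_find(ocr_text, "impression"))
print(fuzzy_find(ocr_text, "social history"))
print(fuzzy_find(ocr_text, "dob"))
```

With this particular similarity measure, DIAGNOSTIC scores slightly higher against DIAGNOSIS than DIAGNOSE does, so a threshold alone cannot accept one and reject the other; exclusion words are one way to handle such cases.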
Examples
DIAGNOSIS = DIAGNOSES, DIANOSIS,
DIAGNOSE but not DIAGNOSTIC
(mpression, 0.92, (0, 10), impression)
(impression, 0.92, (0, 10), impressio)
(surgical history, 0.92, (0, 10), sociai history)
Lessons Learned
• Inappropriate threshold results in
extraction of the irrelevant pieces of
information
• Selection of the appropriate search
words is vital (re:, subject, years old, Mr.)
• Spelling errors and quality of initial
document are often critical
… and a heart saddened by the chidings of Bessie, the nurse, and humbled by
the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.
The said Eliza, John, and Georgiana were now …
('PERSON')
('PERSON') ('PERSON') ('PERSON')
('PRODUCT') ('PERSON') ('ORG')
'John Smith is a pleasant 40-year-old man'
('PERSON')
'john smith is a pleasant 40-year-old man'
('???')
Name extraction: where NER fails
SPACY
Stanford CoreNLP was additionally trained for name and medical record
extraction: it worked better, but was not always accurate
• Search words
• NER + exclusion techniques
• Frequency analysis
• Fuzzy match with names database
• Most frequent name and its frequency score
• Name with the highest sum of scores from many
documents is a real name
• For doctor’s name extraction, Stanford NER + direct match
with search words proved to be efficient.
Custom name extraction
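The frequency-voting step from the list above might look like the sketch below. It assumes the candidate names per document have already been produced by the earlier steps (search words, NER with exclusions, fuzzy match against a names database); the function names and the scoring formula (share of mentions within a document) are assumptions for illustration.

```python
from collections import Counter, defaultdict

def most_frequent_name(candidates):
    """Most frequent candidate in one document and its frequency score
    (share of all name mentions in that document)."""
    counts = Counter(name.lower() for name in candidates)
    name, hits = counts.most_common(1)[0]
    return name, hits / len(candidates)

def vote_patient_name(documents):
    """Sum the per-document scores; the name with the highest total wins."""
    totals = defaultdict(float)
    for candidates in documents:
        if not candidates:  # nothing was extracted from this document
            continue
        name, score = most_frequent_name(candidates)
        totals[name] += score
    return max(totals, key=totals.get), dict(totals)

# Fictitious candidates extracted from three documents of one case.
case_documents = [
    ["John Smith", "John Smith", "Jane Doe"],
    ["john smith", "John Smith", "Mary Major"],
    ["Jane Doe"],
]
patient, scores = vote_patient_name(case_documents)
print(patient, scores)
```

Summing scores over many documents lets a name that dominates most records outweigh a doctor or relative who dominates a single one.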
[Chart] Frequency of extracted patients' names from case documents. X-axis: Cases (1 to 7). Y-axis: Name frequency (0 to 450). Series: Correct name, Incorrect name. Data labels: 261, 124, 373, 129, 382, 390, 266.
Medical sections extraction
• Search words according to HL7 Clinical
Document Architecture, extended
• Take the section right after the anchor word, then the next, and the next,
until … something happens
• No assumptions for location of the anchor
words can be made
• In some cases, there is no way to exclude
irrelevant info other than exclusion words
(diagnostic studies, compression, etc.)
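A minimal sketch of anchor-based section extraction with exclusion words, assuming `difflib` similarity and fixed thresholds (0.8 for anchors, 0.9 for exclusion phrases). The helper names, the thresholds, and the rule "a section runs until the next anchor" are assumptions made for illustration; the anchor list here is a tiny stand-in for the HL7-based search words.

```python
import re
from difflib import SequenceMatcher

def find_phrase(text, phrase, threshold):
    """Yield (start, end) of token windows that fuzzy-match the phrase,
    wherever they occur in the text."""
    tokens = list(re.finditer(r"\S+", text))
    n = len(phrase.split())
    for i in range(len(tokens) - n + 1):
        start, end = tokens[i].start(), tokens[i + n - 1].end()
        candidate = text[start:end].lower().strip(".,:;()")
        if SequenceMatcher(None, candidate, phrase).ratio() >= threshold:
            yield start, end

def extract_sections(text, anchors, exclusions, threshold=0.8):
    """Return (anchor, section text) pairs; a section runs from its anchor
    to the next accepted anchor or to the end of the text."""
    excluded = [span for p in exclusions for span in find_phrase(text, p, 0.9)]

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    hits = []
    for anchor in anchors:
        for span in find_phrase(text, anchor, threshold):
            # Drop anchor hits that are really part of an exclusion phrase.
            if not any(overlaps(span, ex) for ex in excluded):
                hits.append((span, anchor))
    hits.sort()
    sections = []
    for idx, ((start, end), anchor) in enumerate(hits):
        stop = hits[idx + 1][0][0] if idx + 1 < len(hits) else len(text)
        sections.append((anchor, text[end:stop].strip(" :\n")))
    return sections

# Artificial record text with an OCR typo ("DIANOSIS").
record = ("DIAGNOSTIC STUDIES: X-ray of the neck. DIANOSIS: cervical strain. "
          "IMPRESSION: stable. Compression of the nerve root was ruled out.")
sections = extract_sections(record,
                            anchors=["diagnosis", "impression"],
                            exclusions=["diagnostic studies", "compression"])
for name, body in sections:
    print(name, "->", body)
```

Without the exclusion list, "DIAGNOSTIC" and "Compression" would both pass the 0.8 threshold here and open spurious sections.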
Medical sections refinement
Refinement principles
- split into simple sentences;
- fuzzy matching to filter phrases (according to
domain expert guidelines);
- exclusion of irrelevant statements.
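The three refinement principles could be sketched as follows. The keyword lists stand in for the domain expert guidelines, which the deck does not spell out, so `keep_words`, `drop_phrases`, the 0.85 threshold, and the naive sentence splitter are all assumptions.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace
    (abbreviations such as "Dr." would need extra handling)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def refine(section, keep_words, drop_phrases, threshold=0.85):
    """Keep sentences that contain a word fuzzy-matching a keep word and
    that do not contain any exclusion phrase."""
    kept = []
    for sentence in split_sentences(section):
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in drop_phrases):
            continue  # exclusion of irrelevant statements
        words = re.findall(r"[a-z]+", lowered)
        if any(SequenceMatcher(None, w, k).ratio() >= threshold
               for w in words for k in keep_words):
            kept.append(sentence)
    return kept

# Artificial section text with an OCR typo ("Fracure").
section = ("Patient denies fever. Fracure of the left wrist was treated "
           "with a cast. Family history is unremarkable.")
refined = refine(section,
                 keep_words=["fracture", "sprain"],
                 drop_phrases=["denies", "family history"])
print(refined)
```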
BUT
ICD: critical for insurance company
International Classification of Diseases, a
system of codes with critical information about
epidemiology, managing health, and treating
conditions.
Insurance companies use ICD codes to classify
conditions and determine reimbursement.
Doctors mark diagnoses with ICD codes.
Insurance companies are strict about having
a structured document, with diagnoses coded
appropriately.
ICD-9: ~14,000 codes
ICD-10: ~70,000 codes
ICD explained
ICD-9: 850.9 Concussion
ICD-9: 723.1 Cervicalgia
How we extract ICD
Doctor writes “Cardiopulmonary disease”
Doctors mark diagnoses with ICD codes (reluctantly and inconsistently, depending on how they feel that day).
I27.9 - Pulmonary heart disease, unspecified
• Find ICD using regex
• Relevance validation. Check the string in extracted text:
- is readable
- is not part of irrelevant medical sections
- contains numeric and alphanumeric symbols
- is not a weight, address, phone number, date, blood test value, or name
• Extract the diagnosis formulation from ICD library
• Display ICD and the text in which ICD was found
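The steps above can be sketched with a regular expression and a lookup table. The patterns are simplified approximations of the ICD-9 and ICD-10 code shapes, `ICD_LIBRARY` is a three-entry stand-in for a real ICD database, and the cue-word check is only one hypothetical form of the relevance validation.

```python
import re

# Tiny stand-in for an ICD library (codes taken from the examples above).
ICD_LIBRARY = {
    "I27.9": "Pulmonary heart disease, unspecified",
    "850.9": "Concussion",
    "723.1": "Cervicalgia",
}

ICD10 = r"[A-TV-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?"  # simplified ICD-10 shape
ICD9 = r"\d{3}\.\d{1,2}"                            # simplified ICD-9 shape
ICD_PATTERN = re.compile(rf"(?<![\w./-])(?:{ICD10}|{ICD9})(?![\w/-]|\.\d)")

# Hypothetical cue words marking numbers that are not diagnoses.
IRRELEVANT_CUES = ("weight", "height", "phone", "dob")

def extract_icd(text, library=ICD_LIBRARY, context=30):
    """Return (code, diagnosis formulation, surrounding text) triples."""
    found = []
    for match in ICD_PATTERN.finditer(text):
        code = match.group()
        before = text[max(0, match.start() - 12):match.start()].lower()
        if any(cue in before for cue in IRRELEVANT_CUES):
            continue  # looks like a weight, phone number, etc.
        if code not in library:
            continue  # not a known code
        lo = max(0, match.start() - context)
        hi = min(len(text), match.end() + context)
        found.append((code, library[code], text[lo:hi].strip()))
    return found

# Artificial text: a weight and a date must not be reported as codes.
note = ("Weight: 180.5 lbs. Seen on 01/02/2019. Dx: cervicalgia 723.1, "
        "concussion (850.9). Cor pulmonale I27.9.")
codes = extract_icd(note)
for code, diagnosis, snippet in codes:
    print(code, "|", diagnosis, "|", snippet)
```

A lookup alone is not enough with a full library, since many plain numbers are also valid codes, which is why the context checks matter.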
How we extract ICD
Aim: to find a code and extract diagnosis formulation from ICD database.
DEMO
6-page document
Artificial Medical Document Fuzzy Extraction Result
15-page document
Artificial Medical Document Fuzzy Extraction Result
ICD
Q & A

EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 

Bag of tricks for documents tagging, information extraction & analysis

  • 1. Bag of Tricks for Documents Tagging, Information Extraction & Analysis Marianna Petrova, Data Scientist
  • 2. Agenda: Business need; Paper documents tagging and info extraction; Cataloguer; Hand-written text detector; Fuzzy matching; Demo; Q&A
  • 3. Business Area: support in medical insurance claims; expert services for personal injury attorneys; medical analysis; accident reconstruction. Goal: minimized use of human resources; fast document processing; catalogue documents from the user’s case; medical records summary focused on important information
  • 4. All medical records are fully artificial, i.e. generated intentionally for this report from open-source datasets and abstract medical information. All personal information appearing in this presentation is fictitious. No document is related to any concrete person, and any resemblance to real persons is purely coincidental.
  • 5. PAPER DOCUMENTS TAGGING AND INFO EXTRACTION [Pipeline diagram; labels: png, Tesseract OCR, e-text, NLP processing, Catalogue, tagged doc, Structured summary; side branch "if not accurate": GAN enhancement]
  • 6. CATALOGUER Categories: Medical record, Medical bill, Medical 3, Vehicle 1, Vehicle 2, Vehicle 4, Vehicle 4, Legal 1, Legal 2, …, Legal 9 [Diagram labels: Tagged documents, train set, Tesseract OCR, e-text, Text to numbers, matrix, Classifier]
  • 7. Cataloguer • Unbalanced categories: enough samples (1000+) in medical and legal categories, but the majority of other categories have only about 100 samples • Intrinsic divergence and contradictoriness of the documents • Intersection between the sections of different categories • Semantic homogeneity in different categories • Alien documents (those belonging to none of the categories) are categorized anyway
  • 8. Accuracy reported to the customer ~80% Accuracy paradox Legal 1 Legal 2 Arghhh!
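The accuracy paradox named on slide 8 follows from the class imbalance listed on slide 7. A minimal sketch with made-up counts (the category sizes and the degenerate classifier below are illustrative assumptions, not figures from the project) shows how overall accuracy can look high while one small category is never recognized:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of documents whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for every category that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so small classes weigh as much as big ones."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced case: two large categories and one small one.
y_true = ["medical"] * 1000 + ["legal"] * 1000 + ["vehicle"] * 100
# Hypothetical classifier that files every vehicle document under "legal".
y_pred = ["medical"] * 1000 + ["legal"] * 1000 + ["legal"] * 100

print(f"accuracy          = {accuracy(y_true, y_pred):.3f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
print(per_class_recall(y_true, y_pred))
```

Reporting per-class recall next to the headline accuracy is one common way to make such a failure visible to the customer.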
  • 9. HAND-WRITTEN TEXT DETECTOR [Diagram labels: Tagged documents, train set, Neural Network, Handwritten or Non-handwritten tag] Data Science exists not by accuracy alone
  • 10.
  • 11. “The hardest thing of all is to find a black cat in a dark room, especially if there is no cat.” Confucius
  • 12. Text extraction: not for the squeamish. TEXT; MRI, CT…; MANUAL THERAPY / CHIROPRACTIC REPORT; QUESTIONNAIRE; BLOOD TEST; POOR-QUALITY TEXT
  • 13. TEXT EXTRACTION: FUZZY MATCHING Input: text of medical record from OCR. Patient’s information: - name - weight and height - DOB - date of injury - dates of medical records. Medical records summary: - diagnoses - medical history - treatments and procedures - ICD-10, ICD-9. Doctor’s name. Magic potion of fuzzy matching, NLP, rule-based approach, and common sense
  • 14. Fuzzy matching concept • Looks for words (phrases) in the text that fuzzy-match the search word and returns the location of the match, the threshold, and the matching word (phrase) itself • Adaptive threshold for words of different length: 3- and 4-character words are matched directly; the longer the word (phrase), the closer the threshold is to one • The text following the found word contains the relevant information • Used to search for patient’s info and sections in medical records; a simplified version is used to purify the extracted information. Examples: DIAGNOSIS = DIAGNOSES, DIANOSIS, DIAGNOSE but not DIAGNOSTIC; (mpression, 0.92, (0, 10), impression); (impression, 0.92, (0, 10), impressio); (surgical history, 0.92, (0, 10), sociai history). Lessons learned • An inappropriate threshold results in extraction of irrelevant pieces of information • Selection of appropriate search words is vital (re:, subject, years old, Mr.) • Spelling errors and the quality of the initial document are often critical
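The concept on slide 14 can be sketched roughly as follows. The slides do not say which similarity measure or threshold schedule was used, so this sketch swaps in `difflib.SequenceMatcher` from the Python standard library and a made-up `adaptive_threshold` formula; both are assumptions, and the scores will not reproduce the numbers on the slide. Here the second element of each returned tuple is the similarity score of the match.

```python
import re
from difflib import SequenceMatcher


def adaptive_threshold(phrase):
    """Hypothetical schedule: short words need an exact match,
    longer phrases get a threshold that grows with length."""
    n = len(phrase)
    if n <= 4:
        return 1.0
    return min(0.95, 0.80 + 0.01 * n)


def fuzzy_find(text, search):
    """Return (search, score, (start, end), matched_text) for every
    window of words in `text` that is similar enough to `search`."""
    search_norm = search.lower()
    n_words = len(search_norm.split())
    threshold = adaptive_threshold(search_norm)
    tokens = list(re.finditer(r"\w+", text))
    hits = []
    for i in range(len(tokens) - n_words + 1):
        start = tokens[i].start()
        end = tokens[i + n_words - 1].end()
        candidate = text[start:end]
        score = SequenceMatcher(None, search_norm, candidate.lower()).ratio()
        if score >= threshold:
            hits.append((search, round(score, 2), (start, end), candidate))
    return hits


print(fuzzy_find("Final DIANOSIS: cervicalgia", "diagnosis"))
print(fuzzy_find("Past Medcal History: none", "medical history"))
print(fuzzy_find("DOB: 01/02/1980 D0B", "DOB"))
```

The returned span marks where the match ends, so the text that follows it (which the slide calls the relevant information) can be read from `text[end:]`.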
  • 15. Name extraction: where NER fails. spaCy: “… and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed. The said Eliza, John, and Georgiana were now …” Tags: ('PERSON') ('PERSON') ('PERSON') ('PERSON') ('PRODUCT') ('PERSON') ('ORG'). 'John Smith is a pleasant 40-year-old man' ('PERSON'); 'john smith is a pleasant 40-year-old man' ('???'). Stanford CoreNLP was additionally trained for names and medical records extraction: it worked better, but was not always accurate
  • 16. Custom name extraction • Search words • NER + exclusion techniques • Frequency analysis • Fuzzy match with names database • Most frequent name and its frequency score • Name with the highest sum of scores from many documents is a real name • For doctor’s name extraction, Stanford NER + direct match with search words proved to be efficient. [Chart: Frequency of extracted patients' names from case documents; x-axis: Cases (1 to 7); y-axis: Name frequency (0 to 450); series: Correct name, Incorrect name; data labels: 261, 124, 373, 129, 382, 390, 266]
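The frequency-analysis step on slide 16 (most frequent name per document, then the name with the highest summed score across a case) might look roughly like this. The slides do not define the frequency score, so this sketch assumes it is the relative frequency of the top name within one document; the function names and sample names are hypothetical.

```python
from collections import Counter, defaultdict


def top_name(candidates):
    """Most frequent candidate name in one document and its score
    (assumed here to be its share of all candidates in that document)."""
    counts = Counter(name.strip().lower() for name in candidates if name.strip())
    if not counts:
        return None
    name, count = counts.most_common(1)[0]
    return name, count / sum(counts.values())


def patient_name(documents):
    """Sum the per-document scores and return the best-scoring name."""
    totals = defaultdict(float)
    for candidates in documents:
        best = top_name(candidates)
        if best is not None:
            name, score = best
            totals[name] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical NER output for three documents of one case.
case_documents = [
    ["John Smith", "John Smith", "Mary Jones"],
    ["Mary Jones", "john smith", "JOHN SMITH"],
    ["Mary Jones"],
]
print(patient_name(case_documents))
```

Summing over documents lets a name that is moderately frequent everywhere outweigh a name that dominates a single document, which is one plausible reading of the rule on the slide.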
  • 17. Medical sections extraction • Search words according to HL7 Clinical Document Architecture, extended • Take the next section after the anchor, and then the next, and the next, until … something happens • No assumptions about the location of the anchor words can be made • In some cases, there is no other way to exclude irrelevant info but to use exclusion words (diagnostic studies, compression, etc.)
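A rough sketch of the anchor-based section extraction described on slide 17, under several assumptions: the anchor list, the fixed 0.8 threshold, and the single-word exclusion set are hypothetical simplifications, and `difflib.SequenceMatcher` stands in for whatever matcher was actually used. It shows why exclusion words are needed once matching is fuzzy: "compression" and "diagnostic" are close enough to "impression" and "diagnosis" to be mistaken for anchors.

```python
import re
from difflib import SequenceMatcher

ANCHORS = ["impression", "diagnosis", "history"]   # hypothetical search words
EXCLUSIONS = {"compression", "diagnostic"}         # look-alikes that must not act as anchors
THRESHOLD = 0.8                                    # hypothetical fixed threshold


def match_anchor(word):
    """Return the anchor that `word` fuzzy-matches, or None."""
    word = word.lower()
    if word in EXCLUSIONS:
        return None
    for anchor in ANCHORS:
        if SequenceMatcher(None, anchor, word).ratio() >= THRESHOLD:
            return anchor
    return None


def extract_sections(text):
    """Cut the text from each anchor up to the next anchor (or the end)."""
    hits = []
    for m in re.finditer(r"[A-Za-z]+", text):
        anchor = match_anchor(m.group())
        if anchor:
            hits.append((m.start(), m.end(), anchor))
    sections = {}
    for i, (start, end, anchor) in enumerate(hits):
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections.setdefault(anchor, []).append(text[end:stop].strip(" :.\n\t"))
    return sections


sample = "IMPRESSIO: cervical strain after compression injury. DIAGNOSES: concussion."
print(extract_sections(sample))
```

Anchors are searched anywhere in the text, so no position in the document is assumed; a section simply runs until the next anchor is found.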
  • 18. Medical sections refinement Refinement principles - split into simple sentences; - fuzzy matching to filter phrases (according to domain expert guidelines); - exclusion of irrelevant statements.
  • 19. BUT
  • 20. ICD: critical for insurance company. International Classification of Diseases: a system of codes with critical information about epidemiology, managing health, and treating conditions. Insurance companies use ICD codes to classify conditions and determine reimbursement. Doctors mark diagnoses with ICD codes. Insurance companies are strict about having a structured document, with diagnoses coded appropriately. ICD-9: ~14,000 codes; ICD-10: ~70,000 codes
  • 22. How we extract ICD. Doctors mark diagnoses with ICD codes (reluctantly, flexibly, as they feel that day). ICD-9: 850.9 Concussion; ICD-9: 723.1 Cervicalgia. Doctor writes “Cardiopulmonary disease”: I27.9 - Pulmonary heart disease, unspecified
  • 23. How we extract ICD. Aim: to find a code and extract the diagnosis formulation from the ICD database. • Find ICD using regex • Relevance validation: check that the string in the extracted text - is readable - is not part of irrelevant medical sections - contains numeric and alphanumeric symbols - is not a weight, address, phone number, date, blood test, or name • Extract the diagnosis formulation from the ICD library • Display the ICD and the text in which the ICD was found
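The steps on slide 23 could be sketched as below. The regular expressions are simplified assumptions (they require a decimal part, which real ICD codes do not always have), the three-entry `ICD_LIBRARY` dictionary is a stand-in for a full ICD database filled with the codes quoted on slide 22, and only a small part of the validation list (measurement units, phone-like and date-like neighbours) is covered.

```python
import re

# Tiny stand-in for the ICD library; a real one holds tens of thousands of codes.
ICD_LIBRARY = {
    "850.9": "Concussion",
    "723.1": "Cervicalgia",
    "I27.9": "Pulmonary heart disease, unspecified",
}

ICD10 = r"[A-Z]\d[0-9A-Z]\.[0-9A-Z]{1,4}"   # simplified ICD-10 shape
ICD9 = r"\d{3}\.\d{1,2}"                     # simplified ICD-9 shape
# Reject candidates glued to other digits, dots, slashes or dashes
# (phone numbers, dates, longer numbers).
CANDIDATE = re.compile(rf"(?<![\w./-])(?:{ICD10}|{ICD9})(?![\w/-])")
UNITS = re.compile(r"^\s*(?:lbs?|kg|cm|in|mg|ml)\b", re.IGNORECASE)


def extract_icd(text):
    """Return (code, diagnosis formulation, line in which the code was found)."""
    results = []
    for m in CANDIDATE.finditer(text):
        code = m.group()
        if UNITS.match(text[m.end():]):
            continue  # looks like a measurement, not a code
        description = ICD_LIBRARY.get(code)
        if description is None:
            continue  # not a known code
        line_start = text.rfind("\n", 0, m.start()) + 1
        line_end = text.find("\n", m.end())
        if line_end == -1:
            line_end = len(text)
        results.append((code, description, text[line_start:line_end].strip()))
    return results


record = (
    "Weight: 180.5 lbs\n"
    "Dx: concussion (850.9), neck pain 723.1.\n"
    "Cardiopulmonary disease I27.9\n"
    "Phone 555.123.4567"
)
for code, description, context in extract_icd(record):
    print(code, "|", description, "|", context)
```

Looking every candidate up in the library doubles as a validation step: a number that merely has the shape of a code but is not in the database is dropped.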
  • 24. DEMO
  • 25. 6-page document Artificial Medical Document Fuzzy Extraction Result
  • 26. 15-page document Artificial Medical Document Fuzzy Extraction Result ICD
  • 27. Q & A

Editor's Notes

  1. Documents without specific format, pre-defined or standardized structure. High-quality PDFs and low-resolution scans and faxes. Tesseract OCR & Apache Lucene spellchecker. Apache cTAKES = clinical Text Analysis and Knowledge Extraction System. Plain text and lxml-structured text. Page-wise processing. Errors in poor-quality documents. OCR limitations are intractable, and they determine the extraction efficiency.
  2. Documents assigned to categories: obtained from customer (domain expert required). Why not embeddings? ML not by the ML alone.
  3. A user can view the pages on which handwritten text is present. The document is analyzed page by page, threshold = 0.95. Detects whether a handwritten paragraph covering about 30% of the page area is present on the page. Signatures, ticks, and passages of a few words are not detected as handwritten records.
  4. cTAKES!