This document summarises Halodoc's efforts to use machine learning to assess the quality of online doctor consultations. The team began by manually analysing consultations but found limited success. It then collected more consultation data and used both quantitative metrics and qualitative NLP techniques to improve accuracy. By experimenting with different models and features, it achieved up to 70% accuracy for the Subjective category and ~62% for Assessment, a significant improvement over the initial methods. Ongoing work includes collecting more detailed data to build better models.
19. How do we assess the quality of consultations?
20. - Manually look at random consultations by different doctors, collect feedback, and point out the mistakes the doctors are making
- Quantitative: look at the metrics a consultation produces (number of messages sent, length, notes, etc.) and flag obviously poor consultations (0 messages, late response from the doctor, etc.)
- Qualitative: look at how the consultation did on SAPE
- Check whether, in the end, the patient's problem was solved
Taking help of humans to analyse consultations
21. ● SAPE is a modified version of SOAP, which is a global standard for assessing consultations
○ Subjective: retrieve information from patient by asking questions
■ Main symptom (high fever), Additional Symptom (body ache)
○ Objective: vitals, measurements (temperature, BP)
○ Assessment: explain the ailment to the patient and why it happened
■ Differential diagnosis (viral fever different from flu), Possible etiology (viral)
○ Planning: necessary steps for the patient to get better and preventative measures
■ Lifestyle modification (rest, light food), recommendation (paracetamol)
○ Etiquette: politeness and empathy towards the patient (replacement for O)
■ Opening (hello) and closing etiquette (goodbye!)
● O is difficult to capture in an online consultation, so we measure E instead
Consultation metrics based on “SAPE”
26. ● Given an anonymized chat consultation transcript between the doctor & patient,
automate SAPE scoring
● SAPE scores are in the range {0, 0.5, 1.0}
● Goals
○ Actionable feedback to the doctors
○ Improve the quality of consultations on the platform
○ Auto-summarisation of chat consultations (=> doctor notes)
Problem statement
27. ● NLP in Bahasa Indonesia
○ Limited NLP resources
○ NLP with medical terms
○ Translate to English?
● Avoiding bias in the dataset
○ Positive and negative feedback consultations
○ Consultation category
● Training dataset
○ Equally distributed labelled data
○ Oversampling or Undersampling
Tech Challenges
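The over/undersampling idea above can be sketched in a few lines of plain Python — a naive random oversampler, with field names invented for illustration (a production pipeline would more likely use a library such as imbalanced-learn):

```python
import random

def oversample(rows, label_key="label", seed=0):
    """Randomly duplicate minority-class rows until every class
    matches the size of the largest class (naive random oversampling)."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to top the class up to `target` rows
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Example: score 1.0 dominates, 0.5 and 0.0 are rare (as in our dataset)
data = [{"label": 1.0}] * 6 + [{"label": 0.5}] * 2 + [{"label": 0.0}] * 1
balanced = oversample(data)
```

Undersampling is the mirror image: trim every class down to the size of the smallest one instead.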
29. Numeric features based on RED (Responsiveness, Effort and Diligence) scores created to assess doctors
- Responsiveness: acceptance time, first reply time, messages with response time over 1 min
- Effort: doctor-patient message ratio
- Diligence: notes depth, completion time
Created features based on this domain knowledge
Quantitative-Round 1: using downvoted consultation data
- Average response time
- Average length of message
- Number of messages sent
- Duration of consultation
- Who closed the chat
- eRx (e-prescription) issued or not
- Doctor-patient chat ratio
- Number of questions asked (based on a question classifier)
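A minimal sketch of how such quantitative features could be derived from a transcript; the message schema (`sender`, `text`, `ts`) and the "?"-based question heuristic are assumptions for illustration, not Halodoc's actual implementation:

```python
from datetime import datetime

def consultation_features(messages):
    """Compute simple quantitative features from a chat transcript.
    `messages` is a list of dicts: {"sender": "doctor"|"patient",
    "text": str, "ts": datetime} -- field names are illustrative."""
    doctor = [m for m in messages if m["sender"] == "doctor"]
    patient = [m for m in messages if m["sender"] == "patient"]
    duration = (messages[-1]["ts"] - messages[0]["ts"]).total_seconds()
    return {
        "num_messages": len(messages),
        "avg_message_length": sum(len(m["text"]) for m in messages) / len(messages),
        "doctor_patient_ratio": len(doctor) / max(len(patient), 1),
        "duration_seconds": duration,
        # Naive stand-in for the question classifier described above
        "num_questions": sum(1 for m in doctor if "?" in m["text"]),
    }

chat = [
    {"sender": "patient", "text": "I have a fever", "ts": datetime(2023, 1, 1, 10, 0)},
    {"sender": "doctor", "text": "Since when?", "ts": datetime(2023, 1, 1, 10, 1)},
    {"sender": "patient", "text": "Two days", "ts": datetime(2023, 1, 1, 10, 2)},
]
feats = consultation_features(chat)
```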
30. What kind of data did we start with?
● Team of intern doctors manually labelling the data
● A labelled dataset of around 6K consultations
○ Data in Google Sheets
○ RED scores at the doctor level + SAPE scores at the consultation level
● Bias in the data (only thumbs-down consultations were considered)
31. Find relations between quantitative features (RED scores) + SAPE scores (tags)
- Collate all the available data, clean it and remove duplicates
- Each sub-category is given a 0.5 or 0.0 score (available tags)
- Combine sub-category scores to get scores for the category itself (0, 0.5, 1.0)
- Train decision trees and neural networks using quantitative features
- Hyperparameter tuning (adjusting the model variables that can be varied for better output)
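The decision-tree step with hyperparameter tuning might look like the following scikit-learn sketch; the toy feature columns, labels, and parameter grid are all illustrative, not the real training setup:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy quantitative features: [avg_response_time_s, num_messages, doctor_patient_ratio]
X = [[30, 12, 0.9], [600, 2, 0.1], [45, 10, 1.1], [900, 1, 0.0],
     [20, 15, 1.0], [700, 3, 0.2], [35, 9, 0.8], [800, 2, 0.1]]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # SAPE category score per consultation

# Hyperparameter tuning: cross-validated search over tree depth and split size
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_split": [2, 4]},
    cv=2,
)
grid.fit(X, y)
model = grid.best_estimator_
```

The same loop swaps in a neural network (e.g. `MLPClassifier`) by changing the estimator and grid.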
33. Why didn’t it work?
- Too little data (around 7K consultations after cleaning, all downvoted)
- Question classifier (a basic dictionary approach) was not that accurate (~40%)
- We had more data for certain scores (1.0) and less for others (0.0, 0.5)
(imbalanced dataset)
How do you make it better?
- Collect better data
- Feature engineering to find more possible features (such as the relation of patient age to consultation duration)
- Sampling
- Create a better question classifier (needed for Subjective category in SAPE)
34. Data Collection
- Clean and better-structured data
- Input for RED scores
- Sentence-level tagging
- No PII exposed to the intern doctors, as compared to earlier
- Generating reports from the data for doctors
- Average of ~300 consultations being tagged per day (compared to 100-150 per day earlier)
35. Quantitative-Round 2: using better data + sampling + better features
- Sampling (over and under, per category) to fix imbalanced dataset
- Created a better question classifier using a Support Vector Machine + TfIdf (43% more accurate)
- Feature engineering to find more meaningful features
- Age of patient/doctor
- Gender of patient/doctor
- Type of consultation (general, pediatric, OBGYN)
- Re-generate features and fill missing values with averages
- More data
- More enthusiasm!!
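A question classifier along these lines (Support Vector Machine over TfIdf features) can be sketched with scikit-learn; the tiny English training set below is purely illustrative — the real data was in Bahasa Indonesia and far larger:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training sentences; 1 = question, 0 = not a question
sentences = [
    "since when have you had the fever?",
    "do you have any other symptoms?",
    "how old are you?",
    "is the cough dry or wet?",
    "please take paracetamol twice a day",
    "you should rest and drink plenty of water",
    "this looks like a viral infection",
    "thank you for consulting with us",
]
is_question = [1, 1, 1, 1, 0, 0, 0, 0]

# TfIdf over unigrams and bigrams, fed into a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, is_question)
```

The fitted pipeline then labels unseen doctor messages, which feeds the "number of questions asked" feature.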
37. Our models failed because the distribution of our features across the 3 scores was almost identical. The models couldn't find any patterns because there weren't any!
But why didn't you do it earlier? → Because we only had one type of data: downvoted consultations.
Rethink the features!
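One way to spot this problem early is to compare a feature's distribution per score; a minimal sketch, with the data and field names invented for illustration:

```python
from statistics import mean, stdev

def distribution_by_score(rows, feature, label_key="score"):
    """Summarise a feature's (mean, stdev) per SAPE score, to check
    whether the classes are actually separable on that feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row[feature])
    return {
        score: (round(mean(vals), 2), round(stdev(vals), 2) if len(vals) > 1 else 0.0)
        for score, vals in groups.items()
    }

rows = [
    {"score": 0.0, "num_messages": 8}, {"score": 0.0, "num_messages": 9},
    {"score": 0.5, "num_messages": 8}, {"score": 0.5, "num_messages": 10},
    {"score": 1.0, "num_messages": 9}, {"score": 1.0, "num_messages": 8},
]
summary = distribution_by_score(rows, "num_messages")
# Nearly identical means across the three scores => the feature carries no signal
```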
38. Qualitative: finding context in a consultation
What do we have?
- Sentences tagged at a category level per consultation
- Scores for each of the categories for a consultation (0, 0.5, 1 across S, A and P)
Main idea:
- For each of the categories, there will be words and word pairs (n-grams) which occur only in
sentences of that category
- Exploit this for each of the categories
- Predict the score for Assessment (0 or 1) based on the available chat sentences
- Total sentences: ~40K
- Take all the A sentences of a consultation and combine them into one document
- Generate a Tf-Idf vector for each combined consultation document
- Train an SVM to predict the value of A for that consultation as 0 or 1
- ~70% accuracy on test set
Pilot: Assessment classifier
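The pilot's pipeline — combine each consultation's Assessment sentences into one document, vectorise with Tf-Idf, and train an SVM — can be sketched as follows; the sentences, labels, and field names are toy examples:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tagged Assessment sentences: (consultation_id, sentence) -- illustrative data
a_sentences = [
    (1, "this looks like a viral fever"), (1, "it is different from the flu"),
    (2, "ok"),
    (3, "your symptoms point to a viral infection"), (3, "the likely cause is a virus"),
    (4, "noted"),
]
labels = {1: 1, 2: 0, 3: 1, 4: 0}  # Assessment score per consultation (0 or 1)

# Step 1: combine all A sentences of a consultation into one document
docs = defaultdict(list)
for cid, sent in a_sentences:
    docs[cid].append(sent)
ids = sorted(docs)
corpus = [" ".join(docs[cid]) for cid in ids]

# Step 2: Tf-Idf vectors over the combined documents, fed into an SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(corpus, [labels[cid] for cid in ids])
```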
40. Category classifiers
- Create individual category classifiers to classify sentences as S, A or P sentences
- Use the classified sentences per category and feed them to the sub-category classifiers
- Train sub-category classifiers (e.g. main symptom and additional symptom for the Subjective category) to predict scores 0 or 0.5 using the sentences for that category
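The cascade described above can be sketched as a small orchestration function; the classifier interfaces (sklearn-style `predict`) and the stub classifiers are assumptions for illustration, not the production models:

```python
def score_consultation(sentences, category_clf, subcategory_clfs):
    """Two-stage scoring sketch:
    1. route each sentence to a SAPE category (S/A/P) via category_clf;
    2. each sub-category classifier for that category scores the combined
       text 0 or 0.5, and sub-scores sum to the category score (0/0.5/1.0)."""
    routed = {"S": [], "A": [], "P": []}
    for sent, cat in zip(sentences, category_clf.predict(sentences)):
        routed[cat].append(sent)
    scores = {}
    for cat, sents in routed.items():
        if not sents:
            scores[cat] = 0.0
            continue
        doc = " ".join(sents)
        scores[cat] = sum(sub.predict([doc])[0] for sub in subcategory_clfs[cat])
    return scores

class Always:
    """Stub classifier with a fixed prediction (stand-in for trained models)."""
    def __init__(self, value):
        self.value = value
    def predict(self, xs):
        return [self.value] * len(xs)

# Both sub-classifiers for S (main symptom, additional symptom) award 0.5 each
category_clf = Always("S")
subcategory_clfs = {"S": [Always(0.5), Always(0.5)], "A": [], "P": []}
scores = score_consultation(
    ["what is your main symptom?", "any other symptoms?"],
    category_clf, subcategory_clfs)
```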
41. Results?
Category classifier test accuracy
Sub-category classifier test accuracy
The final models were chosen after experimenting with a dozen different kinds of algorithms
42. Prod results
Subjective accuracy of 70% (+25% improvement over quantitative techniques)
Assessment accuracy of ~62% (~45% improvement in accuracy)
Planning accuracy of ~57% (~40% improvement over previous algorithms)
43. ● Quantity and Quality of the dataset
● Avoid bias in dataset
● Metrics to measure the impact
● Setting expectations with the business stakeholders
● Working with uncertainty
Learnings
44. Next steps
- Get more data
- Tag patient-level sentences to get the full context of the consultation
- Create better models using word2vec
- Repeat