This document summarises Halodoc's efforts to use machine learning to assess the quality of online doctor consultations. The team began by manually analysing consultations but found limited success. It then collected more consultation data and used both quantitative metrics and qualitative NLP techniques to improve accuracy. By experimenting with different models and features, it achieved up to 70% accuracy for the Subjective category and ~62% for Assessment, a significant improvement over the initial methods. Ongoing work includes collecting more detailed data to build better models.
19. How do we assess the quality of consultations?
20. - Manually look at random consultations by different doctors, collect feedback, and point out the mistakes the doctors are making
- Quantitative: look at the metrics a consultation produces (number of messages sent, length, notes, etc.) and flag obviously poor consultations (0 messages, late response from the doctor, etc.)
- Qualitative: look at how the consultation did on SAPE
- Check whether, in the end, the patient's problem was solved
Taking help of humans to analyse consultations
21. ● SAPE is a modified version of SOAP, which is a global standard for assessing consultations
○ Subjective: retrieve information from patient by asking questions
■ Main symptom (high fever), Additional Symptom (body ache)
○ Objective: vitals, measurements (temperature, BP)
○ Assessment: explain the ailment to the patient and why it happened
■ Differential diagnosis (viral fever different from flu), Possible etiology (viral)
○ Planning: necessary steps for the patient to get better and preventative measures
■ Lifestyle modification (rest, light food), recommendation (paracetamol)
○ Etiquette: politeness and empathy towards the patient (replacement for O)
■ Opening (hello) and closing etiquette (goodbye!)
● O is difficult to capture in an online consultation, so we measure E instead
Consultation metrics based on “SAPE”
26. ● Given an anonymized chat consultation transcript between the doctor & patient,
automate SAPE scoring
● SAPE scores are in the range {0, 0.5, 1.0}
● Goals
○ Actionable feedback to the doctors
○ Improve the quality of consultations on the platform
○ Auto-summarisation of chat consultations (=> doctor notes)
Problem statement
27. ● NLP in Bahasa Indonesia
○ Limited NLP resources
○ NLP with medical terms
○ Translate to English?
● Avoiding bias in the dataset
○ Positive and negative feedback consultations
○ Consultation category
● Training dataset
○ Equally distributed labelled data
○ Oversampling or Undersampling
Tech Challenges
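The over/undersampling idea above can be sketched in a few lines of plain Python — a naive random oversampler, with field names invented for illustration (a production pipeline would more likely use a library such as imbalanced-learn):

```python
import random

def oversample(rows, label_key="label", seed=0):
    """Randomly duplicate minority-class rows until every class
    matches the size of the largest class (naive random oversampling)."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to top the class up to `target` rows
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Example: score 1.0 dominates, 0.5 and 0.0 are rare (as in our dataset)
data = [{"label": 1.0}] * 6 + [{"label": 0.5}] * 2 + [{"label": 0.0}] * 1
balanced = oversample(data)
```

Undersampling is the mirror image: trim every class down to the size of the smallest one instead.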
29. Numeric features based on RED (Responsiveness, Effort and Diligence) scores created to assess doctors
- Responsiveness: acceptance time, first reply time, messages with response time over 1 min
- Effort: doctor-patient message ratio
- Diligence: notes depth, completion time
Created features based on this domain knowledge
Quantitative-Round 1: using downvoted consultation data
- Average response time
- Average length of message
- Number of messages sent
- Duration of consultation
- Who closed the chat
- eRx (e-prescription) issued or not
- Doctor-patient chat ratio
- Number of questions asked (based on a question classifier)
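A minimal sketch of how such quantitative features could be derived from a transcript; the message schema (`sender`, `text`, `ts`) and the "?"-based question heuristic are assumptions for illustration, not Halodoc's actual implementation:

```python
from datetime import datetime

def consultation_features(messages):
    """Compute simple quantitative features from a chat transcript.
    `messages` is a list of dicts: {"sender": "doctor"|"patient",
    "text": str, "ts": datetime} -- field names are illustrative."""
    doctor = [m for m in messages if m["sender"] == "doctor"]
    patient = [m for m in messages if m["sender"] == "patient"]
    duration = (messages[-1]["ts"] - messages[0]["ts"]).total_seconds()
    return {
        "num_messages": len(messages),
        "avg_message_length": sum(len(m["text"]) for m in messages) / len(messages),
        "doctor_patient_ratio": len(doctor) / max(len(patient), 1),
        "duration_seconds": duration,
        # Naive stand-in for the question classifier described above
        "num_questions": sum(1 for m in doctor if "?" in m["text"]),
    }

chat = [
    {"sender": "patient", "text": "I have a fever", "ts": datetime(2023, 1, 1, 10, 0)},
    {"sender": "doctor", "text": "Since when?", "ts": datetime(2023, 1, 1, 10, 1)},
    {"sender": "patient", "text": "Two days", "ts": datetime(2023, 1, 1, 10, 2)},
]
feats = consultation_features(chat)
```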
30. What kind of data did we start with?
● Team of intern doctors manually labelling the data
● A labelled dataset of around 6K consultations
○ Data in Google Sheets
○ RED scores at the doctor level + SAPE scores at the consultation level
● Bias in the data (only thumbs-down consultations were considered)
31. Find relations between quantitative features (RED scores) + SAPE scores (tags)
- Collate all the available data, clean it and remove duplicates
- Each sub-category is given a 0.5 or 0.0 score (available tags)
- Combine sub-category scores to get scores for the category itself (0, 0.5, 1.0)
- Train decision trees and neural networks using quantitative features
- Hyperparameter tuning (adjusting the model variables that can be varied for better output)
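The decision-tree step with hyperparameter tuning might look like the following scikit-learn sketch; the toy feature columns, labels, and parameter grid are all illustrative, not the real training setup:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy quantitative features: [avg_response_time_s, num_messages, doctor_patient_ratio]
X = [[30, 12, 0.9], [600, 2, 0.1], [45, 10, 1.1], [900, 1, 0.0],
     [20, 15, 1.0], [700, 3, 0.2], [35, 9, 0.8], [800, 2, 0.1]]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # SAPE category score per consultation

# Hyperparameter tuning: cross-validated search over tree depth and split size
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_split": [2, 4]},
    cv=2,
)
grid.fit(X, y)
model = grid.best_estimator_
```

The same loop swaps in a neural network (e.g. `MLPClassifier`) by changing the estimator and grid.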
33. Why didn’t it work?
- Too little data (around 7K consultations after cleaning, all downvoted)
- Question classifier (a basic dictionary approach) was not that accurate (~40%)
- We had more data for certain scores (1.0) and less for others (0.0, 0.5)
(imbalanced dataset)
How do you make it better?
- Collect better data
- Feature engineering to find more possible features (such as the relation of patient age to consultation duration)
- Sampling
- Create a better question classifier (needed for Subjective category in SAPE)
34. Data Collection
- Clean and better-structured data
- Input for RED scores
- Sentence-level tagging
- No PII exposed to the intern doctors, as compared to earlier
- Generating reports from the data for doctors
- Average of ~300 consultations being tagged per day (compared to 100-150 per day earlier)
35. Quantitative-Round 2: using better data + sampling + better features
- Sampling (over and under, per category) to fix imbalanced dataset
- Created a better question classifier using a Support Vector Machine + TfIdf (43% more accurate)
- Feature engineering to find more meaningful features
- Age of patient/doctor
- Gender of patient/doctor
- Type of consultation (general, pediatric, OBGYN)
- Re-generate features and fill missing values with averages
- More data
- More enthusiasm!!
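A question classifier along these lines (Support Vector Machine over TfIdf features) can be sketched with scikit-learn; the tiny English training set below is purely illustrative — the real data was in Bahasa Indonesia and far larger:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training sentences; 1 = question, 0 = not a question
sentences = [
    "since when have you had the fever?",
    "do you have any other symptoms?",
    "how old are you?",
    "is the cough dry or wet?",
    "please take paracetamol twice a day",
    "you should rest and drink plenty of water",
    "this looks like a viral infection",
    "thank you for consulting with us",
]
is_question = [1, 1, 1, 1, 0, 0, 0, 0]

# TfIdf over unigrams and bigrams, fed into a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, is_question)
```

The fitted pipeline then labels unseen doctor messages, which feeds the "number of questions asked" feature.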
37. Our models failed because the distribution of our features across the 3 scores was almost identical. The models couldn't find any patterns because there weren't any!
But why didn't you do it earlier? → Because we only had one type of data: downvoted consultations.
Rethink the features!
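One way to spot this problem early is to compare a feature's distribution per score; a minimal sketch, with the data and field names invented for illustration:

```python
from statistics import mean, stdev

def distribution_by_score(rows, feature, label_key="score"):
    """Summarise a feature's (mean, stdev) per SAPE score, to check
    whether the classes are actually separable on that feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row[feature])
    return {
        score: (round(mean(vals), 2), round(stdev(vals), 2) if len(vals) > 1 else 0.0)
        for score, vals in groups.items()
    }

rows = [
    {"score": 0.0, "num_messages": 8}, {"score": 0.0, "num_messages": 9},
    {"score": 0.5, "num_messages": 8}, {"score": 0.5, "num_messages": 10},
    {"score": 1.0, "num_messages": 9}, {"score": 1.0, "num_messages": 8},
]
summary = distribution_by_score(rows, "num_messages")
# Nearly identical means across the three scores => the feature carries no signal
```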
38. Qualitative: finding context in a consultation
What do we have?
- Sentences tagged at a category level per consultation
- Scores for each of the categories for a consultation (0, 0.5, 1 across S, A and P)
Main idea:
- For each of the categories, there will be words and word pairs (n-grams) which occur only in
sentences of that category
- Exploit this for each of the categories
- Predict the score for Assessment (0 or 1) based on the available chat sentences
- Total sentences: ~40K
- Take all the A sentences of a consultation and combine them into one document
- Generate a Tf-Idf vector for each combined consultation document
- Train an SVM to predict the value of A for that consultation as 0 or 1
- ~70% accuracy on test set
Pilot: Assessment classifier
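The pilot's pipeline — combine each consultation's Assessment sentences into one document, vectorise with Tf-Idf, and train an SVM — can be sketched as follows; the sentences, labels, and field names are toy examples:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tagged Assessment sentences: (consultation_id, sentence) -- illustrative data
a_sentences = [
    (1, "this looks like a viral fever"), (1, "it is different from the flu"),
    (2, "ok"),
    (3, "your symptoms point to a viral infection"), (3, "the likely cause is a virus"),
    (4, "noted"),
]
labels = {1: 1, 2: 0, 3: 1, 4: 0}  # Assessment score per consultation (0 or 1)

# Step 1: combine all A sentences of a consultation into one document
docs = defaultdict(list)
for cid, sent in a_sentences:
    docs[cid].append(sent)
ids = sorted(docs)
corpus = [" ".join(docs[cid]) for cid in ids]

# Step 2: Tf-Idf vectors over the combined documents, fed into an SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(corpus, [labels[cid] for cid in ids])
```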
40. Category classifiers
- Create individual category classifiers to classify sentences as S, A or P sentences
- Use the classified sentences per category and feed them to the sub-category classifiers
- Train sub-category classifiers (e.g. main symptom and additional symptom for the Subjective category) to predict scores 0 or 0.5 using the sentences for that category
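The cascade described above can be sketched as a small orchestration function; the classifier interfaces (sklearn-style `predict`) and the stub classifiers are assumptions for illustration, not the production models:

```python
def score_consultation(sentences, category_clf, subcategory_clfs):
    """Two-stage scoring sketch:
    1. route each sentence to a SAPE category (S/A/P) via category_clf;
    2. each sub-category classifier for that category scores the combined
       text 0 or 0.5, and sub-scores sum to the category score (0/0.5/1.0)."""
    routed = {"S": [], "A": [], "P": []}
    for sent, cat in zip(sentences, category_clf.predict(sentences)):
        routed[cat].append(sent)
    scores = {}
    for cat, sents in routed.items():
        if not sents:
            scores[cat] = 0.0
            continue
        doc = " ".join(sents)
        scores[cat] = sum(sub.predict([doc])[0] for sub in subcategory_clfs[cat])
    return scores

class Always:
    """Stub classifier with a fixed prediction (stand-in for trained models)."""
    def __init__(self, value):
        self.value = value
    def predict(self, xs):
        return [self.value] * len(xs)

# Both sub-classifiers for S (main symptom, additional symptom) award 0.5 each
category_clf = Always("S")
subcategory_clfs = {"S": [Always(0.5), Always(0.5)], "A": [], "P": []}
scores = score_consultation(
    ["what is your main symptom?", "any other symptoms?"],
    category_clf, subcategory_clfs)
```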
41. Results?
Category classifier test accuracy
Sub-category classifier test accuracy
The final models were chosen after experimenting with a dozen different kinds of algorithms
42. Prod results
Subjective accuracy of 70% (+25% improvement over quantitative techniques)
Assessment accuracy of ~62% (~45% improvement in accuracy)
Planning accuracy of ~57% (~40% improvement over previous algorithms)
43. ● Quantity and Quality of the dataset
● Avoid bias in dataset
● Metrics to measure the impact
● Setting expectations with the business stakeholders
● Working with uncertainty
Learnings
44. Next steps
- Get more data
- Tag patient-level sentences to get the full context of the consultation
- Create better models using word2vec
- Repeat