When Healthcare Meets
Data Science
Anastasiia Kornilova
http://www.slideshare.net/WebCongress/mars-one-bas-lansdorp
http://www.slideshare.net/WebCongress/mars-one-bas-lansdorp
The Medicine of the Future
http://www.healthbizdecoded.com/2013/05/hies-meeting-the-sustainability-challenge/
http://graphics.wsj.com/infectious-diseases-and-vaccines/
«One or two patient died per week in a
certain smallish town because of the lack
of information flow between the
hospital’s emergency room and the
nearby mental health clinic»
[«Doing Data Science», O’Neil ]
60% of US doctors still use
paper medical records
Let’s create our own EHR standard
Patient
gender
Code
Male 0
Female 1
Patient
gender
Code
Male 1
Female 0
Patient
gender
Code
Male M
Female F
Unknown U
Let’s code gender
Standart A
Standart B
Standart C
x
x
[Image Source]
There 5 key data standards
ICD - diagnostic, billing, world-wide
CPT - procedures, billing, US-specific, classification
LOINC - lab tests and observations, world-wide
NDC - medication, US-specific, classification
SNOMED - medicine
… and a lot of custom standards
Even within one data standard:
ICD-9
174 malignant neoplasm of female breast
174.1 malignant neoplasm of central portion of female breast
ICD-10
C50 malignant neoplasm of breast
C50.1 malignant neoplasm of central portion of breast
C50.111 malignant neoplasm of central portion of right female
breast
C50.111 malignant neoplasm of central portion of left female
breast
You have to be a doctor to handle them
Problem summary
Standart 1
Standart 2
Standart N
medicine expertise
a lot of (expensive) hours
Knowledge
Standarts are changing
Artificial Intelligence Way
Feed a lot of medical texts to
«medical doctor»
Use NLP power
Make it unsupervised
Key idea:
«Semantically similar words occurs in similar
contents» Harris, 1954
«You shall know a word by the company it
keeps», Firth, 1957
«It was the year when Udacity, Coursera and edX, the three
leading MOOC companies, took the education world by storm
and promised a lot» [Huffington Post]
«Many places offer MOOCs, and many more will. But
Coursera, Udacity and edX are the leading
providers.» [NYTimes]
Distributed Vectors
Representation
Two layer neural network
Input: text corpus
Output: set of vectors
Group the vectors of similar
words together in vector
space (detects similarities
matematically)
Predict a word using content
All
you
need
love
is
Resulting vectors
All
you
need
is
love
[0.2, 0.11, 087, 0.9, … , 0.2]
[0.1, 0,98, 01, 0.26, …, 0.82]
[0.7, 0.22, 0.3, 0.1, …, 0.45]
[0.5, 0.21, 0,67, 0.82,…, 0.49]
[0.6, 034, 0.21, 0.45,…, 0.2]
Vectors Relationships
Vectors Relationships
http://nlp.stanford.edu/projects/glove/images/company_ceo.jpg
http://nlp.stanford.edu/projects/glove/images/comparative_superlative.jpg
ICD-9
174 malignant neoplasm of female breast
174.1 malignant neoplasm of central
portion of female breast
ICD-10
C50 malignant neoplasm of breast
C50.1 malignant neoplasm of central portion of
breast
C50.111 malignant neoplasm of central portion
of right female breast
C50.111 malignant neoplasm of central portion
of left female breast
Summary
Links
Efficient Estimation of Word Representation in
Vector Space (Mikolov)
Distributed representation of words and phrases and
their compositionality (Mikolov)
word2vec Parameter Learning Explaining (Rong)
Questions?

When Healthcare Meets Data Science (Anastasiia Kornilova Technology Stream)