MACHINE LEARNING
APPLIED IN HEALTH
WHO AM I?
Javier Samir Rey
Systems engineer
Machine learning engineer - Direktio
Co-organizer meetup Big Data Colombia
jreyro@gmail.com
javier-samir-rey-7104195
github/jasam
“Work on Stuff that Matters”
Tim O'Reilly
Source: United Nations - Sustainable Development Goals
SUSTAINABLE DEVELOPMENT GOALS
3 - GOOD HEALTH AND WELL-BEING
“Ensure healthy lives and promote
well-being for all at all ages.”
● Reproductive, maternal and child
health.
● Communicable, non-communicable
and environmental diseases.
● Health risk reduction and
management.
● Universal health coverage.
NON-COMMUNICABLE AND COMMUNICABLE DISEASES
The incidence of major infectious
diseases: HIV, tuberculosis and
malaria.
Almost half the world’s
population is at risk of malaria.
889,000 people died from
infectious diseases caused largely by
faecal contamination of water.
40 million global deaths were due to NCDs.
48% of deaths were premature.
75% of premature deaths were caused by
cardiovascular disease,
cancer, diabetes and chronic
respiratory disease.
80% of heart disease, stroke and diabetes
can be prevented.
Source: United Nations
CDs NCDs
Noncommunicable diseases (NCDs), also known as chronic diseases, tend
to be of long duration and are the result of a combination of genetic,
physiological, environmental and behavioural factors.
Detection, screening and treatment of NCDs, as well as palliative care, are
key components of the response to NCDs.
An important way to control NCDs is to focus on reducing the risk factors
associated with these diseases. Low-cost solutions exist for
governments and other stakeholders to reduce the common modifiable
risk factors. Monitoring progress and trends of NCDs and their risk is
important for guiding policy and priorities.
NON COMMUNICABLE DISEASES
Decreased quality
of life of the
human being.
IMPACT
In low-resource settings,
health-care costs for NCDs
quickly drain
household resources.
The exorbitant costs of
NCDs, including often
lengthy and expensive
treatment and loss of
breadwinners, force millions
of people into poverty
annually and stifle development.
Hypertension and
Diabetes Mellitus
COLOMBIA - NON COMMUNICABLE DISEASES
major precursors of
- Ischemic cardiovascular disease
- Cerebrovascular events
- End-stage renal disease
- Death
prevalence
- Hypertension: 6.5 %
- Diabetes: 1.9 %
20% of the population
consumes 80% of the
resources.
Source: Cuenta de Alto Costo
SOME REVIEW
Data is quickly emerging as the greatest asset of the
healthcare industry. The trend in our industry is to drive many
decisions supported by data. It is a walk of maturity with the real gold
nuggets coming in Analytics 3.0 and beyond. This will not be solved
with a product or purchased off the shelf. Big Data needs to be
part of the DNA of an organization.
-- Chris Belmont, MBA
Vice President and Chief Information Officer
MD Anderson Cancer Center
“I know that 50% of my
advertising is wasted, I just
don’t know which half.”
WANAMAKER’S QUESTION
The healthcare industry is now awash in data in a way that it has never
been before: biological, gene expression, sensors, DNA sequence, EHRs,
drug and medical data. We have entered a new era in which we can work on
massive datasets, combining them effectively. We can start asking
the important questions, the Wanamaker questions!
The opportunities are huge!
Source: Wikipedia
THE HERO’S JOURNEY
Source: Wikipedia
BIG PLAYERS
BUSINESS UNDERSTANDING DATA VALUE
PYRAMID
Source: datasyndrome
AGILE DATA SCIENCE MANIFESTO
Source: Agile Data Science 2.0
1. Iterate, iterate, iterate: tables, charts, reports, predictions
   - roadmap projects.
2. Ship intermediate output. Even failed experiments have
   output.
3. Prototype experiments over implementing tasks.
4. Integrate the tyrannical opinion of data in product
   management.
AGILE DATA SCIENCE MANIFESTO
Source: Agile Data Science 2.0
5. Climb up and down the data-value pyramid as we work.
6. Discover and pursue the critical path to a killer product.
7. Get meta. Describe the process, not just the end-state.
CRISP-DM METHODOLOGY
Source: Wikipedia
Before defining
your framework
(agile is a
possibility), first
define your
culture and
team.
Cross Industry
Standard
Process for Data
Mining
BUSINESS UNDERSTANDING
It is one of the most
important concepts of
data science!
1. It is vital to understand the problem to be solved and its context.
2. Often recasting the problem and designing a solution is an
   iterative process of discovery.
3. The Business Understanding stage represents a part of the
   craft where the analysts’ creativity plays a large role.
BUSINESS UNDERSTANDING
It is one of the most
important concepts of
data science!
4. The key to great success is creative problem formulation: how
   to cast the business problem as one or more data science
   problems (subproblems).
5. What is the expected value?
6. The team’s help is really important; we are not alone.
BUSINESS UNDERSTANDING - HEALTH
Source: McKinsey & Company
Big data has a higher
potential in 3 ways:
● Precision medicine
● Diagnose diseases
● Optimize clinical
trials
BUSINESS UNDERSTANDING - HEALTH
ACTORS
● Clinicians, domain experts and financial
analysts
● Managers, IT developers, consultants and
vendors
● Policy makers
● Patients and consumers
● Executives and lines-of-business leaders
● Researchers and academia
● Health institutions
● Society
Build your strategy together!
BUSINESS UNDERSTANDING - HEALTH
CHRONIC CONDITIONS CARE MODEL
Source: Cuidado das Condições Crônicas na Atenção Primária à Saúde
Inspired by
the pyramid
of Kaiser
Permanente!
DATA UNDERSTANDING
1. Solving the business problem is the goal.
2. It is important to understand the strengths and
   limitations of the data because rarely is there an exact
   match with the problem.
3. Some data will be available virtually for free while other data
   will require effort to obtain.
4. Cleaning and matching different sources into a single record
   can itself be a complicated analytics problem.
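The record-matching point can be sketched as follows. This is a minimal illustration, assuming two hypothetical sources (primary_care and pharmacy) that share no common ID and are matched on a normalized name plus birth date. All field names and records are invented for the example; real record linkage usually needs fuzzy matching and manual review.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and collapse whitespace so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def match_key(record):
    """Hypothetical matching key: normalized name plus birth date."""
    return (normalize_name(record["name"]), record["birth_date"])


def link_records(*sources):
    """Merge records that share a key into one record per patient.

    Later sources overwrite fields with the same name (including "name").
    """
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(match_key(record), {}).update(record)
    return list(merged.values())


# Invented example records, not real patients.
primary_care = [
    {"name": "María  Pérez", "birth_date": "1970-03-12", "hypertension": True},
    {"name": "Juan Gómez", "birth_date": "1985-11-02", "hypertension": False},
]
pharmacy = [
    {"name": "maria perez", "birth_date": "1970-03-12", "medication": "losartan"},
]

patients = link_records(primary_care, pharmacy)
```

Exact-key matching like this misses typos and changed surnames, which is why the matching step can become an analytics problem of its own.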
DATA UNDERSTANDING
5. Remember all the V’s of data: volume, velocity, variety,
   variability, veracity, visualization and value.
6. Design and build a data engineering team that supports
   your data requirements.
7. Data governance: DAMA (Data Management Association
   International).
DATA UNDERSTANDING - HEALTH
SOURCES FOR DATA IN HEALTHCARE
Healthcare data       Examples
Images                Radiographic images, MRIs, ultrasounds and nuclear
                      imaging
Un-/semi-structured   Clinical narratives, physician notes, Level 2,3 OMICS,
                      summaries, pathology reports
Streaming             Bedside, remote monitors, implants, fitness bands, smart
                      watches and smart phones
Social media          Facebook, Twitter, web forums and communities
Structured data       All claims, EHR, ERP and other information systems
Dark data             Server logs, application error logs, account information,
                      emails and documents
DATA UNDERSTANDING - HEALTH
Source: The Rise of Consumer Health Wearables
DATA UNDERSTANDING - HEALTH
Source: McKinsey & Company
DATA UNDERSTANDING - HEALTH
Source: McKinsey & Company
DATA PREPARATION
1. Analytic technologies can be powerful, but they impose
   certain requirements on the data they use (data table).
2. A typical example of data preparation is converting data
   to tabular format.
3. Feature engineering.
4. Technology is important but this is not the main point.
DATA PREPARATION
5. Defining the variables is one of the main points at which
   human creativity, common sense, and business knowledge
   come into play.
6. Document your time process.
7. Think about process optimization (Big O).
8. Little blocks of processing: plan for scale.
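One way to read "little blocks of processing" is to stream data in fixed-size chunks, so memory stays bounded while the work remains a single O(n) pass. The sketch below assumes that reading. The helper names and the synthetic weight readings are illustrative only.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items without loading everything into memory."""
    iterator = iter(iterable)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def mean_in_blocks(values, size=1000):
    """Single pass over the data: O(n) time, O(size) memory."""
    total = 0.0
    count = 0
    for block in chunks(values, size):
        total += sum(block)
        count += len(block)
    return total / count if count else None


# Synthetic readings produced lazily by a generator (never held in memory at once).
weights = (60 + (i % 40) for i in range(10_000))
average = mean_in_blocks(weights, size=500)
```

The same block-by-block pattern carries over when each block is later handed to a separate worker or cluster node.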
DATA PREPARATION - COMPUTING BOUND
Source: Hadoop in the Enterprise: Architecture
DATA PREPARATION - DATA ENGINEERING
Pair review
Modularize
your project
Create professional
projects - world
class solutions
using: versioning,
standards, right
tools, unit tests.
DATA PREPARATION - TABULAR FORM - THE GOAL
Primary care
Secondary care
Medication
Other data… a lot of
types
ID  age  med  height  weight  BMI   diet
1   15   Y    168     60      21.3  Y
2   20   Y    185     80      23.4  Y
3   65   N    192     90      24.4  N
4   48   N    172     85      28.7  N
5   45   Y    185     79      23.1  N
6   79   N    182     71      21.4  Y
7   22   Y    186     79      22.8  Y
Feature engineering
Data points: this is
the key (N*M)! After a
very expensive
process
Putting data together
is challenging
Data engineering
N features
M observations
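A minimal sketch of how such a table could be assembled: several sources keyed by patient ID are merged into one row per patient, and BMI is derived as an engineered feature. How the columns are split across sources is an assumption made for illustration. The values are the first three rows of the table above.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)


# Assumed source layout (hypothetical); values taken from rows 1-3 of the slide's table.
primary_care = {
    1: {"age": 15, "height": 168, "weight": 60},
    2: {"age": 20, "height": 185, "weight": 80},
    3: {"age": 65, "height": 192, "weight": 90},
}
medication = {1: {"med": "Y"}, 2: {"med": "Y"}, 3: {"med": "N"}}
diet_program = {1: {"diet": "Y"}, 2: {"diet": "Y"}, 3: {"diet": "N"}}


def build_table(*sources):
    """Merge all sources into one row per patient and add the engineered BMI feature."""
    ids = sorted(set().union(*sources))
    rows = []
    for pid in ids:
        row = {"ID": pid}
        for source in sources:
            row.update(source.get(pid, {}))
        row["BMI"] = bmi(row["weight"], row["height"])
        rows.append(row)
    return rows


table = build_table(primary_care, medication, diet_program)
```

Each row is one observation and each key is one feature, which is the tabular shape most modeling tools expect.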
DATA MODELING
The creation of models from data is known as model induction.
Induction is a term from philosophy that refers to generalizing from
specific cases to general rules (or laws, or truths).
Source: Data Science for Business
Generally speaking, a model is a simplified representation of
reality created to serve a purpose.
In data science, a predictive model is a formula for estimating the
unknown value of interest: the target. The formula could be
mathematical, or it could be a logical statement such as a rule. Often it is
a hybrid of the two.
Many Names for the
Same Things!
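The three kinds of predictive model described above (a logical rule, a mathematical formula, and a hybrid) can be sketched in a few lines. The thresholds and coefficients are hypothetical placeholders chosen only to make the code run. They are not clinically validated and do not come from the source.

```python
import math


def rule_model(patient):
    """Logical statement: flag the patient when both conditions hold (hypothetical cut-offs)."""
    return patient["age"] >= 45 and patient["bmi"] >= 25


def formula_model(patient, intercept=-6.0, w_age=0.05, w_bmi=0.12):
    """Mathematical formula: logistic function of a weighted sum (hypothetical weights)."""
    score = intercept + w_age * patient["age"] + w_bmi * patient["bmi"]
    return 1 / (1 + math.exp(-score))


def hybrid_model(patient, threshold=0.5):
    """Hybrid: flag when the rule fires or the formula's estimate passes a threshold."""
    return rule_model(patient) or formula_model(patient) >= threshold


patient = {"age": 48, "bmi": 28.7}
estimate = formula_model(patient)  # estimated probability of the target
flagged = hybrid_model(patient)
```

In each case the target is the unknown value being estimated. Only the form of the model changes.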
DATA MODELING - BEST PRACTICES
1. Ask a specific question. Remember you are solving a
   business problem, not a math problem.
2. Start simple; start with the minimal set of data.
3. Try many algorithms, but remember that data is more
   important than the exact algorithm: improve your features.
4. Treat your data with suspicion; understand its
   idiosyncrasies.
5. Normalize your inputs.
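As an illustration of "normalize your inputs", here is a minimal z-score standardization sketch, one of several possible normalization schemes. It uses the height and weight columns from the tabular example earlier in the deck. The function name is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights = [168, 185, 192, 172, 185, 182, 186]  # cm
weights = [60, 80, 90, 85, 79, 71, 79]         # kg

z_heights = standardize(heights)
z_weights = standardize(weights)
```

After scaling, both features sit on a comparable scale. In practice the mean and standard deviation should be computed on the training data only and reused for new data.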
DATA MODELING - BEST PRACTICES
● Validate your model (validation set and clinical validation)
● Do the benchmark attempt; don’t be afraid to launch your product
without ML
● Set up a feedback loop
● Healthcare doesn’t trust black boxes
● Correlation is not causation
● Monitor ongoing performance
● Don’t be fooled by “accuracy”
● Labeled data
● Use medical support libraries, e.g. PubMed, Cochrane, American
Heart Association, Diabetes UK and so on.
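The warning about "accuracy" and the advice to benchmark without ML can be seen together in a small sketch. It assumes a hypothetical cohort of 1,000 patients with 19 positives, echoing the 1.9% diabetes prevalence cited earlier, and a baseline that always predicts "no disease".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that the model found."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return true_positives / positives if positives else 0.0


# Hypothetical cohort: 19 positives out of 1,000 (1.9% prevalence).
y_true = [1] * 19 + [0] * 981

# Non-ML baseline: always predict the majority class.
baseline_pred = [0] * 1000

baseline_accuracy = accuracy(y_true, baseline_pred)
baseline_recall = recall(y_true, baseline_pred)
```

The baseline reaches 98.1% accuracy while finding none of the sick patients. With rare conditions, recall and precision say more about a model than accuracy does.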
DATA MODELING - BLUEPRINT
Source: scikit-learn
DATA MODELING - TRADE OFF
Source: O'Reilly Strata 2013
DATA MODELING - TECHNOLOGY
DATA MODELING - TOOLS
Reproducible research
is great!
DATA MODELING - END OF THE HERO’S JOURNEY!
DATA MODELING - USE CASE - ELSEVIER
RISK PREDICTIONS: WHICH DISEASES WILL YOU
LIKELY GET WITHIN 4 YEARS?
1600+ models
integrated into the
same
information
system.
Source: Elsevier Medical Graph - slideshare
DATA MODELING - USE CASE - ELSEVIER
Source: Elsevier Medical Graph - slideshare
Physicians want
explanations.
Otherwise they
will not trust
the predictions.
Typical best-in-class
classification methods
(deep learning, random
forest) do not yet deliver
explainable models.
In practice, you
need to save the
users’ processing
time, not add to it.
Visualization is
key.
Building a classification
model using open source
tools is simple. Scaling
input data size is also
manageable. Building
1000+ models is complex.
Open source tools have
failures (as have proprietary
tools). Debugging can be a
nightmare.
Implementing, applying
and maintaining a
security framework to
keep personal health
information secure is a
substantial effort.
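To make the point about explanations concrete, here is a minimal sketch of a transparent additive score that reports each feature's contribution alongside the prediction. This is not the method used in the Elsevier case. It is a simple linear alternative shown for illustration, with hypothetical weights and feature names.

```python
def explain_linear_score(patient, weights, intercept=0.0):
    """Return the score and the per-feature contributions, largest magnitude first."""
    contributions = {name: weights[name] * patient[name] for name in weights}
    score = intercept + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return score, ranked


# Hypothetical weights, not clinically validated.
weights = {"age": 0.05, "bmi": 0.12, "on_medication": -0.8}
patient = {"age": 65, "bmi": 24.4, "on_medication": 0}

score, ranked = explain_linear_score(patient, weights, intercept=-6.0)

# A short, readable explanation a clinician could scan quickly.
explanation = [f"{name}: {value:+.2f}" for name, value in ranked]
```

A ranked list like this is also a natural input for the visualization the slide calls for, since it shows what drove the prediction at a glance.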
THANKS!
