Big Data & Machine Learning for
Clinical Data
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
 Biomedical science is now data
science
 I was a biochemist, immunologist,
and then an infectious disease
bioinformatician
 I’m now a “biomedical data
scientist”
 I will be a Health Informatics
Director at AstraZeneca
About me & these lectures
WikiMedia Commons
 We increasingly use & need:
 Lots of complex data
 Real world evidence (outside RCTs)
 Computational tools
 Statistical analysis
 Complex interactions
 Precision medicine: prediction &
(sub)typing
 Also:
 Cheap
 Successful in other domains
 But lots of hype and jargon
Biomedical science is now data science
WikiMedia Commons
 The world is increasingly
“datafied” – we make more and
bigger datasets
 Devices
 Routine collection
 Aggregation & integration
 Big Data is “too big” for
conventional approaches
Part 1: Big Data
WikiMedia Commons
 “Quantity has a quality of its
own”
 Often free
 Real
 Rich, deep, interactions
 Needed for ML and other
assumption-light approaches
Why Big Data?
By Ender005 - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=49888192
 Many diseases with the same clinical presentation have different
molecular phenotypes
 Several overlapping terms
 stratified: separate patients into groups for treatment
 precision:
 tailor treatment to individual
 improved targeted therapies with fewer side effects
 “Right medication, right dose, right patient, right time, right route”
 Also personalised, P4 …
 E.g. asthma
Why Big Data? Precision medicine
 Volume
 Velocity
 Variety
 Veracity
 Value
The 3 / 4 / 5 Vs of Big Data
By MuhammadAbuHijleh - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=46431834
 Limits labile to technological
progress
 Memory
 Compute
 Data schema
 Solutions: distributed & parallel
computation, new high-end
databases
The problem with volume: tools & platforms
WikiMedia Commons
 Multiple hypothesis testing
and false discovery
 Bias: a sample is not the
population
 The Past is not the Present
 Observation without
understanding
 The curse of dimensionality
 Privacy
 Some ML-specific issues
The problem with volume: methodology
From KDNuggets
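A minimal sketch of one standard guard against false discovery when many hypotheses are tested at once: the Benjamini-Hochberg procedure. The p-values below are made up for illustration and the function name is hypothetical, not taken from the lecture.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the max_k smallest p-values
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Illustrative p-values from ten hypothetical tests
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [x <= 0.05 for x in p]           # uncorrected threshold
fdr = benjamini_hochberg(p, alpha=0.05)  # FDR-controlled
print(sum(naive), "naive hits;", sum(fdr), "after correction")
```

With these numbers the uncorrected threshold flags five tests, the corrected procedure only two.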
 Many, many types of data
 How do we use multiple types?
 Which type do we use?
 Disease is systemic
 Interactions
 Evidence
 Solutions: integrated analysis,
independent analysis with
validation
The problem with variety
Wu, Sanin, Wang (2016) Clinical Applications and Systems
Biomedicine
 Much biodata is uncertain
 Noise
 Mistakes
 People lie
 A sample is not a population
 Incompatible systems
 Most analyses are not reproducible
 Solutions: imputation, standards,
cross-validation etc.
The problem with veracity
By Khaydock - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25102900
 How do we
 Re-use data
 Compare data
 Store data from multiple sources
 Even know what data is
 FAIR, OHDSI / OMOP, HPO
 Even just metadata helps for
cataloguing
 But: multiple & incomplete
standards, translation, complexity
Solution: Standards & ontologies
WikiMedia Commons
 Much data cannot leave its
home institution
 Hospitals
 Registries
 Insurance companies
 Governance is hard & slow
 So take the analysis to the data
 Data looks the same but may
be internally different
Solution: Federated analysis
International Collaboration for Autism Registry Epidemiology
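A toy sketch of "taking the analysis to the data", assuming every site records the same measurement in the same units (which, as noted above, cannot be taken for granted). Each institution computes only aggregate summaries locally and a coordinator pools them. The site names, values and function names are hypothetical.

```python
def local_summary(values):
    """Runs inside each institution: only these aggregates leave the site."""
    return {
        "n": len(values),
        "total": sum(values),
        "total_sq": sum(v * v for v in values),
    }

def pooled_mean_variance(summaries):
    """Runs at the coordinator: combines summaries without seeing any record."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    total_sq = sum(s["total_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2  # population variance
    return mean, variance

# Made-up measurements held at three separate sites
hospital = [5.0, 6.0, 7.0]
registry = [4.0, 8.0]
insurer = [6.0]

summaries = [local_summary(site) for site in (hospital, registry, insurer)]
mean, variance = pooled_mean_variance(summaries)
print(mean, variance)
```

The pooled result matches what a single analysis over all six records would give, but no individual record is transferred.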
 In a vast sea of biodata, how do you
discover anything? How do you avoid
cherry-picking?
 Solutions:
 Distinguish discovery from
exploration
 Non-parametric methods (e.g.
machine learning)
 Some problems don’t have a single
solution but many (e.g. prediction)
The problem with it all: discoverability
EnterpriseKnowledge.com
 Write analyses as recipes
 Snakemake, Nextflow, Flowr
 Use recreatable computational
systems
 Docker
 “Your biggest collaborator is
you, six months ago”
 But: it’s work
Solution: Reproducibility
From RevolutionR
 Big Data is “too big” for current conventional tools & practices
 But it’s ideal for solving many biomedical problems
 There are problems with valid discovery and just handling the data
 Standards, distributed databases and analysis and
Summary: Big Data
 “a field of Artificial Intelligence”
 “(the science of) getting computers to learn and act like humans do”
 “getting computers to act without being explicitly programmed”
 “computer systems that automatically improve with experience”
 “neural networks”
 “using statistical techniques to give computer systems the ability to
learn”
Part 2: Machine Learning
In practice:
 broadly-defined set of
algorithms that recognise &
generalise patterns in data
 “non-parametric” or
assumption-light
 may require training over
initial dataset
What is Machine Learning?
By Chire - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=11711077
 Enough data
 Enough compute
 Technical progress
 Need 'good enough'
solutions
 Prediction & forecasting
 Categorization
 Pattern recognition
 Early, startling success
Why now?
Ray Kurzweil The Singularity is Near
How is ML different to stats?
                Statistics       Machine learning
Assumptions     strong           weak
Data            small            large
Optimize by     fitting          training
Solutions       “the best”       “good enough”
Hypothesis      proof            exploration
Test            p-values etc.    validation
In practice:
 a field of scientific research
 machine learning
 neural networks
 deep learning
 more of an objective than a methodology
 computational systems that duplicate / emulate / replace human effort
What is Artificial Intelligence?
• Many methods
• Broadly split into:
• Unsupervised: finds structure within data
• e.g. (most) clustering, self-organised maps, principal component
analysis
• Supervised: trained using labelled examples
• e.g. regression, decision trees, naive bayes, neural networks
• Categories can blur
• e.g. k-means, nearest neighbour?
• Which is better?
What are ML methods?
• (Train a model from data)
• This model encapsulates or generalizes the data
• (Validate the model against test data)
• This model transforms features into labels
• Continuous outputs (e.g. real numbers) are regressions
• Discrete outputs (e.g. categories) are classifications
ML terms & process
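The train / validate steps can be sketched with a deliberately simple model, a nearest-centroid classifier. This is an illustrative choice, not a method named in the slides, and the feature values and labels are invented. Because the outputs are discrete labels, this is a classification.

```python
def train_nearest_centroid(features, labels):
    """'Train': generalise each class to the mean of its feature vectors."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Transform features into a label: pick the class with the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))

# Invented training data: two features per patient
train_X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
train_y = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

# Held-out test data, never seen during training
test_X = [[1.0, 1.0], [3.0, 3.0], [0.9, 1.3], [2.8, 3.3]]
test_y = ["healthy", "disease", "healthy", "disease"]

model = train_nearest_centroid(train_X, train_y)
predictions = [predict(model, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```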
• Take gene expression profiles from patients and cluster to:
• See genes with similar expression profiles
• Similar patients
• Train a model on radiographs with tumours labelled, use to diagnose
unlabelled images
• Find patients with similar symptoms & signs (computational
phenotypes) in EHR
• Train on histories of patients to forecast their future condition
• Find out how terms in a medical corpus relate to each other
Examples of ML
It’s everywhere
Unsupervised learning: clustering
 What does ‘similar’ mean? How
do we measure it?
 Which features & how weighted?
 Noise & overlapping clusters
 Non-numeric, non-ordered data
 What shapes can clusters be?
 How many clusters? When do we
stop?
 …
Clustering isn’t simple
By Chire - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=17085331
Varies but:
 Start with record-feature matrix
 Normalise data
 (“Supervised”: select number of
clusters)
 Run algorithm
 Validate
Clustering process
WikiMedia Commons
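The process above can be sketched with k-means, one of many possible algorithms. The record-feature matrix is invented, the starting centroids are an arbitrary hypothetical choice, and the sketch assumes no feature is constant (otherwise the normalisation would divide by zero).

```python
import statistics

def zscore_columns(matrix):
    """Normalise each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, n_iter=10):
    """Plain k-means: alternate assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(n_iter):
        # Assign every record to its nearest centroid
        assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Rows are records, columns are features on very different scales
records = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0],
           [3.0, 800.0], [3.2, 820.0], [2.9, 790.0]]
normalised = zscore_columns(records)
# Number of clusters (2) is selected by hand via the starting centroids
labels, centroids = kmeans(normalised, [normalised[0], normalised[-1]])
print(labels)
```

Without the normalisation step the second feature would dominate every distance. Validation is a separate step.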
How not to do it
 A cluster partitioning is a hypothesis
 How do we assess? Validate:
 External: compare against external label or data
 e.g. accuracy, entropy
 Internal: goodness of clustering
 e.g. sum squared errors, cluster cohesion & separation,
silhouette
 Relative: against another clustering scheme
 e.g. is this better with 3 or 4 clusters
Validating clusters
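A small sketch of external validation using entropy: for each cluster, measure how mixed the external labels are, then average weighted by cluster size. Zero means every cluster is pure. The function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids, true_labels):
    """Size-weighted mean entropy (in bits) of external labels within clusters."""
    n = len(cluster_ids)
    total = 0.0
    for c in set(cluster_ids):
        members = [t for cid, t in zip(cluster_ids, true_labels) if cid == c]
        counts = Counter(members)
        # Shannon entropy of the label distribution inside this cluster
        h = -sum((k / len(members)) * math.log2(k / len(members))
                 for k in counts.values())
        total += len(members) / n * h
    return total

external = ["asthma", "asthma", "copd", "copd"]
pure = cluster_entropy([0, 0, 1, 1], external)   # clusters match the labels
mixed = cluster_entropy([0, 1, 0, 1], external)  # clusters cut across the labels
print(pure, mixed)
```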
Average over each point:
1. Calculate the average distance to all
other members of its cluster, a
2. For each other cluster, calculate the
average distance to every member.
The minimum of these is b
3. The silhouette width is (b−a) /
max(a,b), the higher the better
Clustering process
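The three steps above translate almost directly into code. This sketch assumes Euclidean distance, at least two clusters, and scores a point alone in its cluster as 0 (a common convention, not stated in the slide). The points are invented.

```python
import math

def silhouette_width(points, labels):
    """Mean silhouette width over all points, following the steps above."""
    clusters = set(labels)
    widths = []
    for i, p in enumerate(points):
        def mean_dist_to(cluster):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == cluster and j != i]
            return sum(ds) / len(ds) if ds else None

        # Step 1: average distance to the other members of its own cluster
        a = mean_dist_to(labels[i])
        if a is None:  # singleton cluster: assumed convention, width 0
            widths.append(0.0)
            continue
        # Step 2: smallest average distance to any other cluster
        b = min(mean_dist_to(c) for c in clusters if c != labels[i])
        # Step 3: silhouette width for this point
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_width(points, [0, 0, 1, 1])  # tight, well separated
bad = silhouette_width(points, [0, 1, 0, 1])   # clusters interleaved
print(round(good, 3), round(bad, 3))
```

The sensible partition scores close to 1, the interleaved one scores below 0.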
What if there are sub-clusters or
structure?
• Use hierarchical clustering
• Use homogeneity or
completeness metrics to
compare
Nesting & hierarchies
• Complex, heterogeneous
disease
• Many attempts at clustering
• Use transcriptomic &
proteomic data
• Validate with clinical
• 4 clusters with characteristic
genes & clinical behaviour
Example: asthma
 a.k.a. deep learning, (artificial)
neural networks, “AI”
 A series of layers of nodes, each of
which transforms the previous layer.
 Training sets weights on
transformations
 Capable of learning representations
Supervised learning: deep networks
WikiMedia Commons
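A tiny sketch of "layers of nodes, each transforming the previous layer". The weights here are set by hand purely for illustration (training would normally set them), and the ReLU activation is an assumed choice. The hidden layer builds an intermediate representation that lets the output compute XOR, which no single layer of this kind can do.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One layer: each node is a weighted sum of the previous layer plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass the input through every layer; ReLU on all but the last."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return dense(activation, weights, biases)

# Hand-set (hypothetical) weights: hidden nodes compute x1+x2 and x1+x2-1
layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer, 2 nodes
    ([[1.0, -2.0]], [0.0]),                   # output layer, 1 node
]

outputs = [forward([x1, x2], layers)[0]
           for x1, x2 in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]]
print(outputs)
```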
 There’s little information in an
individual pixel (gene, data point …)
 But individual data points make up
more complete entities
 Each layer takes the layer below and
creates higher-level entities
(representations) from it.
 The system “recognises” higher-
level features that can appear
anywhere in the data.
What’s a representation?
WikiMedia Commons
 Radiologists are overwhelmed
 Want to catch errors &
double-check
 Train ANN over medical
imagery with tumours labelled
 Accuracy similar to humans
Example: diagnosis from medical imagery
From Nvidia
• The model is right but learns
the wrong thing (from our
point of view)
• Solutions:
• Interpreting models
• Better (more examined) data
Problem: useless solutions
Ribeiro et al. (2016) Why Should I Trust You?
 Reversing the model & asking “why”
 What features are important
 Mechanistic insight
 But many ML models are tangled & horribly complex
 And ML community often uninterested
 Solutions:
 Choose an interpretable model
 Software that explores feature space (LIME, Lift, IML)
Problem: interpretability
• Bias (systematic error) vs. Variance
(random error)
• Want a model that captures the
regularities in training data AND
generalizes to unseen data.
• Doing both perfectly is impossible: reducing one kind of error tends to increase the other
• Solutions:
• Use a variety of data
• Feature selection
• Regularization
Problem: how do models get it wrong?
From KDNuggets
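Regularization can be illustrated with the simplest case, a ridge penalty on a one-parameter linear model y ≈ w·x with no intercept. Minimising Σ(y − w·x)² + λ·w² gives w = Σxy / (Σx² + λ). The data and the penalty value are invented for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Closed-form slope for y ~ w*x with an L2 (ridge) penalty on w."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

unpenalised = ridge_slope(xs, ys, penalty=0.0)   # ordinary least squares
penalised = ridge_slope(xs, ys, penalty=14.0)    # slope shrunk towards zero
print(unpenalised, penalised)
```

The penalty pulls the fit away from the training data (more bias) in exchange for a less extreme, more stable parameter (less variance).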
• What do we want from our ML
models?
• Power / accuracy
• Insight
• Error tolerance
• e.g. drug discovery vs drug safety
Problem: how good do models have to be?
After Harel
• Much (most) data has few positives
• Results in an imbalanced model
• Solutions:
• Over- and under-sampling
• Pre-train with poor data
• Ensemble methods
Problem: imbalanced data & lack of data
DataScience.com
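A minimal sketch of random over-sampling, one of the options above: minority-class records are resampled with replacement until every class is as large as the biggest one. Record names, labels and the seed are hypothetical. In practice this would be applied to the training split only, so that duplicated records do not leak into validation.

```python
import random

def oversample_minority(records, labels, seed=0):
    """Duplicate randomly chosen records of smaller classes until all classes match."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    by_label = {}
    for record, label in zip(records, labels):
        by_label.setdefault(label, []).append(record)
    target = max(len(group) for group in by_label.values())

    out_records, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for record in group + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels

# Eight negatives, two positives: a typical imbalance
records = [f"patient_{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

balanced_records, balanced_labels = oversample_minority(records, labels)
print(balanced_labels.count("neg"), balanced_labels.count("pos"))
```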
 Machine learning uses large amounts of data with few assumptions to
make models that generalise that data
 This is useful for situations where we don’t have an explicit model and
just need ‘a’ solution.
 But this means we need to examine our data and validate our
solutions
 A ‘bad’ solution can be useful, depending on what you want to
achieve.
Summary: Machine Learning

More Related Content

What's hot

Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
Ola Spjuth
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
Warren Kibbe
 
Hands-on Introduction to Machine Learning
Hands-on Introduction to Machine LearningHands-on Introduction to Machine Learning
Hands-on Introduction to Machine Learning
Brittany Lasseigne, Ph.D.
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
Paul Groth
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Greg Landrum
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
Greg Landrum
 
Beyond Proofs of Concept for Biomedical AI
Beyond Proofs of Concept for Biomedical AIBeyond Proofs of Concept for Biomedical AI
Beyond Proofs of Concept for Biomedical AI
Paul Agapow
 
Building an informatics solution to sustain AI-guided cell profiling with hig...
Building an informatics solution to sustain AI-guided cell profiling with hig...Building an informatics solution to sustain AI-guided cell profiling with hig...
Building an informatics solution to sustain AI-guided cell profiling with hig...
Ola Spjuth
 
Heartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirtiHeartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirti
Pistoia Alliance
 
Advancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software AnalyticsAdvancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software Analytics
Tao Xie
 
Medical data diagnosis
Medical data diagnosisMedical data diagnosis
Medical data diagnosis
Bhargav Srinivasan
 
PA webinar on benefits & costs of FAIR implementation in life sciences
PA webinar on benefits & costs of FAIR implementation in life sciences PA webinar on benefits & costs of FAIR implementation in life sciences
PA webinar on benefits & costs of FAIR implementation in life sciences
Pistoia Alliance
 
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-ConferenceIlya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
NextBio
 
Considerations and challenges in building an end to-end microbiome workflow
Considerations and challenges in building an end to-end microbiome workflowConsiderations and challenges in building an end to-end microbiome workflow
Considerations and challenges in building an end to-end microbiome workflow
Eagle Genomics
 
AI is the Future of Drug Discovery
AI is the Future of Drug DiscoveryAI is the Future of Drug Discovery
AI is the Future of Drug Discovery
David Leahy
 
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity DebateIn Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
Neuroscience Information Framework
 
Social Networks and Collaborative Platforms for Data Sharing in Radiology
Social Networks and Collaborative Platforms for Data Sharing in RadiologySocial Networks and Collaborative Platforms for Data Sharing in Radiology
Social Networks and Collaborative Platforms for Data Sharing in Radiology
Erik R. Ranschaert, MD, PhD
 
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
The Statistical and Applied Mathematical Sciences Institute
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
IRJET Journal
 

What's hot (19)

Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Hands-on Introduction to Machine Learning
Hands-on Introduction to Machine LearningHands-on Introduction to Machine Learning
Hands-on Introduction to Machine Learning
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Beyond Proofs of Concept for Biomedical AI
Beyond Proofs of Concept for Biomedical AIBeyond Proofs of Concept for Biomedical AI
Beyond Proofs of Concept for Biomedical AI
 
Building an informatics solution to sustain AI-guided cell profiling with hig...
Building an informatics solution to sustain AI-guided cell profiling with hig...Building an informatics solution to sustain AI-guided cell profiling with hig...
Building an informatics solution to sustain AI-guided cell profiling with hig...
 
Heartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirtiHeartificial intelligence - claudio-mirti
Heartificial intelligence - claudio-mirti
 
Advancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software AnalyticsAdvancing Foundation and Practice of Software Analytics
Advancing Foundation and Practice of Software Analytics
 
Medical data diagnosis
Medical data diagnosisMedical data diagnosis
Medical data diagnosis
 
PA webinar on benefits & costs of FAIR implementation in life sciences
PA webinar on benefits & costs of FAIR implementation in life sciences PA webinar on benefits & costs of FAIR implementation in life sciences
PA webinar on benefits & costs of FAIR implementation in life sciences
 
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-ConferenceIlya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
 
Considerations and challenges in building an end to-end microbiome workflow
Considerations and challenges in building an end to-end microbiome workflowConsiderations and challenges in building an end to-end microbiome workflow
Considerations and challenges in building an end to-end microbiome workflow
 
AI is the Future of Drug Discovery
AI is the Future of Drug DiscoveryAI is the Future of Drug Discovery
AI is the Future of Drug Discovery
 
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity DebateIn Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
In Search of a Missing Link in the Data Deluge vs. Data Scarcity Debate
 
Social Networks and Collaborative Platforms for Data Sharing in Radiology
Social Networks and Collaborative Platforms for Data Sharing in RadiologySocial Networks and Collaborative Platforms for Data Sharing in Radiology
Social Networks and Collaborative Platforms for Data Sharing in Radiology
 
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
 

Similar to Big Data & ML for Clinical Data

Becoming Datacentric
Becoming DatacentricBecoming Datacentric
Becoming Datacentric
Timothy Cook
 
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
Health Catalyst
 
An introduction to machine learning in biomedical research: Key concepts, pr...
An introduction to machine learning in biomedical research:  Key concepts, pr...An introduction to machine learning in biomedical research:  Key concepts, pr...
An introduction to machine learning in biomedical research: Key concepts, pr...
FranciscoJAzuajeG
 
Melissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AIMelissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AI
melissadata
 
Introduction to machine_learning_us
Introduction to machine_learning_usIntroduction to machine_learning_us
Introduction to machine_learning_us
Anasua Sarkar
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Sri Ambati
 
AAPM Foster July 2009
AAPM Foster July 2009AAPM Foster July 2009
AAPM Foster July 2009
Ian Foster
 
informatics_future.pdf
informatics_future.pdfinformatics_future.pdf
informatics_future.pdf
AdhySugara2
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
Karry Lu
 
Frankie Rybicki slide set for Deep Learning in Radiology / Medicine
Frankie Rybicki slide set for Deep Learning in Radiology / MedicineFrankie Rybicki slide set for Deep Learning in Radiology / Medicine
Frankie Rybicki slide set for Deep Learning in Radiology / Medicine
Frank Rybicki
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
Robert Grossman
 
Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for Discovery
DayOne
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016
Anita de Waard
 
Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...
Fondazione Giannino Bassetti
 
(2017/06)Practical points of deep learning for medical imaging
(2017/06)Practical points of deep learning for medical imaging(2017/06)Practical points of deep learning for medical imaging
(2017/06)Practical points of deep learning for medical imaging
Kyuhwan Jung
 
Big Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical DevicesBig Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical Devices
PremNarayanan6
 
AI in Healthcare
AI in HealthcareAI in Healthcare
AI in Healthcare
Paul Agapow
 
Using Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and developmentUsing Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and development
Eleanor Howe
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use Cases
Josef Scheiber
 
Clinical Data and AI
Clinical Data and AIClinical Data and AI
Clinical Data and AI
Stefano Paluello
 

Similar to Big Data & ML for Clinical Data (20)

Becoming Datacentric
Becoming DatacentricBecoming Datacentric
Becoming Datacentric
 
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...
 
An introduction to machine learning in biomedical research: Key concepts, pr...
An introduction to machine learning in biomedical research:  Key concepts, pr...An introduction to machine learning in biomedical research:  Key concepts, pr...
An introduction to machine learning in biomedical research: Key concepts, pr...
 
Melissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AIMelissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AI
 
Introduction to machine_learning_us
Introduction to machine_learning_usIntroduction to machine_learning_us
Introduction to machine_learning_us
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
 
AAPM Foster July 2009
AAPM Foster July 2009AAPM Foster July 2009
AAPM Foster July 2009
 
informatics_future.pdf
informatics_future.pdfinformatics_future.pdf
informatics_future.pdf
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
Frankie Rybicki slide set for Deep Learning in Radiology / Medicine
Frankie Rybicki slide set for Deep Learning in Radiology / MedicineFrankie Rybicki slide set for Deep Learning in Radiology / Medicine
Frankie Rybicki slide set for Deep Learning in Radiology / Medicine
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for Discovery
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016
 
Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...
 
(2017/06)Practical points of deep learning for medical imaging
(2017/06)Practical points of deep learning for medical imaging(2017/06)Practical points of deep learning for medical imaging
(2017/06)Practical points of deep learning for medical imaging
 
Big Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical DevicesBig Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical Devices
 
AI in Healthcare
AI in HealthcareAI in Healthcare
AI in Healthcare
 
Using Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and developmentUsing Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and development
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use Cases
 
Clinical Data and AI
Clinical Data and AIClinical Data and AI
Clinical Data and AI
 

More from Paul Agapow

Can drug repurposing be saved with AI 202405.pdf
Can drug repurposing be saved with AI 202405.pdfCan drug repurposing be saved with AI 202405.pdf
Can drug repurposing be saved with AI 202405.pdf
Paul Agapow
 
IA, la clave de la genomica (May 2024).pdf
IA, la clave de la genomica (May 2024).pdfIA, la clave de la genomica (May 2024).pdf
IA, la clave de la genomica (May 2024).pdf
Paul Agapow
 
Digital Biomarkers, a (too) brief introduction.pdf
Digital Biomarkers, a (too) brief introduction.pdfDigital Biomarkers, a (too) brief introduction.pdf
Digital Biomarkers, a (too) brief introduction.pdf
Paul Agapow
 
How to make every mistake and still have a career, Feb2024.pdf
How to make every mistake and still have a career, Feb2024.pdfHow to make every mistake and still have a career, Feb2024.pdf
How to make every mistake and still have a career, Feb2024.pdf
Paul Agapow
 
ML, biomedical data & trust
ML, biomedical data & trustML, biomedical data & trust
ML, biomedical data & trust
Paul Agapow
 
Where AI will (and won't) revolutionize biomedicine
Where AI will (and won't) revolutionize biomedicineWhere AI will (and won't) revolutionize biomedicine
Where AI will (and won't) revolutionize biomedicine
Paul Agapow
 
Multi-omics for drug discovery: what we lose, what we gain
Multi-omics for drug discovery: what we lose, what we gainMulti-omics for drug discovery: what we lose, what we gain
Multi-omics for drug discovery: what we lose, what we gain
Paul Agapow
 
ML & AI in pharma: an overview
ML & AI in pharma: an overviewML & AI in pharma: an overview
ML & AI in pharma: an overview
Paul Agapow
 
ML & AI in Drug development: the hidden part of the iceberg
ML & AI in Drug development: the hidden part of the icebergML & AI in Drug development: the hidden part of the iceberg
ML & AI in Drug development: the hidden part of the iceberg
Paul Agapow
 
The End of the Drug Development Casino?
The End of the Drug Development Casino?The End of the Drug Development Casino?
The End of the Drug Development Casino?
Paul Agapow
 
Get yourself a better bioinformatics job
Get yourself a better bioinformatics jobGet yourself a better bioinformatics job
Get yourself a better bioinformatics job
Paul Agapow
 
Interpreting Complex Real World Data for Pharmaceutical Research
Interpreting Complex Real World Data for Pharmaceutical ResearchInterpreting Complex Real World Data for Pharmaceutical Research
Interpreting Complex Real World Data for Pharmaceutical Research
Paul Agapow
 
Filling the gaps in translational research
Filling the gaps in translational researchFilling the gaps in translational research
Filling the gaps in translational research
Paul Agapow
 
Bioinformatics! (What is it good for?)
Bioinformatics! (What is it good for?)Bioinformatics! (What is it good for?)
Bioinformatics! (What is it good for?)
Paul Agapow
 
Machine Learning for Preclinical Research
Machine Learning for Preclinical ResearchMachine Learning for Preclinical Research
Machine Learning for Preclinical Research
Paul Agapow
 
AI for Precision Medicine (Pragmatic preclinical data science)
AI for Precision Medicine (Pragmatic preclinical data science)AI for Precision Medicine (Pragmatic preclinical data science)
AI for Precision Medicine (Pragmatic preclinical data science)
Paul Agapow
 
Patient subtypes: real or not?
Patient subtypes: real or not?Patient subtypes: real or not?
Patient subtypes: real or not?
Paul Agapow
 
Big biomedical data is a lie
Big biomedical data is a lieBig biomedical data is a lie
Big biomedical data is a lie
Paul Agapow
 
eTRIKS at Pharma IT 2017, London
eTRIKS at Pharma IT 2017, LondoneTRIKS at Pharma IT 2017, London
eTRIKS at Pharma IT 2017, London
Paul Agapow
 
Introduction to Snakemake
Introduction to SnakemakeIntroduction to Snakemake
Introduction to Snakemake
Paul Agapow
 


Big Data & ML for Clinical Data

  • 1. Big Data & Machine Learning for Clinical Data Paul Agapow <p.agapow@imperial.ac.uk> Data Science Institute, Imperial College London
  • 2. • Biomedical science is now data science • I was a biochemist, immunologist, and then an infectious disease bioinformatician • I’m now a “biomedical data scientist” • I will be a Health Informatics Director at AstraZeneca About me & these lectures WikiMedia Commons
  • 3.  We increasingly use & need:  Lots of complex data  Real world evidence (outside RCTs)  Computational tools  Statistical analysis  Complex interactions  Precision medicine: prediction & (sub)typing  Also:  Cheap  Successful in other domains  But lots of hype and jargon Biomedical science is now data science WikiMedia Commons
  • 4. • The world is increasingly “datafied” – we make more and bigger datasets • Devices • Routine collection • Aggregation & integration • Big Data is “too big” for conventional approaches Part 1: Big Data WikiMedia Commons
  • 5.  “Quantity has a quality of its own”  Often free  Real  Rich, deep, interactions  Needed for ML and other assumption-light approaches Why Big Data? By Ender005 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49888192
  • 6.  Many diseases with the same clinical presentation have different molecular phenotypes  Several overlapping terms  stratified: separate patients into groups for treatment  precision:  tailor treatment to individual  improved targeted therapies with fewer side effects  “Right medication, right dose, right patient, right time, right route”  Also personalised, P4 …  E.g. asthma Why Big Data? Precision medicine
  • 7.  Volume  Velocity  Variety  Veracity  Value The 3 / 4 / 5 Vs of Big Data By MuhammadAbuHijleh - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=46431834
  • 8.  Limits labile to technological progress  Memory  Compute  Data schema  Solutions: distributed & parallel computation, new high-end databases The problem with volume: tools & platforms WikiMedia Commons
  • 9.  Multiple hypothesis testing and false discovery  Bias: a sample is not the population  The Past is not the Present  Observation without understanding  The curse of dimensionality  Privacy  Some ML-specific issues The problem with volume: methodology From KDNuggets
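The first bullet on slide 9 can be made concrete with a little arithmetic. The sketch below assumes independent tests that all use the same significance threshold and that every null hypothesis is true. The function names are illustrative, and the Bonferroni correction is one standard remedy chosen here as an example, not something the slides prescribe.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # The more hypotheses we test, the more likely a spurious "discovery".
    for m in (1, 20, 1000):
        print(m, familywise_error_rate(m), bonferroni_threshold(m))
```

With 20 tests at a threshold of 0.05, the chance of at least one false discovery is already about 64%, which is why large exploratory datasets need some form of correction.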
  • 10.  Many, many types of data  How do we use multiple types?  Which type do we use?  Disease is systemic  Interactions  Evidence  Solutions: integrated analysis, independent analysis with validation The problem with variety Wu, Sanin, Wang (2016) Clinical Applications and Systems Biomedicine
  • 11.  Much biodata is uncertain  Noise  Mistakes  People lie  A sample is not a population  Incompatible systems  Most analyses are not reproducible  Solutions: imputation, standards, cross-validation etc. The problem with veracity By Khaydock - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25102900
  • 12. • How do we • Re-use data • Compare data • Store data from multiple sources • Even know what data is • FAIR, OHDSI / OMOP, HPO • Even just metadata helps for cataloguing • But: multiple & incomplete standards, translation, complexity Solution: Standards & ontologies WikiMedia Commons
  • 13.  Much data cannot leave its home institution  Hospitals  Registries  Insurance companies  Governance is hard & slow  So take the analysis to the data  Data looks the same but may be internally different Solution: Federated analysis International Collaboration for Autism Registry Epidemiology
  • 14.  In a vast sea of biodata, how do you discover anything? How do you avoid cherry-picking?  Solutions:  Distinguish discovery from exploration  Non-parametric methods (e.g. machine learning)  Some problems don’t have a single solution but many (e.g. prediction) The problem with it all: discoverability EnterpriseKnowledge.com
  • 15.  Write analyses as recipes  Snakemake, Nextflow, Flowr  Use recreatable computational systems  Docker  “Your biggest collaborator is you, six months ago”  But: it’s work Solution: Reproducibility From RevolutionR
  • 16.  Big Data is “too big” for current conventional tools & practices  But it’s ideal for solving many biomedical problems  There are problems with valid discovery and just handling the data  Standards, distributed databases and analysis and Summary: Big Data
  • 17.  “a field of Artificial Intelligence”  “(the science of) getting computers to learn and act like humans do”  “getting computers to act without being explicitly programmed”  “computer systems that automatically improve with experience”  “neural networks”  “using statistical techniques to give computer systems the ability to learn” Part 2: Machine Learning
  • 18. In practice:  broadly-defined set of algorithms that recognise & generalise patterns in data  “non-parametric” or assumption-light  may require training over initial dataset What is Machine Learning? By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=11711077
  • 19.  Enough data  Enough compute  Technical progress  Need 'good enough' solutions  Prediction & forecasting  Categorization  Pattern recognition  Early, startling success Why now? Ray Kurzweil The Singularity is Near
  • 20. How is ML different to stats?
  • 21. How is ML different to stats?
                      Statistical     Machine
        Assumptions   strong          weak
        Data          small           large
        Optimize by   fitting         training
        Solutions     “the best”      “good enough”
        Hypothesis    proof           exploration
        Test          p-values etc.   validation
  • 22. In practice:  a field of scientific research  machine learning  neural networks  deep learning  more of an objective than a methodology  computational systems that duplicate / emulate / replace human effort What is Artificial Intelligence
  • 23. • Many methods • Broadly split into: • Unsupervised: finds structure within data • e.g. (most) clustering, self-organising maps, principal component analysis • Supervised: trained using labelled examples • e.g. regression, decision trees, naive Bayes, neural networks • Categories can blur • e.g. k-means, nearest neighbour? • Which is better? What are ML methods?
  • 24. • (Train a model from data) • This model encapsulates or generalizes the data • (Validate the model against test data) • This model transforms features into labels • Continuous outputs (e.g. real numbers) are regressions • Discrete outputs (e.g. categories) are classifications ML terms & process
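The train-then-validate loop on slide 24 can be sketched in a few lines. This is a minimal illustration only: the nearest-centroid classifier is chosen for brevity and is not named in the lecture, and the patient data and all function names are invented.

```python
import math
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle (features, label) records and hold a fraction back for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(records):
    """Generalise the training data: one mean feature vector (centroid) per label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Transform features into a label: the label of the nearest centroid."""
    return min(model, key=lambda label: math.dist(features, model[label]))


def accuracy(model, records):
    """Fraction of records whose predicted label matches the true label."""
    return sum(predict(model, f) == y for f, y in records) / len(records)


# Invented toy data: two made-up measurements per patient and a discrete label,
# so this is a classification task (a continuous target would make it regression).
PATIENTS = [
    ((1.0, 1.2), "healthy"), ((0.8, 1.0), "healthy"),
    ((1.2, 0.9), "healthy"), ((0.9, 1.1), "healthy"),
    ((3.0, 3.2), "disease"), ((3.1, 2.9), "disease"),
    ((2.8, 3.0), "disease"), ((3.2, 3.1), "disease"),
]

train_set, test_set = train_test_split(PATIENTS)
model = train(train_set)
print("held-out accuracy:", accuracy(model, test_set))
```

The key point is that accuracy is measured on records the model never saw during training; scoring on the training set would only show how well the model memorised it.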
  • 25. • Take gene expression profiles from patients and cluster to: • See genes with similar expression profiles • Similar patients • Train a model on radiographs with tumours labelled, use to diagnose unlabelled images • Find patients with similar symptoms & signs (computational phenotypes) in EHR • Train on histories of patients to forecast their future condition • Find out how terms in a medical corpus relate to each other Examples of ML
  • 28.  What does ‘similar’ mean? How do we measure it?  Which features & how weighted?  Noise & overlapping clusters  Non-numeric, non-ordered data  What shapes can clusters be?  How many clusters? When do we stop?  … Clustering isn’t simple By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17085331
  • 29. Varies but:  Start with record-feature matrix  Normalise data  (“Supervised”: select number of clusters)  Run algorithm  Validate Clustering process WikiMedia Commons
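The steps on slide 29 can be walked through end to end on a toy record-feature matrix. This is a sketch under stated assumptions: the lecture does not name an algorithm, so plain k-means (Lloyd's algorithm) is swapped in for illustration, the records are invented, and within-cluster sum of squared errors stands in for the validation step.

```python
import math
import random
import statistics


def zscore_columns(matrix):
    """Normalise each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]  # constant column: divide by 1
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in matrix]


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    centroids = random.Random(seed).sample(rows, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(r, centroids[c])) for r in rows]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(statistics.fmean(col) for col in zip(*members))
    return labels


def within_cluster_sse(rows, labels):
    """Internal validation: sum of squared distances to each cluster's centroid."""
    sse = 0.0
    for c in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        centroid = tuple(statistics.fmean(col) for col in zip(*members))
        sse += sum(math.dist(r, centroid) ** 2 for r in members)
    return sse


# 1. Record-feature matrix (invented): one row per patient, columns on very
#    different scales, e.g. age in years and a made-up biomarker level.
RECORDS = [(30, 1000), (32, 1100), (31, 1050), (60, 5000), (62, 5200), (61, 5100)]
rows = zscore_columns(RECORDS)          # 2. normalise
labels = kmeans(rows, k=2)              # 3-4. choose k, run the algorithm
sse = within_cluster_sse(rows, labels)  # 5. validate
print(labels, sse)
```

Without the normalisation step the biomarker column would dominate every distance simply because its numbers are larger, which is why normalising comes before clustering in the list above.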
  • 30. How not to do it
  • 31.  A cluster partitioning is a hypothesis  How do we assess? Validate:  External: compare against external label or data  e.g. accuracy, entropy  Internal: goodness of clustering  e.g. sum squared errors, cluster cohesion & separation, silhouette  Relative: against another clustering scheme  e.g. is this better with 3 or 4 clusters Validating clusters
  • 32. For each point: 1. Calculate a, the average distance to all other members of its own cluster. 2. For each other cluster, calculate the average distance to every member of that cluster; b is the minimum of these. 3. The silhouette width is (b−a) / max(a,b); the higher the better. Average the widths over all points to score the whole clustering. Clustering process
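The silhouette calculation on slide 32 translates almost line for line into code. The sketch below assumes Euclidean distance and scores a point that is alone in its cluster as 0 (a common convention, not stated on the slide); the function name and the toy points are illustrative.

```python
import math


def silhouette_width(points, labels):
    """Mean silhouette width; labels[i] is the cluster of points[i]."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: average distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Two tight, well-separated pairs of invented points.
POINTS = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_width(POINTS, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_width(POINTS, [0, 1, 0, 1])   # pairs split across clusters
print(good, bad)
```

The score lies between -1 and 1. Here the sensible partition scores close to 1, while the partition that splits each pair scores below 0, so the metric can be used for the relative comparison described on slide 31.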
  • 33. What if there are sub-clusters or structure? • Use hierarchical clustering • Use homogeneity or completeness metrics to compare Nesting & hierarchies
  • 34. • Complex, heterogeneous disease • Many attempts at clustering • Use transcriptomic & proteomic data • Validate with clinical • 4 clusters with characteristic genes & clinical behaviour Example: asthma
  • 35.  a.k.a. deep learning, (artificial) neural networks, “AI”  A series of layers of nodes, each of which transforms the previous layer.  Training sets weights on transformations  Capable of learning representations Supervised learning: deep networks WikiMedia Commons
  • 36.  There’s little information in an individual pixel (gene, data point …)  But individual data points make up more complete entities  Each layer takes the layer below and creates higher-level entities (representations) from it.  The system “recognises” higher- level features that can appear anywhere in the data. What’s a representation? WikiMedia Commons
  • 37.  Radiologists are overwhelmed  Want to catch errors & double-check  Train ANN over medical imagery with tumour labelled  Accuracy similar to humans Example: diagnosis from medical imagery From Nvidia
  • 38. • The model is right but learns the wrong thing (from our point of view) • Solutions: • Interpreting models • Better (more examined) data Problem: useless solutions Ribeiro et al. (2016) Why Should I Trust You?
  • 39. • Reversing the model & asking “why” • What features are important • Mechanistic insight • But many ML models are tangled & horribly complex • And ML community often uninterested • Solutions: • Choose an interpretable model • Software that explores feature space (LIME, Lift, IML) Problem: interpretability
  • 40. • Bias (systematic error) vs. Variance (random error) • Want a model that captures the regularities in training data AND generalizes to unseen data. • This is impossible • Solutions: • Use a variety of data • Feature selection • Regularization Problem: how do models get it wrong? From KDNuggets
  • 41. • What do we want from our ML models? • Power / accuracy • Insight • Error tolerance • e.g. drug discovery vs drug safety Problem: how good do models have to be? After Harel
  • 42. • Much (most) data has few positives • Results in an imbalanced model • Solutions: • Over- and under-sampling • Pre-train with poor data • Ensemble methods Problem: imbalanced data & lack of data DataScience.com
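The first solution on slide 42, over- and under-sampling, can be sketched as simple random resampling. This is a minimal illustration with invented records and hypothetical function names; more elaborate schemes exist but are not covered by the slide.

```python
import random
from collections import defaultdict


def _group_by_label(records):
    """Collect (features, label) records into one list per label."""
    groups = defaultdict(list)
    for features, label in records:
        groups[label].append((features, label))
    return groups


def random_oversample(records, seed=0):
    """Duplicate minority-class records at random until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    groups = _group_by_label(records)
    target = max(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


def random_undersample(records, seed=0):
    """Discard majority-class records at random until every class
    is as small as the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(records)
    target = min(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rng.sample(rows, target))
    return balanced


# Invented example: nine negatives and a single positive.
RECORDS = [((i,), 0) for i in range(9)] + [((99,), 1)]
over = random_oversample(RECORDS)
under = random_undersample(RECORDS)
print(len(over), len(under))
```

Resampling should be applied to the training portion only, after the test data has been set aside; otherwise duplicated records can appear on both sides of the split and inflate the validation score.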
  • 43.  Machine learning uses large amounts of data with few assumptions to make models that generalise that data  This is useful for situations where we don’t have an explicit model and just need ‘a’ solution.  But this means we need to examine our data and validate our solutions  A ‘bad’ solution can be useful, depending on what you want to achieve. Summary: Machine Learning