SlideShare a Scribd company logo
Understanding your data
with Bayesian networks
(in python)
Bartek Wilczyński
bartek@mimuw.edu.pl
University of Warsaw
PyData Silicon Valey, May 5th 2014
Are you confused enough?
Or should I confuse you a bit more ?
Image from xkcd.org/552/
Data show: Confused students score better!
Data from Eric Mazur
There may be factors we haven't thought about
● Maybe confusion helps
with learning?
● Or maybe there is
an alternative explanation?
● As long as these are just
cartoon models – we
cannot really rule out any
structure
Paying
attention
Being
confused
Correct
answer
Being
confused
Correct
answer
or
What do I mean by data?
Sex Age Smoking Stress Lung Heart Feel
M 0-20 never N No no great
F 70 sometimes N minor no OK
M 50-70 daily Y no severe Not-so-well
M 20-50 daily N no minor OK
F 70 never N no minor great
F 20-50 sometimes Y severe minor Not-so-well
F 20-50 never Y no no great
M 20-50 sometimes N minor no great
M 50-70 never Y severe no OK
F 0-20 never N no severe OK
M 20-50 daily Y no no OK
M 0-20 daily N no no Not-so-well
M 20-50 never N minor no OK
.... ... ... ... ... ... ...
Network of connections
Smoking
(daily, sometimes, never)
Age
(0-20,20-50, 50-70,70+)
Stressful job
(yes,no)
Lung problems
(no,minor,severe)
Heart problems
(no,minor,severe)
Sex
(male,female)
How did you feel this morning?
(great, OK, not-so-well, terrible)
What is a Bayesian Network ?
●
A directed acyclic graph without cycles
●
with nodes representing random variables
●
and edges between nodes representing dependencies
(not necessarily causal)
●
Each edge is directed from a parent to a child, so all
nodes with connections to a given node constitute its
set of parents
●
Each variable is associated with a value domain and a
probability distribution conditional on parents' values
Back to our confused students
● Let us consider our model of
confused students
● We can consider the model
with an additional variable
● We need to heve data on the
additional variable to be
predictive
● Sometimes we need to use
“wrong” models if they are
predictive
Paying
attention
Being
confused
Correct
answer
Paying attention
yes no
confused 80% 0%
not confused 20% 100%
Paying
attention
Being
confused
Correct
answer
Paying attention
yes no
correct 50% 20%
incorrect 50% 80%
Can we find the “best” Bayesian Network?
● Given a dataset with observations,
we can try to find the “best”
network topology (i.e. the best
collection of parents' sets)
● In order to do it automatically we
need a scoring function to define
what we mean by “best”
● A score function is useful if it can
be written as a sum over
variables, i.e. the best network
consists of best parent sets for
variables (modulo acyclicity)
How to find the best network?
● There are generally three main approaches to defining BN scores:
– Bayesian statistics, e.g. BDe (Herskovits et al. '95)
– Information Theoretic, e.g. MDL (Lam et al. '94)
– Hypothesis testing, e.g. MMPC (Salehi et al. '10)
● There are also hybrid approaches, like the recent MIT (de Campos '06)
approach that uses information theory and hypothesis testing
● We have two issues:
– There are exponentially many potential parent sets
– The desired network needs to have no cycles
● The second issue is more important and makes the problem NP-complete
(Chickering '96)
Cycles are not always a problem
● Dynamic Bayesian
Networks are avariant of
BN models that describe
temporal dependencies
● We can safely assume that
the causal links only go
forward in time
● That breaks the problem of
cycles as we now have two
versions of each variable:
“before” and “after”
X1
X2
X3
X1 X1
t t+1
X2 X2
X3 X3
Different types of variables
● Another common situation is
when we have different types
of variables
● We may know that only
certain types of connections
are causal
● Or we may be interested only in
certain types of connections
● This breaks the cycles as well
Mutations
Protein expression
Diseases
BNFinder – python library for Bayesian Networks
● A library for identification of
optimal Bayesian Networks
● Works under assumption of
acyclicity by external
constraints (disjoint sets of
variables or dynamic
networks)
● fast and efficient (relatively)
Example1 – the simplest possible
Now, parallellize!
● Since we have external
constraints on acyclicity, we
can search for parent sets
independently
● This leads to a simple
parallelization scheme and
good efficiency
Bonn et al. Nat. Genet, 2012
Active Inactive
Making the training set for “activity” variable
Handling continuous data
Network model
Does it provide useful predictions?
• 12 positive and 4 negative predictions tested
• >90% success (1 error)
Some more continuous data with perturbations
• 8008 enhancers compiled
from 15 ChIP experiments
(almost 20k binding peaks)
• Activity data for ~140
enhancers divided into
– 3 tissues (MESO, VM, SM)
– 5 stages
(4-6,7-8,9-10,1112,13-16)
• Gene expression data for
5082 genes from the BDGP
database
Wilczynski et al.PLoS Comp.Biol 2012
Predictions validated:
19/20 correct stage, 10/20 correct tissue
Summary
● Bayesian Networks can provide predictive models based on
conditional probability distributions
● BNFinder is an effective tool for finding optimal networks given
tabular data. And it's open source!
● It can be used as a commandline tool or as a library
● It can use continuous data as well as discrete
● Can be run in parallel on multiple cores (with good efficiency)
● Convenience functions (cross-validation, ROC plots) included
http://launchpad.net/bnfinder
Thanks!
● Norbert Dojer
● Alina Frolova
● Paweł Bednarz
● Agnieszka Podsiadło
● Questions?

More Related Content

What's hot

Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
Krishnaram Kenthapadi
 

What's hot (20)

Towards Causal Representation Learning
Towards Causal Representation LearningTowards Causal Representation Learning
Towards Causal Representation Learning
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Ml2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regressionMl2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regression
 
Confusion Matrix
Confusion MatrixConfusion Matrix
Confusion Matrix
 
Understanding feature-space
Understanding feature-spaceUnderstanding feature-space
Understanding feature-space
 
Semi-Supervised Learning
Semi-Supervised LearningSemi-Supervised Learning
Semi-Supervised Learning
 
Learning from imbalanced data
Learning from imbalanced data Learning from imbalanced data
Learning from imbalanced data
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
Lecture 6 expert systems
Lecture 6   expert systemsLecture 6   expert systems
Lecture 6 expert systems
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
Regularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxRegularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptx
 
Destek Vektör Makineleri - Support Vector Machine
Destek Vektör Makineleri - Support Vector MachineDestek Vektör Makineleri - Support Vector Machine
Destek Vektör Makineleri - Support Vector Machine
 
Linear algebra for deep learning
Linear algebra for deep learningLinear algebra for deep learning
Linear algebra for deep learning
 
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
 
Loss Function.pptx
Loss Function.pptxLoss Function.pptx
Loss Function.pptx
 
Deep learning health care
Deep learning health care  Deep learning health care
Deep learning health care
 

Similar to Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
rhetttrevannion
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 
Lecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can LearnLecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can Learn
Kodok Ngorex
 

Similar to Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014 (20)

The current state of prediction in neuroimaging
The current state of prediction in neuroimagingThe current state of prediction in neuroimaging
The current state of prediction in neuroimaging
 
41 essential machine learning interview questions!
41 essential machine learning interview questions!41 essential machine learning interview questions!
41 essential machine learning interview questions!
 
Machine Learning Interview Questions Answers
Machine Learning Interview Questions AnswersMachine Learning Interview Questions Answers
Machine Learning Interview Questions Answers
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random Intercepts
 
Machine Learning - Deep Learning
Machine Learning - Deep LearningMachine Learning - Deep Learning
Machine Learning - Deep Learning
 
AI Unit 5 machine learning
AI Unit 5 machine learning AI Unit 5 machine learning
AI Unit 5 machine learning
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
35878 Topic Discussion5Number of Pages 1 (Double Spaced).docx
 
Machine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgeMachine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledge
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Lecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can LearnLecture7 Ml Machines That Can Learn
Lecture7 Ml Machines That Can Learn
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
 
Big Data & ML for Clinical Data
Big Data & ML for Clinical DataBig Data & ML for Clinical Data
Big Data & ML for Clinical Data
 
Dive into the Data
Dive into the DataDive into the Data
Dive into the Data
 
Ai4life aiml-xops-sig
Ai4life aiml-xops-sigAi4life aiml-xops-sig
Ai4life aiml-xops-sig
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 

More from PyData

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 

Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

  • 1. Understanding your data with Bayesian networks (in python) Bartek Wilczyński bartek@mimuw.edu.pl University of Warsaw PyData Silicon Valey, May 5th 2014
  • 2. Are you confused enough? Or should I confuse you a bit more ? Image from xkcd.org/552/
  • 3. Data show: Confused students score better! Data from Eric Mazur
  • 4. There may be factors we haven't thought about ● Maybe confusion helps with learning? ● Or maybe there is an alternative explanation? ● As long as these are just cartoon models – we cannot really rule out any structure Paying attention Being confused Correct answer Being confused Correct answer or
  • 5. What do I mean by data? Sex Age Smoking Stress Lung Heart Feel M 0-20 never N No no great F 70 sometimes N minor no OK M 50-70 daily Y no severe Not-so-well M 20-50 daily N no minor OK F 70 never N no minor great F 20-50 sometimes Y severe minor Not-so-well F 20-50 never Y no no great M 20-50 sometimes N minor no great M 50-70 never Y severe no OK F 0-20 never N no severe OK M 20-50 daily Y no no OK M 0-20 daily N no no Not-so-well M 20-50 never N minor no OK .... ... ... ... ... ... ...
  • 6. Network of connections Smoking (daily, sometimes, never) Age (0-20,20-50, 50-70,70+) Stressful job (yes,no) Lung problems (no,minor,severe) Heart problems (no,minor,severe) Sex (male,female) How did you feel this morning? (great, OK, not-so-well, terrible)
  • 7. What is a Bayesian Network ? ● A directed acyclic graph without cycles ● with nodes representing random variables ● and edges between nodes representing dependencies (not necessarily causal) ● Each edge is directed from a parent to a child, so all nodes with connections to a given node constitute its set of parents ● Each variable is associated with a value domain and a probability distribution conditional on parents' values
  • 8. Back to our confused students ● Let us consider our model of confused students ● We can consider the model with an additional variable ● We need to heve data on the additional variable to be predictive ● Sometimes we need to use “wrong” models if they are predictive Paying attention Being confused Correct answer Paying attention yes no confused 80% 0% not confused 20% 100% Paying attention Being confused Correct answer Paying attention yes no correct 50% 20% incorrect 50% 80%
  • 9. Can we find the “best” Bayesian Network? ● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parents' sets) ● In order to do it automatically we need a scoring function to define what we mean by “best” ● A score function is useful if it can be written as a sum over variables, i.e. the best network consists of best parent sets for variables (modulo acyclicity)
  • 10. How to find the best network? ● There are generally three main approaches to defining BN scores: – Bayesian statistics, e.g. BDe (Herskovits et al. '95) – Information Theoretic, e.g. MDL (Lam et al. '94) – Hypothesis testing, e.g. MMPC (Salehi et al. '10) ● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing ● We have two issues: – There are exponentially many potential parent sets – The desired network needs to have no cycles ● The second issue is more important and makes the problem NP-complete (Chickering '96)
  • 11. Cycles are not always a problem ● Dynamic Bayesian Networks are avariant of BN models that describe temporal dependencies ● We can safely assume that the causal links only go forward in time ● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after” X1 X2 X3 X1 X1 t t+1 X2 X2 X3 X3
  • 12. Different types of variables ● Another common situation is when we have different types of variables ● We may know that only certain types of connections are causal ● Or we may be interested only in certain types of connections ● This breaks the cycles as well Mutations Protein expression Diseases
  • 13. BNFinder – python library for Bayesian Networks ● A library for identification of optimal Bayesian Networks ● Works under assumption of acyclicity by external constraints (disjoint sets of variables or dynamic networks) ● fast and efficient (relatively)
  • 14. Example1 – the simplest possible
  • 15. Now, parallellize! ● Since we have external constraints on acyclicity, we can search for parent sets independently ● This leads to a simple parallelization scheme and good efficiency
  • 16. Bonn et al. Nat. Genet, 2012
  • 18. Making the training set for “activity” variable
  • 21.
  • 22. Does it provide useful predictions? • 12 positive and 4 negative predictions tested • >90% success (1 error)
  • 23. Some more continuous data with perturbations
  • 24. • 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks) • Activity data for ~140 enhancers divided into – 3 tissues (MESO, VM, SM) – 5 stages (4-6,7-8,9-10,1112,13-16) • Gene expression data for 5082 genes from the BDGP database Wilczynski et al.PLoS Comp.Biol 2012
  • 25.
  • 26. Predictions validated: 19/20 correct stage, 10/20 correct tissue
  • 27. Summary ● Bayesian Networks can provide predictive models based on conditional probability distributions ● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source! ● It can be used as a commandline tool or as a library ● It can use continuous data as well as discrete ● Can be run in parallel on multiple cores (with good efficiency) ● Convenience functions (cross-validation, ROC plots) included http://launchpad.net/bnfinder
  • 28. Thanks! ● Norbert Dojer ● Alina Frolova ● Paweł Bednarz ● Agnieszka Podsiadło ● Questions?