Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014


Today's world is full of data that is easily accessible to anyone. The problem now is how to make sense of this data and extract useful insights from terabytes of raw material. Typically, this involves machine learning tools that let you build classifiers, cluster data, and so on. Many of these approaches give you models that describe the data accurately but may be difficult to interpret. If you want to understand the result more intuitively, it is worth looking at Bayesian networks: a graphical representation that reduces a complex mathematical model to the most likely graph of dependencies between your variables. I will talk about BNFinder, a Python library that takes any tabular data and converts it to a much simpler representation of conditional dependencies between variables. The resulting model can then be used to classify unseen objects, while the connection structure can be interpreted even by a non-specialist. BNFinder is publicly available under the GNU GPL and anyone can use it on their own data.

Published in: Technology, Education

  1. 1. Understanding your data with Bayesian networks (in Python) Bartek Wilczyński bartek@mimuw.edu.pl University of Warsaw PyData Silicon Valley, May 5th 2014
  2. 2. Are you confused enough? Or should I confuse you a bit more? Image from xkcd.org/552/
  3. 3. Data show: Confused students score better! Data from Eric Mazur
  4. 4. There may be factors we haven't thought about ● Maybe confusion helps with learning? ● Or maybe there is an alternative explanation? ● As long as these are just cartoon models, we cannot really rule out any structure [Figure: two alternative network structures, one over "Paying attention", "Being confused" and "Correct answer", the other over "Being confused" and "Correct answer" only]
  5. 5. What do I mean by data?

     Sex  Age    Smoking    Stress  Lung    Heart   Feel
     M    0-20   never      N       no      no      great
     F    70+    sometimes  N       minor   no      OK
     M    50-70  daily      Y       no      severe  Not-so-well
     M    20-50  daily      N       no      minor   OK
     F    70+    never      N       no      minor   great
     F    20-50  sometimes  Y       severe  minor   Not-so-well
     F    20-50  never      Y       no      no      great
     M    20-50  sometimes  N       minor   no      great
     M    50-70  never      Y       severe  no      OK
     F    0-20   never      N       no      severe  OK
     M    20-50  daily      Y       no      no      OK
     M    0-20   daily      N       no      no      Not-so-well
     M    20-50  never      N       minor   no      OK
     ...  ...    ...        ...     ...     ...     ...

  6. 6. Network of connections [Figure: network over the following variables] ● Sex (male, female) ● Age (0-20, 20-50, 50-70, 70+) ● Smoking (daily, sometimes, never) ● Stressful job (yes, no) ● Lung problems (no, minor, severe) ● Heart problems (no, minor, severe) ● How did you feel this morning? (great, OK, not-so-well, terrible)
  7. 7. What is a Bayesian Network? ● A directed acyclic graph (DAG) ● with nodes representing random variables ● and edges between nodes representing dependencies (not necessarily causal) ● Each edge is directed from a parent to a child, so all nodes with edges pointing into a given node constitute its set of parents ● Each variable is associated with a value domain and a probability distribution conditional on its parents' values
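The definition on this slide implies the standard factorization of the joint distribution. As a reminder, writing $\mathrm{Pa}(X_i)$ for the parent set of $X_i$ (notation assumed here, not taken from the slide):

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor is exactly the per-variable conditional distribution mentioned in the last bullet.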
  8. 8. Back to our confused students ● Let us consider our model of confused students ● We can consider the model with an additional variable ● We need to have data on the additional variable for the model to be predictive ● Sometimes we need to use “wrong” models if they are predictive [Figure: network over "Paying attention", "Being confused" and "Correct answer", shown with the two conditional probability tables below]

     P(Being confused | Paying attention)
                     attention = yes   attention = no
     confused        80%               0%
     not confused    20%               100%

     P(Correct answer | Paying attention)
                     attention = yes   attention = no
     correct         50%               20%
     incorrect       50%               80%

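To see how the two tables on this slide can produce the "confused students score better" pattern, here is a minimal sketch that marginalizes out "Paying attention". The two conditional tables are the slide's numbers; the 50/50 prior on paying attention is an assumption made only for illustration, since the slides do not give one, and the function name is hypothetical.

```python
# Conditional probabilities copied from the slide.
P_CONFUSED = {"yes": 0.8, "no": 0.0}  # P(confused | paying attention)
P_CORRECT = {"yes": 0.5, "no": 0.2}   # P(correct  | paying attention)

# ASSUMPTION: prior on paying attention; not given in the talk.
P_ATTENTION = {"yes": 0.5, "no": 0.5}


def p_correct_given_confusion(confused: bool) -> float:
    """P(correct | confused) or P(correct | not confused), summing over attention."""
    joint_correct = 0.0  # P(correct, confusion state)
    marginal = 0.0       # P(confusion state)
    for attention, p_att in P_ATTENTION.items():
        p_conf = P_CONFUSED[attention] if confused else 1.0 - P_CONFUSED[attention]
        marginal += p_att * p_conf
        # Confusion and correctness are independent given attention in this model.
        joint_correct += p_att * p_conf * P_CORRECT[attention]
    return joint_correct / marginal


print(p_correct_given_confusion(True))   # confused students
print(p_correct_given_confusion(False))  # non-confused students
```

Under the assumed prior, confused students answer correctly more often than non-confused ones, even though confusion has no direct effect on the answer in this model.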
  9. 9. Can we find the “best” Bayesian Network? ● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parents' sets) ● In order to do it automatically we need a scoring function to define what we mean by “best” ● A score function is useful if it can be written as a sum over variables, i.e. the best network consists of best parent sets for variables (modulo acyclicity)
  10. 10. How to find the best network? ● There are generally three main approaches to defining BN scores: – Bayesian statistics, e.g. BDe (Herskovits et al. '95) – Information Theoretic, e.g. MDL (Lam et al. '94) – Hypothesis testing, e.g. MMPC (Salehi et al. '10) ● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing ● We have two issues: – There are exponentially many potential parent sets – The desired network needs to have no cycles ● The second issue is more important and makes the problem NP-complete (Chickering '96)
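The idea of a score that decomposes into per-variable terms can be sketched with a generic MDL-style score: the cost of a parent set is the number of bits needed to encode the child given its parents, plus a penalty for the size of the conditional table. This is an illustrative textbook-style formulation, not BNFinder's implementation; the function names and the exhaustive enumeration up to `max_parents` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mdl_score(rows, child, parents):
    """MDL-style description length in bits (lower is better)."""
    n = len(rows)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    context = Counter(tuple(r[p] for p in parents) for r in rows)
    # Data cost: negative log-likelihood of the child given its parents.
    data_bits = -sum(c * log2(c / context[pa]) for (pa, _), c in joint.items())
    # Model cost: (log2 n) / 2 bits per free parameter of the conditional table.
    child_arity = len({r[child] for r in rows})
    parent_configs = 1
    for p in parents:
        parent_configs *= len({r[p] for r in rows})
    model_bits = 0.5 * log2(n) * (child_arity - 1) * parent_configs
    return data_bits + model_bits


def best_parent_set(rows, child, candidates, max_parents=2):
    """Exhaustively score all parent sets up to max_parents; return (score, parents)."""
    scored = (
        (mdl_score(rows, child, ps), ps)
        for k in range(max_parents + 1)
        for ps in combinations(candidates, k)
    )
    return min(scored)


# Toy data: C copies A, while B is unrelated to C.
rows = [{"A": i % 2, "B": (i // 2) % 2, "C": i % 2} for i in range(40)]
score, parents = best_parent_set(rows, "C", ["A", "B"])
print(parents)
```

The number of candidate sets grows exponentially with the number of variables, which is the first issue listed on the slide; the `max_parents` cap is one simple way to keep the toy example tractable.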
  11. 11. Cycles are not always a problem ● Dynamic Bayesian Networks are a variant of BN models that describe temporal dependencies ● We can safely assume that the causal links only go forward in time ● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after” [Figure: a network over X1, X2, X3 unrolled into two copies of each variable, at time t and at time t+1]
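The "before" and "after" trick can be sketched as a data transformation: each pair of consecutive time points becomes one row with two columns per variable. The helper name and the `_t` / `_t+1` column suffixes below are hypothetical choices for illustration, not a BNFinder input format.

```python
def to_transition_rows(series):
    """Turn a time-ordered list of observations into (before, after) rows.

    series: list of dicts mapping variable name -> value, one dict per time point.
    """
    rows = []
    for before, after in zip(series, series[1:]):
        row = {f"{name}_t": value for name, value in before.items()}
        row.update({f"{name}_t+1": value for name, value in after.items()})
        rows.append(row)
    return rows


series = [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 0, "X3": 1},
]
for row in to_transition_rows(series):
    print(row)
```

If parents are only ever drawn from the `_t` columns and children from the `_t+1` columns, every edge points forward in time, so no choice of parent sets can create a cycle.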
  12. 12. Different types of variables ● Another common situation is when we have different types of variables ● We may know that only certain types of connections are causal ● Or we may be interested only in certain types of connections ● This breaks the cycles as well [Figure: three groups of variables: Mutations, Protein expression, Diseases]
  13. 13. BNFinder – Python library for Bayesian Networks ● A library for identification of optimal Bayesian Networks ● Works under the assumption that acyclicity is guaranteed by external constraints (disjoint sets of variables or dynamic networks) ● Fast and efficient (relatively)
  14. 14. Example 1 – the simplest possible
  15. 15. Now, parallelize! ● Since we have external constraints on acyclicity, we can search for parent sets independently ● This leads to a simple parallelization scheme and good efficiency
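The parallelization argument can be sketched as follows: when children and candidate parents come from disjoint sets, each child's search is an independent task that can be mapped over a worker pool. This is an illustration of the scheme, not BNFinder's code. The toy cost function, its fixed penalty of 4 bits per parent, and all function names are assumptions; a thread pool is used to keep the example self-contained, and a process pool would be the usual choice for CPU-bound scoring in CPython.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import log2


def cost(rows, child, parents):
    """Toy cost: bits to encode child given parents, plus an assumed penalty per parent."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    context = Counter(tuple(r[p] for p in parents) for r in rows)
    bits = -sum(c * log2(c / context[pa]) for (pa, _), c in joint.items())
    return bits + 4.0 * len(parents)


def best_parents(rows, child, candidates, max_parents=2):
    """Exhaustive search over small parent sets for a single child."""
    options = [ps for k in range(max_parents + 1) for ps in combinations(candidates, k)]
    return min(options, key=lambda ps: cost(rows, child, ps))


def learn_network(rows, children, candidates, workers=4):
    """Search each child's parent set as an independent task.

    Children and candidates must be disjoint, so the result is acyclic by construction.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: best_parents(rows, c, candidates), children)
        return dict(zip(children, results))


# Toy data: T1 copies R1 and T2 copies R2.
rows = [
    {"R1": i % 2, "R2": (i // 2) % 2, "T1": i % 2, "T2": (i // 2) % 2}
    for i in range(40)
]
print(learn_network(rows, ["T1", "T2"], ["R1", "R2"]))
```

No coordination between workers is needed because no combination of per-child results can introduce a cycle.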
  16. 16. Bonn et al. Nat. Genet, 2012
  17. 17. Active Inactive
  18. 18. Making the training set for “activity” variable
  19. 19. Handling continuous data
  20. 20. Network model
  21. 21. Does it provide useful predictions? • 12 positive and 4 negative predictions tested • >90% success (1 error)
  22. 22. Some more continuous data with perturbations
  23. 23. • 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks) • Activity data for ~140 enhancers divided into – 3 tissues (MESO, VM, SM) – 5 stages (4-6, 7-8, 9-10, 11-12, 13-16) • Gene expression data for 5082 genes from the BDGP database (Wilczynski et al., PLoS Comp. Biol. 2012)
  24. 24. Predictions validated: 19/20 correct stage, 10/20 correct tissue
  25. 25. Summary ● Bayesian Networks can provide predictive models based on conditional probability distributions ● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source! ● It can be used as a command-line tool or as a library ● It can use continuous data as well as discrete ● It can be run in parallel on multiple cores (with good efficiency) ● Convenience functions (cross-validation, ROC plots) are included ● http://launchpad.net/bnfinder
  26. 26. Thanks! ● Norbert Dojer ● Alina Frolova ● Paweł Bednarz ● Agnieszka Podsiadło ● Questions?