Slides from Hadoop Summit 2014 - Bayesian Networks with R and Hadoop

Set of nodes

Arrows between the nodes (graph)

Conditional probability table: P(X | parents) for each node


- 1. © Hortonworks Inc. 2014 Hortonworks Bayesian Networks with R and Hadoop Hadoop Summit, June 2014 Ofer Mendelevitch
- 2. A bit about me: Ofer Mendelevitch, Director, Data Science @ Hortonworks. Previously: Nor1, Yahoo!, Risk Insight, Quiver. Personal blog: www.achessdad.com
- 3. What I will cover today…
  - What is a Bayesian Network?
  - Why I think it’s cool
  - Bayesian networks with R: the bnlearn package
  - Bayes Networks Inference with R and Hadoop
- 4. Introduction to Bayesian Networks (with examples using R)
- 5. Example: “Asia” Bayesian Network. Each node is a yes/no random variable: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath.
- 6. Example: “Asia” Bayesian Network. The graph structure reflects “causal” relationships among the same eight nodes.
- 7. Example: “Asia” Bayesian Network. Each node has a CPT: P(node | parents). CPT for Shortness of breath (SoB):

  | Tub or Cancer | Bronchitis | P(SoB = T) | P(SoB = F) |
  |---------------|------------|------------|------------|
  | T             | T          | 0.7        | 0.3        |
  | F             | T          | 0.4        | 0.6        |
  | T             | F          | 0.45       | 0.55       |
  | F             | F          | 0.05       | 0.95       |
- 8. What is a (discrete) Bayesian Network? (also called Bayes Nets, Belief Nets, etc)
  - A network structure (DAG):
    - Nodes => random variables, taking discrete values
    - Edges => conditional dependencies
    - E.g., lung cancer is statistically dependent on smoking
  - A set of conditional probability tables (CPTs):
    - Each node has a set of parents, determined by the graph
    - CPT holds P(node | parent-A, parent-B, …) for each node
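To make the graph-plus-CPT definition concrete, here is a minimal base-R sketch of a two-node network, Smoking → LungCancer. The probabilities are made-up illustration values, not figures from the talk or from the Asia dataset. It shows that the joint distribution is the product of each node's CPT entry, and that either node can then be queried from that joint.

```r
# Hypothetical two-node network: Smoking -> LungCancer.
# All probabilities below are illustrative assumptions.
p_smoking <- c(no = 0.7, yes = 0.3)                  # P(Smoking)
p_lc_given_smoking <- matrix(                        # P(LungCancer | Smoking)
  c(0.99, 0.01,    # Smoking = no:  P(LC = no), P(LC = yes)
    0.90, 0.10),   # Smoking = yes: P(LC = no), P(LC = yes)
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking = c("no", "yes"), LungCancer = c("no", "yes")))

# Joint: P(Smoking, LC) = P(Smoking) * P(LC | Smoking).
# Rows are Smoking states, so each row is scaled by its prior.
joint <- p_lc_given_smoking * p_smoking

# Query the parent given the child: P(Smoking | LungCancer = yes).
p_lc <- colSums(joint)                       # marginal P(LungCancer)
posterior <- joint[, "yes"] / p_lc[["yes"]]
print(round(posterior, 3))
```

The same recipe (multiply CPT entries, sum out what you do not care about, normalise) is what exact inference does on larger graphs, only organised more efficiently.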
- 9. Why are Bayesian Networks cool?
  - Intuitive/adaptive modeling tool:
    - Graphs are natural for modeling relationships
    - Easy to combine data-driven learning with expert know-how
    - You can start small, and add knowledge as it is acquired
  - “Naturally” addresses inference with missing values
  - Inference can be applied to any variable/node
    - As opposed to a single (target) variable in supervised learning
- 10. Bayesian networks have been successfully used for a variety of real-world applications
  - Healthcare: medical diagnosis, genetic modeling
  - Security: crime pattern analysis, terrorism risk management
  - Education: student modeling
  - Finance: credit rating, predicting defaults
  - Tech support: troubleshooting for computers/printers
  - See “Bayesian networks: a practical guide to applications”, Pourret et al
- 11. Bayesian networks with R
  - http://cran.r-project.org/web/views/Bayesian.html
  - We will focus on “bnlearn” (by Marco Scutari)
    - Implements various structure learning algorithms (hc, tabu, gs, iamb, mmhc, rsmax2, etc)
    - Provides automated learning of CPT
    - Approximate inference: “logic sampling” and “likelihood weighting”
    - Supports snow/parallel for some algorithms
- 12. Step 1: Constructing the graph (Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath)
  - Manually (expert knowledge)
  - Automatically from data
- 13. Manual graph construction: Asia

      library(bnlearn)
      varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
      ag = empty.graph(varnames)
      # ignore.cycles is the argument name used on the original slide;
      # it may differ in newer bnlearn releases
      arcs(ag, ignore.cycles=T) = data.frame(
        "from"=c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
        "to"=c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
      graphviz.plot(ag)

- 14. Automated graph construction: Asia

      library(bnlearn)
      varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
      data(asia); names(asia) = varnames   # rename the dataset's columns
      bg = hc(asia)                        # hill-climbing structure search
      graphviz.plot(bg)
- 15. Automated learning does not always work perfectly…
  - For example:
    - May not learn all the “expected” edges
    - May learn an edge in the wrong direction
  - Therefore, in practice it helps to:
    - Provide a whitelist and a blacklist to the algorithm
    - Pre-seed with a manual network structure, and let the algorithm learn from there
    - Use ensemble learning of structure (see boot.strength)
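The whitelist, blacklist and bootstrap ideas above can be sketched with bnlearn as follows. This is an illustration rather than the talk's own code: it uses the `asia` dataset with its original short column names (A, S, T, L, B, E, X, D), and the chosen arcs, the number of bootstrap replicates and the 0.85 threshold are arbitrary assumptions.

```r
library(bnlearn)

data(asia)   # columns: A, S, T, L, B, E, X, D (original short names)
set.seed(42)

# Force the arc Smoking -> Lung cancer and forbid Dyspnoea -> Bronchitis.
wl <- data.frame(from = "S", to = "L", stringsAsFactors = FALSE)
bl <- data.frame(from = "D", to = "B", stringsAsFactors = FALSE)
bg <- hc(asia, whitelist = wl, blacklist = bl)
print(arcs(bg))

# Ensemble view: repeat the structure search on bootstrap samples and
# measure how often each arc (and each direction) shows up.
strength <- boot.strength(asia, R = 50, algorithm = "hc")
print(strength[strength$strength > 0.85 & strength$direction >= 0.5, ])

# Keep only the arcs that clear the (illustrative) threshold.
avg <- averaged.network(strength, threshold = 0.85)
```

A whitelisted arc is always present in the result and a blacklisted arc never is, so these two tables are a direct way to inject expert knowledge into the search.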
- 16. Step 2: Learning the CPT / probabilities. CPT for Shortness of breath (SoB):

  | Tub or Cancer | Bronchitis | P(SoB = T) | P(SoB = F) |
  |---------------|------------|------------|------------|
  | T             | T          | 0.85       | 0.15       |
  | F             | T          | 0.79       | 0.21       |
  | T             | F          | 0.73       | 0.27       |
  | F             | F          | 0.1        | 0.9        |

- 17. Learning CPT for each node in the graph

      fitted = bn.fit(ag, asia)
      print(fitted$SoB)

      Parameters of node SoB (multinomial distribution)

      Conditional probability table:

      , , Tub-or-LC = no

           Bronchitis
      SoB           no        yes
        no  0.90017286 0.21373057
        yes 0.09982714 0.78626943

      , , Tub-or-LC = yes

           Bronchitis
      SoB           no        yes
        no  0.27737226 0.14592275
        yes 0.72262774 0.85407725
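With its default settings, `bn.fit` estimates a discrete CPT by maximum likelihood, which amounts to conditional relative frequencies: count each child state within each parent configuration and normalise. The base-R sketch below reproduces that counting on a tiny made-up dataset with one parent; the eight rows are invented for illustration.

```r
# Tiny made-up dataset: one parent (Bronchitis) and one child (SoB).
df <- data.frame(
  Bronchitis = c("yes", "yes", "yes", "no", "no", "no",  "no", "yes"),
  SoB        = c("yes", "yes", "no",  "no", "no", "yes", "no", "yes"))

# Count co-occurrences, then normalise within each parent state
# so that every column sums to 1: P(SoB | Bronchitis).
counts <- table(SoB = df$SoB, Bronchitis = df$Bronchitis)
cpt <- prop.table(counts, margin = 2)
print(cpt)
```

With two parents, as for SoB in the Asia network, the table gains a third dimension and the normalisation runs over both parent margins.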
- 18. Using the BN for inference
  - Given evidence: (1) visit to Asia, (2) SoB, (3) Bronchitis
  - What is the likelihood of “lung cancer”?
- 19. Inferring with missing values
  - We provide evidence (“yes” or “no” in this case) only for those nodes where we have such evidence
  - If a value is “missing” it’s just not included in the evidence when doing inference…
  - This is in contrast to supervised learning, where ALL values are typically needed for inference.
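One way to picture "missing values are simply left out of the evidence" is a small helper that turns one observation into a named evidence list and drops the unknown fields. The helper name `make_evidence` and the `"?"` placeholder convention are assumptions made for this sketch, not part of bnlearn or of the talk's code.

```r
# Hypothetical helper: build an evidence list from one observation,
# skipping fields whose value is missing (NA or a "?" placeholder).
make_evidence <- function(obs, missing = "?") {
  keep <- !is.na(obs) & obs != missing
  as.list(obs[keep])
}

obs <- c(Asia = "yes", Smoking = "?", Bronchitis = "yes", SoB = NA)
evidence <- make_evidence(obs)
str(evidence)   # only Asia and Bronchitis remain
```

The resulting named list has the same shape as the `evidence` argument used with likelihood weighting later in the deck.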
- 20. Exact Inference with gRain
  - The gRain package implements exact inference for discrete Bayesian Networks using the “Junction Tree” belief propagation algorithm
  - bnlearn and gRain cooperate nicely

        library(gRain)
        jtree = compile(as.grain(fitted))
        jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"), states = c("yes", "yes", "yes"))
        print(querygrain(jp, nodes="LC")$LC)

        LC
           no   yes
        0.934 0.066
- 21. Approximate inference with bnlearn. bnlearn implements approximate inference: logic sampling (aka rejection sampling) and likelihood weighting.

      # Infer probability P(SoB | Asia, Bronchitis) using logic sampling
      p1 = cpquery(fitted, event = (SoB == 'yes'), evidence = (Asia == 'yes' & Bronchitis == 'yes'), method="ls")
      print(p1)
      [1] 0.8014706

      # Infer probability P(SoB | Asia, Bronchitis) using likelihood weighting
      evidence = list("yes", "yes")
      names(evidence) = c("Asia", "Bronchitis")
      p2 = cpquery(fitted, event = (SoB == 'yes'), evidence = evidence, method="lw")
      print(p2)
      [1] 0.795404
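To show what the two sampling schemes actually do, here is a hand-rolled base-R sketch on a hypothetical two-node network, Smoking → LungCancer, with made-up probabilities. It is not bnlearn's implementation, only the textbook idea behind each method, compared against the exact answer from Bayes' rule.

```r
set.seed(1)
n <- 100000

# Hypothetical network Smoking -> LungCancer (illustrative numbers).
p_smoking_yes <- 0.3
p_lc_yes <- c(no = 0.01, yes = 0.10)   # P(LC = yes | Smoking)

# Exact P(Smoking = yes | LC = yes) by Bayes' rule, for reference.
exact <- (p_smoking_yes * p_lc_yes[["yes"]]) /
  (p_smoking_yes * p_lc_yes[["yes"]] + (1 - p_smoking_yes) * p_lc_yes[["no"]])

# Logic (rejection) sampling: sample every node, then discard the
# samples that contradict the evidence LC = yes.
smoking <- ifelse(runif(n) < p_smoking_yes, "yes", "no")
lc <- runif(n) < p_lc_yes[smoking]
ls_estimate <- mean(smoking[lc] == "yes")

# Likelihood weighting: sample only the non-evidence node and weight
# each sample by the probability of the evidence given that sample.
smoking_lw <- ifelse(runif(n) < p_smoking_yes, "yes", "no")
w <- p_lc_yes[smoking_lw]
lw_estimate <- sum(w[smoking_lw == "yes"]) / sum(w)

cat("exact:", exact, " ls:", ls_estimate, " lw:", lw_estimate, "\n")
cat("logic sampling kept", sum(lc), "of", n, "samples\n")
```

Because the evidence here is rare, logic sampling throws away most of its samples, whereas likelihood weighting uses all of them. That difference is the usual reason to prefer likelihood weighting when the evidence is unlikely.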
- 22. Large scale Bayes Networks Inference with R and Hadoop
- 23. What is large?
  - Number of nodes:
    - 10s: Medium
    - 100s: Large
    - 1000s: Very large
  - Number of instances:
    - 100,000s to millions
- 24. Manually constructing large graphs is hard
- 25. Large scale learning in practice: manual + automated
  - Define nodes
  - Seed with some known edges, based on expert knowledge
  - Augment with automated learning (e.g., hc, tabu, rsmax2, etc)
- 26. Large scale inference: Exact or Approximate?

  | Method                         | Pros                                                    | Cons                                                                      |
  |--------------------------------|---------------------------------------------------------|---------------------------------------------------------------------------|
  | Exact (Jtree), gRain           | Fast inference time                                     | Computational complexity determined (exponentially) by largest clique size |
  | Approximate (LS, LW), bnlearn  | Can be used for any graph; not limited by “clique” size | Inference is often much slower; not accurate for rare events               |
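The "exponential in the largest clique" cost of exact inference can be made tangible with a one-line calculation: a junction-tree clique needs one table entry per joint configuration of its variables, so the table size is the product of their state counts. The function name `clique_table_size` is a hypothetical helper for this sketch.

```r
# Number of entries in a clique table = product of the state counts
# of the variables in that clique.
clique_table_size <- function(n_states) prod(n_states)

small <- clique_table_size(rep(2, 3))    # clique of 3 binary variables
large <- clique_table_size(rep(2, 20))   # clique of 20 binary variables
cat("3 binary variables: ", small, "entries\n")
cat("20 binary variables:", large, "entries\n")
```

Each extra binary variable in the clique doubles the table, which is why exact inference is fast on sparse graphs but becomes impractical once the largest clique grows.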
- 27. About RHadoop/RMR
  - An open source project, supported by Revolution Analytics
  - Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc
  - We will focus on RMR
    - Implement mapper/reducer code using R
  - RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
  - Installing RMR on HDP:
    - http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
    - http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/
- 28. Large scale inference with R and Hadoop
  - Inputs: the BN model and an instances file, which Hadoop splits into chunks (Chunk 1, Chunk 2, …, Chunk N)
  - Infer with RMR on the Hadoop cluster: each chunk goes to a no-op mapper, and the reducers run CPQuery and write the results
  - Inference is embarrassingly parallel
  - Hadoop determines the number of mappers based on file size, so we’ll use reducers to parallelize CPQuery
- 29. Example: Adult dataset
  - Donated by Ronny Kohavi and Barry Becker, 1996 - http://archive.ics.uci.edu/ml/datasets/Adult
  - Extracted from 1994 census data
  - 48842 instances, 14 features such as:
    - Age, country, occupation, marital status, capital gain, etc
    - Goal: predict if income is >50K or not

        …
        53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
        28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
        37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
        49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
        52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
        …

- 30. Sample learned network structure for “adult”
- 31. Inference with RMR on adult dataset

      library(rmr2)
      NUM_REDUCERS = 4
      opt = rmr.options(backend = "hadoop",
                        backend.parameters = list(hadoop = list(
                          D = "mapreduce.reduce.memory.mb=1024",
                          D = paste0("mapreduce.job.reduces=", NUM_REDUCERS))))
      inpFile = 'adult.test'
      outFile = 'adult.out'
      mapreduce(input = inpFile, input.format = "text",
                output = outFile, output.format = "csv",
                map = map_func, reduce = reduce_func)

- 32. Our mapper: passing on to reducer…

      map_func <- function(., values) {
        out_klist = list(); out_vlist = list()
        for (v in values) {
          fvec = unlist(strsplit(v, ',', fixed=T))   # read row and split into columns
          if (length(fvec) < 15) { next }            # skip rows not in the expected format
          key = floor(runif(1, 0, NUM_REDUCERS))     # random key spreads rows across reducers
          out_klist = c(out_klist, key)
          out_vlist = c(out_vlist, v)
        }
        return (keyval(out_klist, out_vlist))
      }
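The mapper's random key is what spreads the rows evenly over the reducers. The base-R sketch below applies the same `floor(runif(...))` expression to a batch of rows at once, outside Hadoop, to show the resulting split; the row count of 10000 is an arbitrary illustration value.

```r
set.seed(7)
NUM_REDUCERS <- 4
n_rows <- 10000   # illustrative number of input rows

# Same key expression as the mapper, vectorised over all rows:
# uniform draws on [0, NUM_REDUCERS) floored to 0, 1, ..., NUM_REDUCERS - 1.
keys <- floor(runif(n_rows, 0, NUM_REDUCERS))
per_reducer <- table(keys)
print(per_reducer)
```

Each key value receives roughly a quarter of the rows, so every reducer gets a similar amount of `cpquery` work regardless of how the input file was chunked for the mappers.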
- 33. Our reducer: where all the action happens

      trim <- function (x) gsub("^\\s+|\\s+$", "", x)   # strip leading/trailing whitespace
      reduce_func <- function(., values) {
        out_klist = list(); out_vlist = list()
        for (v in values) {
          increment.counter('bn-demo', 'row', 1)   # to let MR know we are still active
          fvec = sapply(strsplit(v, ',', fixed=T), trim)   # read row and split into columns
          names(fvec) = c("age", "type_employer", "fnlwgt", "education", "education_num",
                          "marital", "occupation", "relationship", "race", "sex",
                          "capital_gain", "capital_loss", "hr_per_week", "country", "income")
          pv = dataprep(fvec)   # transform to “learned” features
          evidence = as.list(pv[1, setdiff(colnames(pv), 'income')])
          prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method="lw")
          out_klist = c(out_klist, v)
          out_vlist = c(out_vlist, format(prob, digits=2))
        }
        return (keyval(out_klist, out_vlist))
      }
- 34. Example output: adult.out

      26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K.,0.37
      52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K.,0.14
      36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.,0.019
      34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K.,0.14
      59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K.,0.074
      26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K.,0.03
      50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
      56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
      22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K.,0.21
      45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.94
- 35. More information
  - Detailed step-by-step guide and code used can be found on: https://github.com/ofermend/bayes-net-r-hadoop
  - Download Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
  - Further reading/learning:
    - http://www.bnlearn.com/
    - PGM class on Coursera: https://www.coursera.org/course/pgm
    - PGM Ebook from UCL: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
    - Many others…
- 36. Thank you! Any Questions? Ofer Mendelevitch, ofer@hortonworks.com, @ofermend
  - We’re hiring! www.hortonworks.com/careers
  - Hortonworks training: www.hortonworks.com/training
  - Hortonworks blog: www.hortonworks.com/blog
