
Advances in Bayesian Learning




  1. Machine Learning in Performance Management. Irina Rish, IBM T.J. Watson Research Center, January 24, 2001
  2. Outline
    - Introduction
    - Machine learning applications in Performance Management
    - Bayesian learning tools: extending ABLE
    - Advancing theory
    - Summary and future directions
  3. Learning problems: examples. Pattern discovery, classification, diagnosis, and prediction.
    - System event mining: events from hosts over time
    - End-user transaction recognition: grouping Remote Procedure Calls (RPCs) into transactions (BUY? SELL? OPEN_DB? SEARCH?)
  4. Approach: Bayesian learning
    - Learn (probabilistic) dependency models: Bayesian networks, e.g., a network over variables S, C, B, X, D with factors P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)
    - Diagnosis: P(cause | symptom) = ?
    - Prediction: P(symptom | cause) = ?
    - Pattern classification: P(class | data) = ?
    - Numerous important applications: medicine, stock market, bio-informatics, eCommerce, military, ...
  5. Outline
    - Introduction
    - Machine-learning applications in Performance Management
        - Transaction Recognition
        - In progress: Event Mining, Probe Placement, etc.
    - Bayesian learning tools: extending ABLE
    - Advancing theory
    - Summary and future directions
  6. End-User Transaction Recognition: why is it important?
    - Setting: a client workstation issues End-User Transactions (EUTs), which reach the server (Web, DB, Lotus Notes) as Remote Procedure Calls (RPCs) within a session (connection); example RPCs: OpenDB, Search, SendMail
    - Why it matters:
        - Realistic workload models (for testing performance)
        - Resource management (anticipating requests)
        - Quantifying end-user perception of performance (response times)
    - Examples: Lotus Notes, Web/eBusiness (on-line stores, travel agencies, trading): database transactions, buy/sell, search, email, etc.
  7. Why is it hard? Why learn from data?
    Example: EUTs and RPCs in Lotus Notes. The EUT MoveMsgToFolder maps to the RPCs OPEN_COLLECTION, UPDATE_COLLECTION, DB_REPLINFO_GET, GET_MOD_NOTES, READ_ENTRIES; FindMailByKey maps to OPEN_COLLECTION, FIND_BY_KEY, READ_ENTRIES.
    - Many RPC and EUT types (92 RPCs and 37 EUTs)
    - Large (unlimited) data sets (10,000+ transaction instances)
    - Manual classification of a data subset took about a month
    - Non-deterministic and unknown EUT-to-RPC mapping:
        - "Noise" sources: client/server states
        - No client-side instrumentation: unknown EUT boundaries
  8. Our approach: classification + segmentation
    - Problem 1: label segmented data (classification), similar to text classification: given RPC sequences already segmented into transactions, assign each segment a transaction label
    - Problem 2: both segment and label (EUT recognition), similar to speech understanding and image segmentation: given an unsegmented RPC stream, recover segment boundaries and transaction labels (Tx1, Tx2, Tx3, ...) jointly
    (Slide figure: an unsegmented RPC stream, and the same stream after segmentation into labeled transactions.)
  9. How to represent transactions? "Feature vectors":
    - RPC counts
    - RPC occurrences
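The two representations above can be sketched as follows; the RPC vocabulary here is a small illustrative subset, not the full 92-RPC set from the trace:

```python
def count_features(rpcs, vocab):
    """Count-based feature vector: how many times each RPC type occurs."""
    return [rpcs.count(v) for v in vocab]

def occurrence_features(rpcs, vocab):
    """Binary occurrence vector: whether each RPC type occurs at all."""
    return [1 if v in rpcs else 0 for v in vocab]

vocab = ["OPEN_COLLECTION", "READ_ENTRIES", "FIND_BY_KEY"]  # illustrative subset
tx = ["OPEN_COLLECTION", "READ_ENTRIES", "READ_ENTRIES"]
print(count_features(tx, vocab))       # [1, 2, 0]
print(occurrence_features(tx, vocab))  # [1, 1, 0]
```

The count vector keeps multiplicity information; the occurrence vector discards it, which matches the Bernoulli-style models compared on the later results slides.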
  10. Classification scheme
    - Training phase: training data (RPCs labeled with EUTs) -> feature extraction -> learning -> classifier
    - Operation phase: "test" data (unlabeled RPCs) -> feature extraction -> classification -> EUTs
  11. Our classifier: naïve Bayes (NB)
    - Simplifying ("naïve") assumption: feature independence given the class: P(f_1, ..., f_n | class) = product over i of P(f_i | class)
    - Training: estimate the parameters P(class) and P(f_i | class) (e.g., ML-estimates)
    - Classification: given an (unlabeled) instance, choose the most likely class (Bayesian decision rule): argmax over classes c of P(c) * product over i of P(f_i | c)
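A minimal sketch of the training and decision rule above, using a multinomial model over RPC-count features. The add-one smoothing and the toy data are illustrative assumptions, not the slide's actual setup:

```python
import math
from collections import defaultdict

def train_nb(data):
    """Estimate P(class) and per-class feature probabilities
    (multinomial model over count vectors, with add-one smoothing)."""
    n_feats = len(data[0][0])
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: [0] * n_feats)
    for counts, label in data:
        class_count[label] += 1
        for i, c in enumerate(counts):
            feat_count[label][i] += c
    priors, thetas = {}, {}
    for label in class_count:
        priors[label] = class_count[label] / len(data)
        total = sum(feat_count[label]) + n_feats  # add-one smoothing
        thetas[label] = [(c + 1) / total for c in feat_count[label]]
    return priors, thetas

def classify_nb(priors, thetas, counts):
    """Bayesian decision rule: argmax_c log P(c) + sum_i n_i log P(f_i | c)."""
    def score(label):
        return math.log(priors[label]) + sum(
            n * math.log(p) for n, p in zip(counts, thetas[label]))
    return max(priors, key=score)

# Hypothetical EUTs described by counts of two RPC types:
data = [([3, 0], "OpenDB"), ([2, 1], "OpenDB"),
        ([0, 3], "Search"), ([1, 4], "Search")]
priors, thetas = train_nb(data)
print(classify_nb(priors, thetas, [4, 0]))  # -> OpenDB
```

Working in log space avoids underflow when transactions contain many RPCs.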
  12. Classification results on Lotus CoC data
    (Slide figure: accuracy vs. training-set size for NB with Bernoulli, multinomial, geometric, and shifted-geometric models, against the baseline.)
    - Significant improvement over the baseline classifier (75%), which always selects the most frequent transaction
    - NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM 85-87%, decision tree 90-92%
    - The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)
  13. Transaction recognition: segmentation + classification
    - Combine the naïve Bayes classifier with dynamic programming (Viterbi search), using a (recursive) DP equation to score the best segmentation-plus-labeling of each prefix of the RPC stream
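The slide's recursive DP equation is not reproduced in this transcript; the sketch below shows one standard form of the idea, assuming per-RPC emission probabilities for each transaction type and a per-segment prior (both hypothetical). best[j] is the score of the best segmentation of the first j RPCs:

```python
import math

def recognize(rpcs, log_prior, log_emit):
    """Jointly segment and label an RPC stream by dynamic programming:
    best[j] = max over split point i and label c of
              best[i] + log P(c) + sum of log P(rpc | c) over rpcs[i:j]."""
    n = len(rpcs)
    best = [0.0] + [-math.inf] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            for c in log_prior:
                s = best[i] + log_prior[c] + sum(log_emit[c][r] for r in rpcs[i:j])
                if s > best[j]:
                    best[j], back[j] = s, (i, c)
    # Backtrack to recover (start, end, label) segments.
    segments, j = [], n
    while j > 0:
        i, c = back[j]
        segments.append((i, j, c))
        j = i
    return segments[::-1]

# Hypothetical transaction types with different RPC profiles:
log_prior = {"TxA": math.log(0.5), "TxB": math.log(0.5)}
log_emit = {"TxA": {"a": math.log(0.9), "b": math.log(0.1)},
            "TxB": {"a": math.log(0.1), "b": math.log(0.9)}}
print(recognize(["a", "a", "b", "b"], log_prior, log_emit))
# -> [(0, 2, 'TxA'), (2, 4, 'TxB')]
```

The per-segment prior acts as a segmentation penalty, so the search does not fragment the stream into one-RPC transactions.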
  14. Transaction recognition results
    (Slide figure: recognition accuracy vs. training-set size.)
    - Good EUT recognition accuracy: 64% (a harder problem than classification!)
    - Reversed order of results: the best classifier is not necessarily the best recognizer! (?) Further research!

    Model | Classification | Segmentation
    Bernoulli | best | second best
    Multinomial | best | third best
    Geometric | best | fourth best
    Shift. geom. | worst | best
  15. EUT recognition: summary
    - A novel approach: learning EUTs from RPCs
    - Patent, conference paper (AAAI-2000), prototype system
    - Successful results on Lotus Notes data (Lotus CoC):
        - Classification: naïve Bayes (up to 87% accuracy)
        - EUT recognition: Viterbi + Bayes (up to 64% accuracy)
    - Work in progress:
        - Better feature selection (RPC subsequences?)
        - Selecting the "best classifier" for the segmentation task
        - Learning more sophisticated classifiers (Bayesian networks)
        - Information-theoretic approach to segmentation (MDL)
  16. Outline
    - Introduction
    - Machine-learning applications in Performance Management
        - Transaction Recognition
        - In progress: Event Mining, Probing Strategy, etc.
    - Bayesian learning tools: extending ABLE
    - Advancing theory
    - Summary and future directions
  17. Event Mining: analyzing system event sequences
    What is it? (Slide figure: events from hosts over time, in seconds.)
    - Example: USAA data: 858 hosts, 136 event types, 67,184 data points (13 days, by second)
    - Event examples:
        - High-severity: 'Cisco_Link_Down', 'chassisMinorAlarm_On', etc.
        - Low-severity: 'tcpConnectClose', 'duplicate_ip', etc.
    Why is it important?
    - Learning system behavior patterns for better performance management
    Why is it hard?
    - Large, complex systems (networks) with many dependencies; prior models are not always available
    - Many events and hosts; data sets are huge and constantly growing
  18. 1. Learning event dependency models
    - Current approach: learn dynamic probabilistic graphical models (temporal, or dynamic, Bayes nets) over events Event1, ..., EventN
    - Predict:
        - time to failure
        - event co-occurrence
        - existence of hidden nodes ("root causes")
    - Recognize sequences of high-level system states: an unsupervised version of the EUT recognition problem
    - Important issue: incremental learning from data streams
  19. 2. Clustering hosts by their history
    - Group hosts with similar event sequences, e.g., "problematic" hosts vs. "silent" hosts
    - What is an appropriate similarity ("distance") metric? One example: distance between "compressed" sequences, i.e., between their event distribution models
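One concrete choice for such a distance, offered here as an assumption rather than the slide's actual metric: compress each host's history into an event-type distribution and compare distributions with the Jensen-Shannon divergence:

```python
import math
from collections import Counter

def event_distribution(events):
    """Compress a host's event sequence into an event-type distribution."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two event distributions:
    symmetric, always finite, and 0 iff the distributions are identical."""
    def kl(a, m):
        return sum(pa * math.log(pa / m[e]) for e, pa in a.items() if pa > 0)
    support = set(p) | set(q)
    m = {e: 0.5 * (p.get(e, 0.0) + q.get(e, 0.0)) for e in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

h1 = event_distribution(["tcpConnectClose", "duplicate_ip", "tcpConnectClose"])
h2 = event_distribution(["Cisco_Link_Down", "chassisMinorAlarm_On"])
print(js_divergence(h1, h1))  # 0.0: identical histories
print(js_divergence(h1, h2))  # log 2: completely disjoint event types
```

Unlike plain KL divergence, this stays finite when one host emits event types the other never does, which is the common case across 858 hosts.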
  20. Probing strategy (EPP)
    (Slide figure: response time over time, with probes and availability violations marked.)
    - Objective: find the probe frequency F that minimizes
        - E(Tprobe - Tstart), for failure detection, or
        - E(total "failure" time - total "estimated" failure time), which gives an accurate performance estimate
    - Constraint on the additional load induced by probes: L(F) < MaxLoad
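A toy illustration of the trade-off above; the linear load model and the uniform failure-onset assumption are mine, not the slide's. With periodic probes at frequency F and a failure starting uniformly at random between two probes, the expected detection delay E(Tprobe - Tstart) is 1/(2F), so the best feasible choice is the largest F the load constraint allows:

```python
def best_probe_frequency(load_per_probe, max_load):
    """Pick the largest probe frequency F satisfying L(F) = load_per_probe * F
    <= MaxLoad (assuming load grows linearly in F), and report the expected
    detection delay 1/(2F) for a failure onset uniform between two probes."""
    f = max_load / load_per_probe  # boundary of the feasible region
    return f, 1.0 / (2.0 * f)

f, delay = best_probe_frequency(load_per_probe=0.1, max_load=1.0)
print(f, delay)  # 10.0 probes/sec, expected detection delay 0.05 sec
```

Under any load model that is increasing in F, the same logic applies: detection delay decreases in F, so the optimum sits at the load boundary.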
  21. Outline
    - Introduction
    - Machine-learning applications in Performance Management
    - Bayesian learning tools: extending ABLE
    - Advancing theory
    - Summary and future directions
  22. ABLE: Agent Building and Learning Environment
  23. What is ABLE? What is my contribution?
    - A Java toolbox for building reasoning and learning agents
        - Provides: visual environment, boolean and fuzzy rules, neural networks, genetic search
    - My contributions:
        - Naïve Bayes classifier (batch and incremental)
        - Discretization
        - Future releases: general Bayesian learning and inference tools
    - Available at:
        - AlphaWorks:
        - Project page:
  24. How does it work?
  25. Who is using the naïve Bayes tools? Impact on other IBM projects
    - Video character recognition (w/ C. Dorai):
        - Naïve Bayes: 84% accuracy
        - Better than SVM on some pairs of characters (average SVM: 87%)
        - Current work: combining naïve Bayes with SVMs
    - Environmental data analysis (w/ Yuan-Chi Chang):
        - Learning mortality rates using data on air pollutants
        - Naïve Bayes is currently being evaluated
    - Performance management:
        - Event mining: in progress
        - EUT recognition: successful results
  26. Outline
    - Introduction
    - Machine-learning in Performance Management
    - Bayesian learning tools: extending ABLE
    - Advancing theory
        - Analysis of the naïve Bayes classifier
        - Inference in Bayesian networks
    - Summary and future directions
  27. Why does naïve Bayes do well? And when?
    - When do the independence assumptions not hurt classification?
    - Class-conditional feature independence is an unrealistic assumption! But why/when does it work?
    - The Bayes-optimal classifier uses the true P(class | f); naïve Bayes uses the product estimate instead
    - Intuition: wrong probability estimates do not necessarily mean wrong classification!
  28. Case 1: functional dependencies
    - Lemma 1: naïve Bayes is optimal when features are functionally dependent given the class
    - Proof:
  29. Case 2: "almost-functional" (low-entropy) distributions
    - Lemma 2: naïve Bayes is a "good approximation" for "almost-functional" dependencies
    - Formally: if, for each i = 1, ..., n, some value a satisfies P(f_i = a) >= 1 - delta, then naïve Bayes approaches the Bayes-optimal classifier as delta -> 0
    - Related practical examples:
        - RPC occurrences in EUTs are often almost-deterministic (and NB does well)
        - Successful "local inference" in almost-deterministic Bayesian networks (turbo coding, "mini-buckets"; see Dechter & Rish 2000)
  30. Experimental results support theory
    - Random problem generator: uniform P(class); random P(f|class):
        1. A randomly selected entry in P(f|class) is assigned a large probability (close to 1, controlled by delta)
        2. The rest of the entries: uniform random sampling + normalization
    - Findings:
        1. Less "noise" (smaller delta) => NB closer to optimal
        2. Feature dependence does NOT correlate with NB error
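A sketch of the generator and of the first finding. The assumption that the selected entry receives probability exactly 1 - delta is my reading of the slide; with it, the naïve Bayes error (built from the true marginals) stays close to the Bayes-optimal error when delta is small:

```python
import itertools
import random

def make_problem(n_feats, delta, rng):
    """Uniform P(class); for each class, one random feature vector gets
    probability 1 - delta, the rest share delta via random normalization."""
    vecs = list(itertools.product([0, 1], repeat=n_feats))
    cond = {}
    for c in (0, 1):
        peak = rng.randrange(len(vecs))
        w = [rng.random() for _ in vecs]
        w[peak] = 0.0
        z = sum(w)
        cond[c] = {v: (1 - delta if i == peak else delta * w[i] / z)
                   for i, v in enumerate(vecs)}
    return vecs, cond

def errors(vecs, cond, n_feats):
    """Bayes-optimal error vs. naïve Bayes error (NB uses the true marginals)."""
    marg = {c: [sum(p for v, p in cond[c].items() if v[i] == 1)
                for i in range(n_feats)] for c in cond}
    def nb_score(c, v):
        s = 0.5  # uniform class prior
        for i, x in enumerate(v):
            s *= marg[c][i] if x == 1 else 1 - marg[c][i]
        return s
    opt = sum(min(0.5 * cond[c][v] for c in cond) for v in vecs)
    nb = sum(0.5 * cond[c][v] for v in vecs
             for c in cond if c != max(cond, key=lambda k: nb_score(k, v)))
    return opt, nb

rng = random.Random(0)
vecs, cond = make_problem(3, delta=0.01, rng=rng)
opt, nb = errors(vecs, cond, 3)
print(opt, nb)  # both small; the gap nb - opt shrinks with delta
```

Repeating this over many random problems and delta values reproduces the qualitative trend on the slide: the NB-to-optimal gap tracks delta, not the degree of feature dependence.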
  31. Outline
    - Introduction
    - Machine-learning in Performance Management
        - Transaction Recognition
        - Event Mining
    - Bayesian learning tools: extending ABLE
    - Advancing theory
        - Analysis of the naïve Bayes classifier
        - Inference in Bayesian networks
    - Summary and future directions
  32. From naïve Bayes to Bayesian networks
    - Naïve Bayes model: independent features given the class
    - Bayesian network (BN) model: any joint probability distribution, e.g., over Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D):
      P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
    - Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
    - Example CPD, P(D|C,B):

      C | B | D=0 | D=1
      0 | 0 | 0.1 | 0.9
      0 | 1 | 0.7 | 0.3
      1 | 0 | 0.8 | 0.2
      1 | 1 | 0.9 | 0.1
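The query above can be answered by brute-force enumeration over the factorization. Only P(D|C,B) comes from the slide; the other CPTs below are made-up illustrative numbers, so the resulting probability is just an example of the mechanics, not a claim about the original network:

```python
import itertools

# All variables binary (1 = yes). Only P(D=1|C,B) is from the slide's CPD;
# P(S), P(C|S), P(B|S), P(X|C,S) are assumed for illustration.
P_S1 = 0.3                                  # P(S=1)
P_C1_S = {0: 0.05, 1: 0.2}                  # P(C=1 | S)
P_B1_S = {0: 0.1, 1: 0.3}                   # P(B=1 | S)
P_X1_CS = {(0, 0): 0.05, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.9}  # P(X=1 | C,S)
P_D1_CB = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}   # P(D=1 | C,B)

def bern(p1, val):
    return p1 if val == 1 else 1 - p1

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (bern(P_S1, s) * bern(P_C1_S[s], c) * bern(P_B1_S[s], b)
            * bern(P_X1_CS[(c, s)], x) * bern(P_D1_CB[(c, b)], d))

def query_c_given(s, d):
    """P(C=1 | S=s, D=d) by summing out the hidden variables B and X."""
    num = sum(joint(s, 1, b, x, d)
              for b, x in itertools.product((0, 1), repeat=2))
    den = sum(joint(s, c, b, x, d)
              for c, b, x in itertools.product((0, 1), repeat=3))
    return num / den

print(query_c_given(s=0, d=1))  # P(cancer | no smoking, dyspnoea)
```

Enumeration is exponential in the number of hidden variables, which is exactly why the later slides turn to approximate schemes such as mini-buckets.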
  33. Example: printer troubleshooting (Microsoft Windows 95) [Heckerman, 95]
    A Bayesian network whose nodes include: Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate
  34. How to use Bayesian networks?
    - Diagnosis: P(cause | symptom) = ?
    - Prediction: P(symptom | cause) = ?
    - Classification: P(class | data) = ?
    - Decision-making (given a utility function): maximum expected utility (MEU)
    - The inference problems are NP-complete => approximate algorithms
    - Applications: medicine, stock market, bio-informatics, eCommerce, performance management, etc.
  35. Local approximation scheme: "mini-buckets" (paper submitted to JACM)
    - Idea: reduce the complexity of inference by ignoring some dependencies
    - Successfully used for approximating the Most Probable Explanation
    - Very efficient on real-life (medical, decoding) and synthetic problems
    - Less "noise" => higher approximation accuracy, similarly to naïve Bayes!
    - General theory needed: independence assumptions and "almost-deterministic" distributions
    - Potential impact: efficient inference in complex performance-management models (e.g., event mining, system dependence models)
  36. Summary
    - Theory and algorithms:
        - Analysis of naïve Bayes accuracy (Research Report)
        - Approximate Bayesian inference (submitted paper)
        - Patent on meta-learning
    - Machine-learning tools (alphaWorks):
        - Extending ABLE with a Bayesian classifier
        - Applying the classifier to other IBM projects: video character recognition, environmental data analysis
    - Performance management:
        - End-user transaction recognition (Lotus CoC): novel method, patent, paper; applied to Lotus Notes
        - In progress: event mining (USAA), probing strategies (EPP)
  37. Future directions
    Research interest: automated learning and inference, spanning practical problems, generic tools, and theory.
    - Practical problems (performance management):
        - Transaction recognition: better feature selection, segmentation
        - Event mining: Bayes net models, clustering
        - Web log analysis: segmentation/classification/clustering
        - Modeling system dependencies: Bayes nets
        - "Technology transfer": a generic approach to "event streams" (EUTs, ..., web page accesses)
    - Generic tools (ML library / ABLE):
        - Bayesian learning: general Bayes nets, temporal BNs, incremental learning
        - Bayesian inference: exact inference, approximations
        - Other tools: SVMs, decision trees; combined tools, meta-learning tools
    - Theory (analysis of algorithms):
        - Naïve Bayes accuracy: other distribution types
        - Accuracy of local inference approximations
        - Comparing model selection criteria (e.g., Bayes net learning)
        - Relative analysis and combination of classifiers (Bayes / max. margin / DT)
        - Incremental learning
  38. Collaborations
    - Transaction recognition: J. Hellerstein, T. Jayram (Watson)
    - Event mining: J. Hellerstein, R. Vilalta, S. Ma, C. Perng (Watson)
    - ABLE: J. Bigus, R. Vilalta (Watson)
    - Video character recognition: C. Dorai (Watson)
    - MDL approach to segmentation: B. Dom (Almaden)
    - Approximate inference in Bayes nets: R. Dechter (UCI)
    - Meta-learning: R. Vilalta (Watson)
    - Environmental data analysis: Y. Chang (Watson)
  39. Machine learning discussion group
    - Weekly seminars: 11:30-2:30 (w/ lunch) in 1S-F40
    - Active group members: Mark Brodie, Vittorio Castelli, Joe Hellerstein, Daniel Oblinger, Jayram Thathachar, Irina Rish (more people joined recently)
    - Agenda:
        - Discussions of recent ML papers and book chapters ("Pattern Classification" by Duda, Hart, and Stork, 2000)
        - Brain-storming sessions about particular ML topics
        - Recent discussions: accuracy of Bayesian classifiers (naïve Bayes)
    - Web site: