Upcoming SlideShare
×

# Jörg Stelzer

429 views
388 views

Published on

0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total views
429
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
2
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Jörg Stelzer

1. 1. Machine Learning with T MVA A ROOT based Tool for Multivariate Data Analysis DESY Computing Seminar Hamburg 14.1.2008 The TMVA developer team: Andreas H ö cker, Peter Speckmeyer, J ö rg Stelzer , Helge Voss
2. 2. General Event Classification Problem <ul><li>Event described by k variables (that are found to be discriminating)  (x i )   k </li></ul><ul><li>Events can be classified into n categories: H 1 … H n </li></ul><ul><li>General classifier: f:  k   , (x i )  {1,…,n} </li></ul><ul><li>T MVA: only n=2 </li></ul><ul><ul><li>Commonly the case in HEP (signal/background) </li></ul></ul><ul><li>Most classification methods f:  k   d , (x i )  (y i ) </li></ul><ul><ul><li>Further:  d   , (y i )  {1,…,n} </li></ul></ul><ul><ul><li>TMVA: d=1  y ≥ y sep : signal, y<y sep : background </li></ul></ul>Example: k=2, n=3 H 2 H 1 x 1 x 2 H 3
3. 3. General Event Classification Problem <ul><li>Example: k=2 variables x 1,2 , n=3 categories H 1 , H 2 , H 3 </li></ul><ul><li>The problem: How to draw the boundaries between H 1 , H 2 , and H 3 such that f( x ) returns the true nature of x with maximum correctness </li></ul>Simple example  I can do it by hand. H 2 H 1 x 1 x 2 H 3 Non-linear Boundaries ? H 2 H 1 x 1 x 2 H 3 Linear Boundaries ? H 2 H 1 x 1 x 2 H 3 Rectangular Cuts ?
4. 4. <ul><li>Large input variable space, complex correlations: manual optimization very difficult </li></ul><ul><li>2 general ways to build f(x): </li></ul><ul><li>Supervised learning: in an event sample the category of each event is known. Machine adapts to give the smallest misclassification error on training sample. </li></ul><ul><li>Unsupervised learning: the correct category of each event is unknown. Machinery tries to discover structures in the dataset </li></ul><ul><li>All classifiers in TMVA are supervised learning methods </li></ul>General Event Classification Problem <ul><ul><li>What is the optimal boundary f(x) to separate the categories </li></ul></ul><ul><ul><li>More pragmatic: Which classifier is best to find this optimal boundary (or estimates it closest) </li></ul></ul>Machine Learning
5. 5. Classification Problems in HEP <ul><li>In HEP mostly two class problems – signal (S) and background (B) </li></ul><ul><ul><li>Event level (Higgs searches, …) </li></ul></ul><ul><ul><li>Cone level (Tau-vs-jet reconstruction, …) </li></ul></ul><ul><ul><li>Track level (particle identification, …) </li></ul></ul><ul><ul><li>Lifetime and flavour tagging ( b -tagging, …) </li></ul></ul><ul><ul><li>... </li></ul></ul><ul><li>Input information </li></ul><ul><ul><li>Kinematic variables (masses, momenta, decay angles, …) </li></ul></ul><ul><ul><li>Event properties (jet/lepton multiplicity, sum of charges, …) </li></ul></ul><ul><ul><li>Event shape (sphericity, Fox-Wolfram moments, …) </li></ul></ul><ul><ul><li>Detector response (silicon hits, dE / dx , Cherenkov angle, shower profiles, muon hits, …) </li></ul></ul><ul><ul><li>… </li></ul></ul>
6. 6. Classifiers in T MVA
7. 7. Rectangular Cut Optimization <ul><li>Intuitive and simple: rectangular volumes in variable space </li></ul><ul><li>Technical challenge: cut optimization: </li></ul><ul><ul><li>MINUIT fit : (simplex) was found not to be reliable </li></ul></ul><ul><ul><li>Monte Carlo sampling : </li></ul></ul><ul><ul><ul><li>random scanning of parameter space </li></ul></ul></ul><ul><ul><ul><li>inefficient for large number of input variables </li></ul></ul></ul><ul><ul><li>Genetic algorithm : preferred method </li></ul></ul><ul><ul><ul><li>Samples of cut-sets (a population ) are evaluated, the fittest individuals are cross-bred (including mutation) to create a new generation </li></ul></ul></ul><ul><ul><li>The Genetic Algorithm can also be used as standalone optimizer, outside the TMVA framework </li></ul></ul><ul><ul><li>Simulated annealing : still need to optimize its performance </li></ul></ul><ul><ul><ul><li>Simulated slow cooling of metal, introduce temperature dependent perturbation probability to recover from local minima </li></ul></ul></ul><ul><li>Cuts usually benefit from prior decorrelation of cut variables </li></ul>
8. 8. Projective Likelihood Estimator (PDE) <ul><li>Probability density estimators for each input variable combined in likelihood estimator </li></ul><ul><li>Optimal MVA approach, if variables are uncorrelated </li></ul><ul><ul><li>In practice rarely the case, solution: de-correlate input or use different method </li></ul></ul><ul><li>Reference PDFs are automatically generated from training data: </li></ul><ul><ul><li>Histograms (counting), splines (order 2,3,5), or unbinned kernel estimator </li></ul></ul><ul><li>Output of likelihood estimator often strongly peaked at 0 and 1. To ease output parameterization TMVA applies inverse Fermi transformation. </li></ul>Reference PDF’s
9. 9. Estimating PDF Kernels <ul><li>Technical challenge: how to estimate the PDF shapes </li></ul><ul><ul><li>3 ways: </li></ul></ul><ul><li>We have chosen to implement nonparametric fitting in T MVA </li></ul><ul><ul><li>Binned shape interpolation using spline functions (orders: 1, 2, 3, 5) </li></ul></ul><ul><ul><li>Unbinned kernel density estimation (KDE) with Gaussian smearing </li></ul></ul><ul><ul><li>T MVA performs automatic validation of goodness-of-fit </li></ul></ul>original distribution is Gaussian Easy to automate, can create artefacts/suppress information Difficult to automate for arbitrary PDFs <ul><ul><li>parametric fitting (function) nonparametric fitting event counting </li></ul></ul>Automatic, unbiased, but suboptimal
10. 10. Multidimensional PDE <ul><li>Extension of the one-dimensional PDE approach to n dimensions </li></ul><ul><ul><li>Counts signal and background reference events (training sample) in the vicinity V of the test event </li></ul></ul><ul><li>Volume V definition: </li></ul><ul><ul><li>Size : fixed (defined by the data: % of Max-Min or RMS) or adaptive (define by number of events in search volume) </li></ul></ul><ul><ul><li>Shape : box or ellipsoid </li></ul></ul><ul><li>Improve y PDERS estimate within V by using various n -D kernel estimators (function of the (normalized) distance between test- and reference events) </li></ul><ul><li>Practical challenges: </li></ul><ul><ul><li>Need very large training sample ( curse of dimensionality of kernel based methods) </li></ul></ul><ul><ul><li>No training, slow evaluation. </li></ul></ul><ul><ul><ul><li>Search speed improvement with kd-tree event sorting </li></ul></ul></ul>test event Carli-Koblitz, NIM A501, 576 (2003) H 1 H 0 x 1 x 2
11. 11. Fisher’s Linear Discriminant Analysis <ul><li>Well-known, simple and elegant MVA method </li></ul><ul><ul><li>Fisher analysis determines an axis in the input variable hyperspace (F 1 ,…,F n , such that a projection of events onto this axis separates signal and background as much as possible </li></ul></ul><ul><li>Optimal for linearly correlated Gaussian variables with different S and B means </li></ul><ul><ul><li>Variable v with the same S and B sample mean  F v =0 </li></ul></ul>Projection: Fisher Coefficients: W: sum of S and B covariance matrices <ul><li>classifier: Function discriminant analysis (FDA) </li></ul><ul><li>Fit any user-defined function of input variables requiring that signal events return  1 and background  0 </li></ul><ul><ul><li>Parameter fitting: Genetics Alg., MINUIT, MC and combinations </li></ul></ul><ul><ul><li>Easy reproduction of Fisher result, but can add nonlinearities </li></ul></ul><ul><ul><li>Very transparent discriminator </li></ul></ul>New
12. 12. Artificial Neural Network (ANN) <ul><li>Multilayer perceptron: fully connected, feed forward, k hidden layers </li></ul><ul><li>ANNs are non-linear discriminants </li></ul><ul><ul><li>Non linearity from activation function. (Fisher is an ANN with linear activation function) </li></ul></ul><ul><li>Training: back-propagation method </li></ul><ul><ul><li>Randomly feed signal and background events to MLP and compare the desired output {0,1} with the received output (0,1): ε = d - r </li></ul></ul><ul><ul><li>Correct weights, depending on ε and learning rate η </li></ul></ul>1 i . . . N 1 input layer k hidden layers 1 ouput layer 1 j M 1 . . . . . . 1 . . . M k N var discriminating input variables . . . . . . 1 output variable Weierstrass theorem: MLP can approximate every continuous function to arbitrary precision with just one layer and infinite number of nodes y’ j Typical activation function A v’ j
13. 13. Boosted Decision Trees (BDT) <ul><li>A DT is a series of cuts that split sample set into ever smaller sets, leafs are assigned either S or B status </li></ul><ul><ul><li>Classifies events by following a sequence of cuts depending on the events variable content until a S or B leaf </li></ul></ul><ul><li>Growing </li></ul><ul><li>Each split try to maximizing gain in separation (Gini-index) </li></ul><ul><li>DT dimensionally robust and easy to understand but not powerful </li></ul><ul><li>1. Pruning </li></ul><ul><li>Bottom-up pruning of a decision tree </li></ul><ul><li>Protect from overtraining by removing statistically insignificant nodes </li></ul><ul><li>2. Boosting (Adaboost) </li></ul><ul><li>Increase the weight of incorrectly identified events  build new DT </li></ul><ul><li>Final classifier: ‘forest’ of DT’s linearly combined </li></ul><ul><li>Large coefficient for DT with small misclassification </li></ul><ul><li>Improved performance and stability </li></ul>BDT requires only little tuning to achieve good performance S,B S 1 ,B 1 S 2 ,B 2
14. 14. Predictive Learning via Rule Ensembles (RuleFit) <ul><li>Following RuleFit approach by Friedman-Popescu </li></ul><ul><li>Model is linear combination of rules, where a rule is a sequence of cuts defining a region in the input parameter space </li></ul><ul><li>The problem to solve is </li></ul><ul><ul><li>Create rule ensemble: use forest of decision trees either from a BDT, or from a random forest generator (TMVA) </li></ul></ul><ul><ul><li>Fit coefficients a m , b k , minimizing risk of misclassification (Friedman et al.) </li></ul></ul><ul><li>Pruning removes topologically equal rules” (same variables in cut sequence) </li></ul>Friedman-Popescu, Tech Rep, Stat. Dpt, Stanford U., 2003 rules (cut sequence  r m =1 if all cuts satisfied, =0 otherwise) normalised discriminating event variables RuleFit classifier Linear Fisher term Sum of rules
15. 15. Support Vector Machine <ul><li>Find hyperplane that between linearly separable signal and background (1962) </li></ul><ul><ul><li>Best separation: maximum distance (margin) between closest events ( support ) to hyperplane </li></ul></ul><ul><ul><li>Wrongly classified events add extra term to cost-function which is minimized </li></ul></ul><ul><li>Non-linear cases: </li></ul><ul><ul><li>Transform variables into higher dimensional space where again a linear boundary (hyperplane) can separate the data (only mid-’90) </li></ul></ul><ul><ul><li>Explicit transformation form not required, cost function depends on scalar product between events: use Kernel Functions to approximate scalar products between transformed vectors in the higher dimensional space </li></ul></ul><ul><ul><li>Choose Kernel and fit the hyperplane using the linear techniques developed above </li></ul></ul><ul><ul><li>Available Kernels: Gaussian, Polynomial, Sigmoid </li></ul></ul>x 1 x 2 margin support vectors Separable data optimal hyperplane Non-separable data  (x 1 ,x 2 ) x 1 x 2 x 1 x 3 x 1 x 2
16. 16. Data Preprocessing: Decorrelation <ul><li>Various classifiers perform sub-optimal in the presence of correlations between input variables (Cuts, Projective LH), others are slower (BDT, RuleFit) </li></ul><ul><li>Removal of linear correlations by rotating input variables </li></ul><ul><ul><li>Determine square-root C  of covariance matrix C , i . e ., C = C  C  </li></ul></ul><ul><ul><li>Transform original ( x i ) into decorrelated variable space ( x i  ) by: x  = C   1 x </li></ul></ul><ul><li>Also implemented Principal Component Analysis (PCA) </li></ul><ul><li>Note that decorrelation is only complete, if </li></ul><ul><ul><li>Correlations are linear </li></ul></ul><ul><ul><li>Input variables are Gaussian distributed </li></ul></ul><ul><ul><li>Not very accurate conjecture in general </li></ul></ul>original SQRT derorr. PCA derorr.
17. 17. Is there a best Classifier <ul><li>Performance </li></ul><ul><ul><li>In the presence/absence of linear/nonlinear correlations </li></ul></ul><ul><li>Speed </li></ul><ul><ul><li>Training / evaluation time </li></ul></ul><ul><li>Robustness, stability </li></ul><ul><ul><li>Sensitivity to overtraining, weak input variables </li></ul></ul><ul><ul><li>Size of training sample </li></ul></ul><ul><li>Dimensional scalability </li></ul><ul><ul><li>Do performance, speed, and robustness deteriorate with large dimensions </li></ul></ul><ul><li>Clarity </li></ul><ul><ul><li>Can the learning procedure/result be easily understood/visualized </li></ul></ul>
18. 18. No Single Best         SVM         Curse of dimensionality Classifiers Clarity Robust-ness Speed Perfor-mance Criteria        MLP        BDT        RuleFit        H-Matrix        Fisher    Weak input variables      /     PDERS/ k-NN Overtraining Response Training nonlinear correlations no / linear correlations      Cuts      Likeli-hood
19. 19. What is T MVA <ul><li>Motivation: </li></ul><ul><li>Classifiers perform very different depending on the data, all should be tested on a given problem </li></ul><ul><ul><li>Situation for many year: usually only a small number of classifiers were investigated by analysts </li></ul></ul><ul><ul><li>Needed a Tool that enables the analyst to simultaneously evaluate the performance of a large number of classifiers on his/her dataset </li></ul></ul><ul><li>Design Criteria: </li></ul><ul><li>Performance and Convenience (A good tool does not have to be difficult to use) </li></ul><ul><ul><li>Training , testing , and evaluation of many classifiers in parallel </li></ul></ul><ul><ul><li>Preprocessing of input data: decorrelation (PCA, Gaussianization) </li></ul></ul><ul><ul><li>Illustrative tools to compare performance of all classifiers (ranking of classifiers, ranking of input variable, choice of working point) </li></ul></ul><ul><ul><li>Actively protect against overtraining </li></ul></ul><ul><ul><li>Straight forward application to test data </li></ul></ul><ul><li>Special needs of high energy physics addressed </li></ul><ul><ul><li>Two classes, events weights, familiar terminology </li></ul></ul>
20. 20. <ul><li>A typical T MVA analysis consists of two main steps: </li></ul><ul><li>Training phase: training, testing and evaluation of classifiers using data samples with known signal and background composition </li></ul><ul><li>Application phase: using selected trained classifiers to classify unknown data samples </li></ul>Using T MVA
21. 21. Technical Aspects <ul><li>TMVA is open source, written in C++, and based on and part of ROOT </li></ul><ul><ul><li>Development on SourceForge, there is all the information http:// sf.tmva.net </li></ul></ul><ul><ul><li>Bundled with ROOT since 5.11-03 </li></ul></ul><ul><li>Training requires ROOT-environment, resulting classifiers also available as standalone C++ code </li></ul><ul><li>Six core developers, many contributors </li></ul><ul><ul><li>> 1400 downloads since Mar 2006 (not counting ROOT users) </li></ul></ul><ul><ul><li>Mailing list for reporting problems </li></ul></ul>Users Guide at http:// sf.tmva.net : 97p., classifier descriptions, code examples arXiv physics/0703039
22. 22. Training with T MVA <ul><li>User usually starts with template TMVAnalysis.C </li></ul><ul><ul><li>Choose training variables </li></ul></ul><ul><ul><li>Choose input data </li></ul></ul><ul><ul><li>Select classifiers (adjust training options – described in the manual by specifying option ‘H’) </li></ul></ul>Template TMVAnalysis.C (also .py) available at \$TMVA/macros/ and \$ROOTSYS/tmva/test/ T MVA GUI
23. 23. Evaluation Output <ul><li>Remark on overtraining </li></ul><ul><li>Occurs when classifier training becomes sensitive to the events of the particular training sample, rather then just to the generic features </li></ul><ul><li>Sensitivity to overtraining depends on classifier: e.g., Fisher insensitive, BDT very sensitive </li></ul><ul><li>Detect overtraining: compare performance between training and test sample </li></ul><ul><li>Counteract overtraining: e.g., smooth likelihood PDFs, prune decision trees, … </li></ul>Evaluation results ranked by best signal efficiency and purity (area) ------------------------------------------------------------------------------ MVA Signal efficiency at bkg eff. (error): | Sepa- Signifi- Methods: @B=0.01 @B=0.10 @B=0.30 Area | ration: cance: ------------------------------------------------------------------------------ Fisher : 0.268(03) 0.653(03) 0.873(02) 0.882 | 0.444 1.189 MLP : 0.266(03) 0.656(03) 0.873(02) 0.882 | 0.444 1.260 LikelihoodD : 0.259(03) 0.649(03) 0.871(02) 0.880 | 0.441 1.251 PDERS : 0.223(03) 0.628(03) 0.861(02) 0.870 | 0.417 1.192 RuleFit : 0.196(03) 0.607(03) 0.845(02) 0.859 | 0.390 1.092 HMatrix : 0.058(01) 0.622(03) 0.868(02) 0.855 | 0.410 1.093 BDT : 0.154(02) 0.594(04) 0.838(03) 0.852 | 0.380 1.099 CutsGA : 0.109(02) 1.000(00) 0.717(03) 0.784 | 0.000 0.000 Likelihood : 0.086(02) 0.387(03) 0.677(03) 0.757 | 0.199 0.682 ------------------------------------------------------------------------------ Testing efficiency compared to training efficiency (overtraining check) ------------------------------------------------------------------------------ MVA Signal efficiency: from test sample (from training sample) Methods: @B=0.01 @B=0.10 @B=0.30 ------------------------------------------------------------------------------ Fisher : 0.268 (0.275) 0.653 (0.658) 0.873 (0.873) MLP : 0.266 (0.278) 0.656 (0.658) 0.873 (0.873) LikelihoodD : 0.259 (0.273) 0.649 (0.657) 0.871 (0.872) PDERS : 0.223 (0.389) 0.628 (0.691) 0.861 (0.881) RuleFit : 0.196 (0.198) 0.607 (0.616) 0.845 (0.848) HMatrix : 0.058 (0.060) 0.622 (0.623) 0.868 (0.868) BDT : 0.154 (0.268) 0.594 (0.736) 0.838 (0.911) CutsGA : 0.109 (0.123) 1.000 (0.424) 0.717 (0.715) Likelihood : 0.086 (0.092) 0.387 (0.379) 0.677 (0.677) ----------------------------------------------------------------------------- Better classifier
24. 24. More Evaluation Output --- Fisher : Ranking result (top variable is best ranked) --- Fisher : ---------------------------------------------------------------- --- Fisher : Rank : Variable : Discr. power --- Fisher : ---------------------------------------------------------------- --- Fisher : 1 : var4 : 2.175e-01 --- Fisher : 2 : var3 : 1.718e-01 --- Fisher : 3 : var1 : 9.549e-02 --- Fisher : 4 : var2 : 2.841e-02 --- Fisher : ---------------------------------------------------------------- Better variable --- Factory : Inter-MVA overlap matrix (signal): --- Factory : ------------------------------ --- Factory : Likelihood Fisher --- Factory : Likelihood: +1.000 +0.667 --- Factory : Fisher: +0.667 +1.000 --- Factory : ------------------------------ Input Variable Ranking Classifier correlation and overlap <ul><li>how useful is a variable? </li></ul><ul><li>do classifiers perform the same separation into signal and background? If two classifiers have similar performance, but significant non-overlapping classifications  check if you can combine them! </li></ul>
25. 25. Graphical Evaluation <ul><li>Classifier output distributions for independent test sample: </li></ul>
26. 26. Graphical Evaluation <ul><li>There is no unique way to express the performance of a classifier  several benchmark quantities computed by T MVA </li></ul><ul><li>Signal eff. at various background effs. (= 1 – rejection) when cutting on classifier output </li></ul><ul><li>The Separation: </li></ul><ul><li>“ Rarity” implemented (background flat): </li></ul><ul><ul><li>Comparison of signal shapes between different classifiers </li></ul></ul><ul><ul><li>Quick check: background on data should be flat </li></ul></ul>
27. 27. Visualization Using the GUI <ul><li>Projective likelihood PDFs, MLP training, BDTs, … </li></ul>average no. of nodes before/after pruning: 4193 / 968
28. 28. Choosing a Working Point <ul><li>Depending on the problem the user might want to </li></ul><ul><ul><li>Achieve a certain signal purity, signal efficiency, or background reduction, or </li></ul></ul><ul><ul><li>Find the selection that results in the highest signal significance (depending on the expected signal and background statistics) </li></ul></ul><ul><li>Using the TMVA graphical output one can determine at which classifier output value he needs to cuts to separate signal from background </li></ul>
29. 29. Applying the Trained Classifier <ul><li>Use the TMVA::Reader class, example in TMVApplication.C: </li></ul><ul><ul><li>Set input variables </li></ul></ul><ul><ul><li>Book classifier with the weight file (contains all information) </li></ul></ul><ul><ul><li>Compute classifier response inside event loop  use it </li></ul></ul><ul><li>Also standalone C++ class without ROOT dependence </li></ul>Templates TMVApplication.C ClassAplication.C available at \$TMVA/macros/ and \$ROOTSYS/tmva/test/ std::vector<std::string> inputVars; … classifier = new ReadMLP ( inputVars ); for (int i=0; i<nEv; i++) { std::vector<double> inputVec = …; double retval = classifier->GetMvaValue( *inputVec ); } from ClassApplication.C
30. 30. Extending T MVA <ul><li>A user might have an own implementation of a multivariate classifier, or wants to use an external one </li></ul><ul><li>With ROOT 5.18.00 (16.Jan.08) user can seamlessly evaluate and compare his own classifier within TMVA: </li></ul><ul><li>Requirement: An own class must be derived from TMVA::MethodBase and must implement the TMVA::IMethod interface </li></ul><ul><li>The class must be added to the factory via ROOT’s plugin mechanism </li></ul><ul><li>Training, testing, evaluation, and comparison can then be done as usual, </li></ul><ul><li>Example in TMVAnalysis.C </li></ul>
31. 31. A Word on Treatment of Systematics? <ul><li>Some things could be done: </li></ul><ul><li>Example: var4 may in reality have a shifted central value and hence a worse discrimination power </li></ul><ul><li>One can: ignore the systematic in the training </li></ul><ul><ul><li>var4 appears stronger in training than it might be </li></ul></ul><ul><ul><li>suboptimal performance (bad training, not wrong) </li></ul></ul><ul><ul><li>Classifier response will strongly depend on “var4”, and hence will have a larger systematic uncertainty </li></ul></ul><ul><li>Better: Train with shifted (weakened) var4 </li></ul><ul><li>Then evaluate systematic error on classifier output </li></ul><ul><li>There is no principle difference in systematics evaluation between single discriminating variables and MV classifiers </li></ul><ul><li>Control sample to estimate uncertainty on classifier output (not necessarily for each input variable) </li></ul><ul><ul><li>Advantage: correlations automatically taken into account </li></ul></ul>
32. 32. Conclusion Remarks <ul><li>Multivariate classifiers are no black boxes, we just need to understand them </li></ul><ul><ul><li>Cuts and Likelihood are transparent  if they perform use them </li></ul></ul><ul><ul><li>In presence of correlations other classifiers are better </li></ul></ul><ul><ul><ul><li>Difficult to understand at any rate </li></ul></ul></ul><ul><li>Enormous acceptance growth in recent decade in HEP </li></ul><ul><li>TMVA provides means to train, evaluate, compare, and apply different classifiers </li></ul><ul><li>TMVA also tries – through visualization – improve the understanding of the internals of each classifier </li></ul>Acknowledgments : The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users to whom we are indebted. We thank in particular the CERN Summer students Matt Jachowski (Stanford) for the implementation of TMVA's new MLP neural network, Yair Mahalalel (Tel Aviv) for a significant improvement of PDERS, and Or Cohen for the development of the general classifier boosting, the Krakow student Andrzej Zemla and his supervisor Marcin Wolter for programming a powerful Support Vector Machine, as well as Rustem Ospanov for the development of a fast k-NN algorithm. We are grateful to Doug Applegate, Kregg Arms, Ren é Brun and the ROOT team, Tancredi Carli, Zhiyi Liu, Elzbieta Richter-Was, Vincent Tisserand and Alexei Volk for helpful conversations.
33. 33. Outlook <ul><li>Primary development from this Summer: Generalized classifiers </li></ul><ul><li>Be able to boost or bag any classifier </li></ul><ul><li>Combine any classifier with any other classifier using any combination of input variables in any phase space region </li></ul><ul><li>1. is ready – now in testing mode. To be deployed after upcoming ROOT release. </li></ul>
34. 34. A Few Toy Examples
35. 35. Checker Board Example <ul><li>Performance achieved without parameter tuning: PDERS and BDT best “out of the box” classifiers </li></ul><ul><li>After specific tuning, also SVM und MLP perform well </li></ul>Theoretical maximum
36. 36. Linear-, Cross-, Circular Correlations <ul><li>Illustrate the behavior of linear and nonlinear classifiers </li></ul>Linear correlations (same for signal and background) Linear correlations (opposite for signal and background) Circular correlations (same for signal and background)
37. 37. Linear-, Cross-, Circular Correlations <ul><li>Plot test-events weighted by classifier output (red: signal-like, blue: background-like) </li></ul>Linear correlations (same for signal and background) Cross-linear correlations (opposite for signal and background) Circular correlations (same for signal and background) Likelihood Likelihood - D PDERS Fisher MLP BDT
38. 38. Final Performance <ul><li>Background rejection versus signal efficiency curve: </li></ul>Linear Example Cross Example Circular Example