
Machine Learning for Finance


Published in: Data & Analytics


  1. Machine Learning in Finance (Stefan Duprey)
  2. Statistical learning scope. Data Mining:
     - Exploration: univariate (pie chart, histogram, etc.); multivariate (feature selection and transformation).
     - Modelling: clustering (partitive: K-means, Gaussian mixture model, SOM; hierarchical); classification (discriminant, decision tree, neural network, support vector machine); regression.
  3. Classifier for Credit Scoring
  4. Decision rule for Support Vector Machines
  5. A quadratic optimization problem! (the standard formulation is recalled below)
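     For reference, the SVM training problem usually meant here is the textbook soft-margin quadratic program below (standard notation, not copied from the slide):

         \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
         \quad\text{subject to}\quad y_i\,(w^\top x_i + b) \ \ge\ 1-\xi_i, \qquad \xi_i \ge 0,

     a convex quadratic objective under linear constraints, hence no local minima.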
  6. SVM non-linear case
  7. SVM summary: avoids the plague of local minima; the engineer's expertise lies in choosing the appropriate kernel (beware of overfitting: cross-validate and experiment with your own kernels); only classifies between two classes (use a one-vs-all or one-vs-one methodology, sketched below); a reference method for use cases in computer vision and bioinformatics.
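     A minimal one-vs-all sketch built on the svmtrain/svmclassify calls used later in the deck (illustrative only; x, y and x_new are assumed to hold training predictors, numeric class labels and new observations):

         % Train one binary SVM per class, then let each vote on the new points.
         labels = unique(y);
         models = cell(numel(labels),1);
         for c = 1:numel(labels)
             yc = 2*double(y == labels(c)) - 1;                % +1 for class c, -1 otherwise
             models{c} = svmtrain(x, yc, 'kernel_function', 'rbf');
         end
         y_pred = zeros(size(x_new,1),1);                      % assumes class labels are nonzero
         for c = 1:numel(labels)
             vote = svmclassify(models{c}, x_new);             % returns +1 / -1
             y_pred(vote == 1 & y_pred == 0) = labels(c);
         end
         % NB: a full one-vs-all scheme compares decision values; svmclassify only
         % returns hard labels, so ties are resolved here by class order.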
  8. Neural Networks: what are they?
  9. Neural Network summary. Gradient descent algorithms: stochastic, mini-batch, conjugate. Plague of local minima: difficult to calibrate. The engineer's expertise lies in choosing the appropriate architecture (beware of overfitting: cross-validate and experiment with your own architectures, 'deeper learning').
  10. Regression Trees >> t = classregtree(X,Y); >> Y_pred = t(X_new);
  11. Forests of Trees (figure: table of up/down predictors with response Y) >> t = TreeBagger(nb_trees,X,Y); >> [Y_pred,allpred] = predict(t,X_new);
  12. Splitting criterion: information gain (definition recalled below)
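     The usual entropy-based definition (standard textbook formula, not reproduced from the slide): for a split of node S into children S_k,

         IG(S) \;=\; H(S) \;-\; \sum_{k}\frac{|S_k|}{|S|}\,H(S_k),
         \qquad H(S) \;=\; -\sum_{c} p_c \log_2 p_c,

     where p_c is the proportion of class c in the node; the split with the largest gain is chosen.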
  13. Why a regression, and what is a regression? A regression is a model to explain and predict a process: supervised machine learning.
  14. Why regularize? Terms are correlated; the regression matrix becomes close to singular; a badly conditioned matrix yields poor numerical results. Bayesian interpretation: the data-fit term is the likelihood, the regularisation term is the prior, and we rather minimize the resulting penalised (posterior) objective, written out below.
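     A standard way to write the penalised least-squares objective the slide refers to (ridge form; the lambda notation is the usual convention, not taken verbatim from the slide):

         \min_{\beta}\ \underbrace{\lVert y - X\beta\rVert_2^2}_{\text{likelihood term}} \;+\; \underbrace{\lambda\,\lVert\beta\rVert_2^2}_{\text{regularisation / prior}},

     which is the MAP (posterior) estimate under a Gaussian likelihood and a Gaussian prior on the coefficients.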
  15. Why Lasso and Elastic Net? No method owns the truth; reduce the number of predictors in a regression model; identify important predictors; select among redundant predictors; produce shrinkage estimates with potentially lower predictive errors than ordinary least squares (choose the penalty by cross-validation). The Lasso and Elastic Net objectives are recalled below.
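     The two penalised objectives (generic textbook form; the Statistics Toolbox lasso function uses an equivalent parameterisation up to scaling):

         \text{Lasso:}\quad \min_{\beta}\ \lVert y - X\beta\rVert_2^2 + \lambda\,\lVert\beta\rVert_1
         \qquad
         \text{Elastic Net:}\quad \min_{\beta}\ \lVert y - X\beta\rVert_2^2 + \lambda\Big(\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Big).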
  16. Ensemble learning. Why ensemble learning? 'Melding results from many weak learners into one high-quality ensemble predictor.'
  17. Main differences between Bagging and Boosting (a minimal code sketch of both follows).
      Defining features:
      - Bagging relies on randomness; boosting is adaptive and deterministic.
      - Bagging trains on bootstrapped samples; boosting trains on the complete initial sample.
      - In bagging, each model must perform well over the whole sample; in boosting, each model has to perform better than the previous one on the hard points (outliers).
      - In bagging, every model has the same weight; in boosting, models are weighted according to their performance.
      Advantages and disadvantages:
      - Bagging reduces model variance; with boosting, variance might rise.
      - Neither produces a simple model anymore.
      - Bagging can be parallelized; boosting cannot.
      - Bagging overfits noise less, so it is better than boosting on noisy data and is usually more efficient; in specific cases, boosting can achieve a far better accuracy.
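     A minimal sketch of both families (TreeBagger appears on other slides; fitensemble is the Statistics Toolbox boosting entry point and is an addition here; x, y, x_new are assumed to be in the workspace):

         % Bagging: bootstrap-aggregated decision trees (embarrassingly parallel)
         bag = TreeBagger(100, x, y, 'Method', 'classification');
         y_bag = predict(bag, x_new);

         % Boosting: trees are grown sequentially, reweighting the training sample
         boost = fitensemble(x, y, 'AdaBoostM1', 100, 'Tree');   % binary classes; 'AdaBoostM2' for more
         y_boost = predict(boost, x_new);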
  18. Big Data: Learning over Distributed Data
  19. Distributed memory: MDCS and the map/reduce paradigm (an illustrative parfor sketch follows)
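     One simple way to use MDCS / the Parallel Computing Toolbox in this spirit is to grow independent forests on the workers ('map') and pool their class scores ('reduce'); this is an illustrative sketch, not the code from the talk:

         matlabpool open                            % parpool on newer releases
         nWorkers = 4;
         forests = cell(nWorkers,1);
         parfor w = 1:nWorkers
             forests{w} = TreeBagger(25, x, y, 'Method', 'classification');
         end
         scores = 0;
         for w = 1:nWorkers
             [~, s] = predict(forests{w}, x_new);   % per-class scores for each new point
             scores = scores + s / nWorkers;        % 'reduce': average the workers' scores
         end
         [~, idx] = max(scores, [], 2);
         y_pred = forests{1}.ClassNames(idx);       % map winning column back to a class label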
  20. Big data and machine learning: "It's not who has the best algorithm that wins. It's who has the most data."
  21. Quick overview: exploratory analysis, clustering, classification
  22. Aims of this presentation: awareness of the range of methods for multivariate data; a reasonable understanding of the algorithms.
  23. Data Mining: Exploratory Data Analysis, Clustering, Classification, Regression (figure: scatter plot of eight generated groups). Predictor types: categorical, ordinal, discontinuous.
  24. Exploratory Data Analysis. Why exploratory analysis? It can be used for a graphical view and for 'pre-filtering': preliminary data trends and behaviour. Means: multivariate plots; feature transformation (principal component analysis, factor model); feature selection (stepwise optimization).
  25. Data Exploration: getting an overview of individual variables.
      >> X = [MPG,Acceleration,Displacement,Weight,Horsepower];
      Basic histogram >> hist(x(:,1))
      Custom number of bins >> hist(x(:,1),50)
      By group >> hist(byGroup,20)
      Gaussian fit >> histfit(x(:,2))
      3D histogram >> hist3(x(:,1:2))
      Scatter plot >> gscatter(x(:,1),x(:,2),groups)
      Pie chart >> pie3(proportions,groups)
      Box plot >> boxplot(x(:,1),groups)
      (figure: the corresponding example plots on the car dataset)
  26. Data Exploration: getting an overview of multiple variables.
      Plot matrix by group >> gplotmatrix(x,x,groups)
      Parallel coordinates plot >> parallelcoords(x,'Group',groups)
      Andrews' plot >> andrewsplot(x,'Group',groups)
      Glyph plot >> glyphplot(x)
      Chernoff faces >> glyphplot(x,'Glyph','face')
      (figure: the corresponding plots for MPG, Acceleration, Displacement, Weight, Horsepower)
  27. Principal component analysis >> [pcs,scrs,variances] = princomp(stocks); (figures: Pareto plot of the variance explained per component; 3-D plot of the stocks' loadings on components 1-3)
  28. Factor model: an alternative to PCA to improve your components. >> [Lambda,Psi,T,stats,F] = factoran(stocks,3,'rotate','promax'); (figure: rotated loadings of the stocks on components 1-3)
  29. Paring predictors: stepwise optimization. Some predictors might be correlated, others irrelevant. Requires Statistics Toolbox™. >> [coeff,inOut] = stepwisefit(stocks, index); (figure: index returns and prices, 2007-2011, original data vs. stepwise fit)
  30. Cloud of randomly generated points: each cluster center is randomly chosen inside specified bounds; each cluster contains the specified number of points; each cluster point is sampled from a Gaussian distribution; multidimensional dataset.
      >> clusters = 8;    % number of clusters
      >> points = 30;     % number of points in each cluster
      >> std_dev = 0.05;  % common cluster standard deviation
      >> bounds = [0 1];  % bounds for the cluster centers
      >> [x,vcentroid,proportions,groups] = cluster_generation(bounds,clusters,points,std_dev);
      (figure: the eight generated groups)
  31. Clustering. Why clustering? To segment populations into natural subgroups; to identify outliers; as a preprocessing method (build separate models on each subgroup). Means: hierarchical clustering; clustering with neural networks (self-organising map, competitive layer); K-means clustering; fuzzy c-means clustering; clustering using Gaussian mixture models. Predictors: categorical, ordinal, discontinuous. (figure: input vectors x(1), x(2))
  32. Hierarchical Cluster Analysis: what is it doing? (figure: clustered scatter plot with a dendrogram cut-off of 0.1)
  33. Hierarchical Cluster Analysis: how do I do it?
      • Calculate pairwise distances between points >> distances = pdist(x)
      • Carry out hierarchical cluster analysis >> tree = linkage(distances)
      • Visualise as a dendrogram >> dendrogram(tree)
      • Assign points to clusters >> assignments = cluster(tree,'cutoff',0.1)
  34. Assessing the quality of a hierarchical cluster analysis: the cophenetic correlation coefficient measures how closely the lengths of the tree links match the original distances between points, i.e. how 'faithful' the tree is to the original data (0 is poor, 1 is good). >> cophenet(tree,distances)
  35. K-Means Cluster Analysis: what is it doing? Randomly pick K cluster centroids; assign points to the closest centroid; recalculate the positions of the cluster centroids; reassign points to the closest centroid; repeat until the centroid positions converge (a minimal hand-rolled iteration is sketched below).
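     For intuition only, a hand-rolled version of the loop described above (the built-in kmeans on the next slide is what you would actually use; x is assumed to be an n-by-d data matrix):

         K = 8;
         perm = randperm(size(x,1));
         centroids = x(perm(1:K), :);                 % K random points as initial centroids
         for iter = 1:100
             d = pdist2(x, centroids);                % n-by-K matrix of point-centroid distances
             [~, memberships] = min(d, [], 2);        % assignment step
             newCentroids = centroids;
             for k = 1:K
                 if any(memberships == k)
                     newCentroids(k,:) = mean(x(memberships == k, :), 1);   % update step
                 end
             end
             if max(abs(newCentroids(:) - centroids(:))) < 1e-8, break; end
             centroids = newCentroids;
         end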
  36. K-Means Cluster Analysis: how do I do it? Running the K-means algorithm for a fixed K: >> [memberships,centroids] = kmeans(x,K); (figure: resulting clusters)
  37. Evaluating a K-Means analysis and choosing K.
      • Try a range of different values of K and compare the total point-centroid distance for each:
        >> for K=3:15
               [clusters,centroids,distances] = kmeans(data,K);
               totaldist(K-2) = sum(distances);
           end
           plot(3:15,totaldist);
      • Create silhouette plots >> silhouette(data,clusters)
  38. Sidebar: Distance Metrics.
      • Measures of how similar datapoints are; different definitions make sense for different data.
      • Many built-in distance metrics, or define your own >> doc pdist
        >> distances = pdist(data,metric);   % pdist = pairwise distances
        >> squareform(distances)
        >> kmeans(data,k,'distance','cityblock')   % not all metrics are supported
      • Euclidean distance: the default. Cityblock distance: useful for discrete variables. Cosine distance: useful for clustering variables.
  39. Fuzzy c-means Cluster Analysis: what is it doing? Very similar to K-means, but samples are not assigned definitively to a cluster: they have a 'membership' value relative to each cluster. Requires Fuzzy Logic Toolbox™. Running the fuzzy c-means algorithm for a fixed K: >> [centroids, memberships] = fcm(x,K);
  40. Gaussian Mixture Models. Assume the data is drawn from a fixed number K of normal distributions, and fit their parameters using the EM algorithm. >> gmobj = gmdistribution.fit(x,8); >> assignments = cluster(gmobj,x); Plot the probability density >> ezsurf(@(x,y)pdf(gmobj,[x y])); (figure: fitted density surface)
  41. Evaluating a Gaussian Mixture Model clustering.
      • Plot the probability density function of the model >> ezsurf(@(x,y)pdf(gmobj,[x y]));
      • Plot the posterior probabilities of observations >> p = posterior(gmobj,data); >> scatter(data(:,1),data(:,2),5,p(:,g));   % do this for each group g
      • Plot the Mahalanobis distances of observations to components >> m = mahal(gmobj,data); >> scatter(data(:,1),data(:,2),5,m(:,g));   % do this for each group g
  42. Choosing the right number of components in a Gaussian Mixture Model.
      • Evaluate for a range of K and plot AIC and/or BIC.
      • AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) measure the quality of the model fit, with a penalty for higher K (definitions recalled below).
        >> for K=3:15
               gmobj = gmdistribution.fit(data,K);
               AIC(K-2) = gmobj.AIC;
           end
           plot(3:15,AIC);
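     For reference (standard definitions, not from the slide): with k free parameters, n observations and maximised likelihood \hat{L},

         \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L};

     lower values indicate a better trade-off between fit and model complexity.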
  43. Neural Networks: what are they? (figure: a two-layer feedforward network: input variables, weights, bias, transfer function, output variable). Build your architecture; the output of such a network is written out below.
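     In standard notation (added here for reference, not from the slide), a two-layer feedforward network computes

         y \;=\; f_2\!\big(W_2\, f_1(W_1 x + b_1) + b_2\big),

     where the W are the weight matrices, the b the biases and the f the layer transfer functions.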
  44. Self-Organising Map Neural Networks: what are they? Start with a regular grid of 'neurons' laid over the dataset; the size of the grid gives the number of clusters; neurons compete to recognise datapoints (by being close to them); winning neurons are moved closer to the datapoints; repeat until convergence (a selforgmap sketch follows). (figure: SOM weight positions before and after training)
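     A minimal sketch with the Neural Network Toolbox SOM (the 4-by-4 grid size is an arbitrary choice here; the toolbox expects one column per observation):

         net = selforgmap([4 4]);          % 4-by-4 grid, i.e. 16 clusters
         net = train(net, x');
         assignments = vec2ind(net(x'));   % winning neuron (cluster index) for each point
         plotsompos(net, x');              % visualise the neuron positions over the data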
  45. Summary: cluster analysis. No method owns the truth; use the diagnostic tools to assess your clusters; beware of local minima (consider global optimization).
  46. Classification. Why classification? Can be used to learn how to classify from already classified observations, and to classify new observations. Means: discriminant analysis classification; bootstrap-aggregated decision tree classifier; neural network classifier; support vector machine classifier. (figure: the eight generated groups)
  47. Discriminant Analysis: how does it work? Fit a multivariate normal density to each class:
      • linear — fits a multivariate normal density to each group, with a pooled estimate of covariance (the default);
      • diaglinear — similar to linear, but with a diagonal covariance matrix estimate (naive Bayes classifier);
      • quadratic — fits multivariate normal densities with covariance estimates stratified by group;
      • diagquadratic — similar to quadratic, but with a diagonal covariance matrix estimate (naive Bayes classifier).
      Classify a new point by evaluating its probability under each density function and assigning it to the class with the highest probability.
  48. Discriminant Analysis: how do I do it?
      • Linear discriminant analysis >> classes = classify(sample,training,group)
      • Quadratic discriminant analysis >> classes = classify(x,x,y,'quadratic')
      • Naïve Bayes >> nbGau = NaiveBayes.fit(x, y); >> y_pred = nbGau.predict(x);
      (figure: class regions for the eight groups under the linear and quadratic models)
  49. Interpreting Discriminant Analyses: visualise the posterior probability surfaces.
      >> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
      >> X = XI(:); Y = YI(:);
      >> [class,err,P] = classify([X Y], meas(:,1:2), species,'quadratic');
      >> for i=1:3
             ZI = reshape(P(:,i),100,100);
             surf(XI,YI,ZI,'EdgeColor','none'); hold on;
         end
  50. Interpreting Discriminant Analyses: visualise the probability density of sample observations, an indicator of the region in which the model has support from the training data.
      >> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
      >> X = XI(:); Y = YI(:);
      >> [class,err,P,logp] = classify([X Y], meas(:,1:2), species, 'quadratic');
      >> ZI = reshape(logp,100,100);
      >> surf(XI,YI,ZI,'EdgeColor','none');
  51. Classifying with K-Nearest Neighbours: what does it do? One of the simplest classifiers: a sample is classified by taking the K nearest points from the training set and choosing the majority class of those K points. There is no real training phase; all the work is done when the model is applied. >> classes = knnclassify(sample,training,group,K) (figure: class regions for the eight groups)
  52. Decision Trees: how do they work? Find the threshold value for a variable that best partitions the dataset; consider thresholds for all predictors; the resulting model is a tree where each node is a logical test on a predictor (var1 < thresh1, var2 > thresh2).
  53. Decision Trees: how do I build them?
      • Build the tree model >> tree = classregtree(x,y); >> view(tree)
      • Evaluate the model on new data >> tree(x_new)
      (figure: class regions for the eight groups)
  54. Enhancing the model: bagged trees.
      • Prune the decision tree >> [cost,secost,ntnodes,bestlevel] = test(t, 'test', x, y); >> topt = prune(t, 'level', bestlevel);
      • Bootstrap-aggregated forest of trees >> forest = TreeBagger(100, x, y); >> y_pred = predict(forest,x);
      • Visualise class boundaries as before (figure: class regions for the eight groups)
  55. Pattern Recognition Neural Networks: what are they? Two-layer (i.e. one-hidden-layer) feedforward neural networks can learn any input-output relationship given enough neurons in the hidden layer. No restrictions on the predictors.
  56. Pattern Recognition Neural Networks: how do I build them?
      • Build a neural network model >> net = patternnet(10);
      • Train the net to classify observations >> [net,tr] = train(net,x,y);
      • Apply the model to new data >> y_pred = net(x);
      (figure: decision regions for the eight groups)
  57. Support Vector Machines: what are they? The SVM algorithm finds a boundary between the classes that maximises the minimum distance of the boundary to any of the points. No restrictions on the predictors. Use one-vs-all to classify multiple classes.
  58. Support Vector Machines: how do I build them?
      • Build an SVM model >> svmmodel = svmtrain(x,y)
      • Try different kernel functions >> svmmodel = svmtrain(x,y,'kernel_function','rbf')
      • Apply the model to new data >> classes = svmclassify(svmmodel,x_new);
      (figure: decision boundary with the support vectors highlighted)
  59. Evaluating a Classifying Model. Three main strategies:
      • Resubstitution: test the model on the same data that you trained it with.
      • Cross-validation.
      • Holdout: test on a completely new dataset.
      Use cross-validation to evaluate model parameters such as the number of leaves for a tree or the number of hidden neurons. Apply cross-validation to your classifying model:
      >> cp = cvpartition(y,'k',10);
      >> ldaFun = @(xtrain,ytrain,xtest)(classify(xtest,xtrain,ytrain));
      >> ldaCVErr = crossval('mcr',x,y,'predfun',ldaFun,'partition',cp)
  60. Summary: classification algorithms. No absolute best method; simple does not mean inefficient; decision trees and neural networks can overfit the noise, so use bootstrapping and cross-validation; parallelize where possible.
  61. Regression. Why regression? Can be used to learn to model a continuous response from observations and to predict the response for new observations. Means: linear regressions; non-linear regressions; bootstrap-aggregated regression trees; neural network as a fitting tool.
  62. New dataset with a continuous response from one predictor: a non-linear function to fit; a continuous response to fit from one continuous predictor. >> [x,t] = simplefit_dataset; (figure: the target function)
  63. Linear Regression: what is it? A collection of methods that find the best coefficients b such that y ≈ X*b. 'Best' b means minimising the least-squares difference between the predicted and actual values of y. 'Linear' means linear in b: you can include extra variables to give a nonlinear relationship in X.
  64. Linear Regression: how do I do it?
      • Backslash (least-squares solve) >> b = X\y
      • Linear regression >> b = regress(y, [ones(size(x,1),1) x]) >> stats = regstats(y, [ones(size(x,1),1) x])
      • Robust regression, better in the presence of outliers >> robust_b = robustfit(X,y)   % NB (X,y) not (y,X)
      • Ridge regression, better if the data are close to collinear >> ridge_b = ridge(y,X,k)   % k is the ridge parameter
      • Apply the model to new data >> y = newdata*b;
  65. Interpreting a linear regression model.
      • Examine the coefficients to see which predictors have a large effect on the response >> [b,bint,r,rint,stats] = regress(y,X) >> errorbar(1:size(b,1), b, b-bint(:,1), bint(:,2)-b)
      • Examine the residuals to check for possible outliers >> rcoplot(r,rint)
      • Examine the R2 statistic and p-value to check overall model significance >> stats(1)*100   % R2 as a percentage >> stats(3)   % p-value
      • Additional diagnostics with regstats
  66. Non-linear curve fitting with a least-squares algorithm.
      >> model = @(b,x)(b(1)+b(2).*cos(b(3)*x+b(4))+b(5).*cos(b(6)*x+b(7))+b(8).*cos(b(9)*x+b(10)));
      >> [ahat,r,J,cov,mse] = nlinfit(x,t,model,a0);   % a0: vector of initial parameter guesses
      (figure: fitted curve and residuals)
  67. Fit Neural Networks: what are they? Fitting networks are feedforward neural networks used to fit an input-output relationship. This architecture can learn any input-output relationship given enough neurons. No restrictions on the predictors (categorical, ordinal, discontinuous).
  68. Fit Neural Networks: how do I build them?
      • Build a fitting neural net model >> net = fitnet(10);
      • Train the net to fit the target >> [net,tr] = train(net,x,t);
      • Apply the model to new data >> y_pred = net(x);
      (figure: function fit for output element 1, with targets, outputs and errors)
  69. Regression trees: what are they? A decision tree with binary splits for regression; an object of class RegressionTree can predict responses for new data with the predict method. No restrictions on the predictors (categorical, ordinal, discontinuous).
  70. Regression trees: how do I use them?
      • Build a regression tree model >> rtree = RegressionTree.fit(x,t);
      • Evaluate the tree on the training data >> y_tree = predict(rtree,x);
      • Apply the model to new data >> y_pred = predict(rtree,x_new);
      (figure: fitted response and residuals)
  71. Summary. Data Mining:
      - Exploration: univariate (pie chart, histogram, etc.); multivariate (feature selection and transformation).
      - Modelling: clustering (partitive: K-means, Gaussian mixture model, SOM; hierarchical); classification (discriminant, decision tree, neural network, support vector machine); regression.
