Machine Learning and Statistical Analysis


  • Inductive learning – extracts rules, patterns, or information from massive data (e.g., decision trees, clustering, …). Deductive learning – requires no additional input but improves performance gradually (e.g., advice taker, …).

    1. Jong Youl Choi, Computer Science Department
    2. Social Bookmarking – socialized tags and bookmarks
    3. (figure-only slide)
    4. Principles of Machine Learning
       • Bayes’ theorem and maximum likelihood
       Machine Learning Algorithms
       • Clustering analysis
       • Dimension reduction
       • Classification
       Parallel Computing
       • General parallel computing architecture
       • Parallel algorithms
    5. Definition
       • Algorithms or techniques that enable a computer (machine) to “learn” from data; related to many areas such as data mining, statistics, and information theory.
       Algorithm types
       • Unsupervised learning
       • Supervised learning
       • Reinforcement learning
       Topics
       • Models: Artificial Neural Network (ANN), Support Vector Machine (SVM)
       • Optimization: Expectation-Maximization (EM), Deterministic Annealing (DA)
    6. Posterior probability of θi, given X
       • θi ∈ Θ : parameter
       • X : observations
       • P(θi) : prior (or marginal) probability
       • P(X | θi) : likelihood
       Maximum Likelihood (ML)
       • Used to find the most plausible θi ∈ Θ, given X
       • Computing the maximum likelihood (ML) or log-likelihood ⇒ an optimization problem
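The ML idea above can be sketched in a few lines of plain Python (not from the slides; the 1-D Gaussian and the sample data are illustrative). The closed-form estimates below are the well-known ML solution for a single Gaussian:

```python
import math

def gaussian_mle(data):
    """Closed-form maximum-likelihood estimates for a 1-D Gaussian:
    the sample mean and the (biased, 1/n) sample variance maximize
    the log-likelihood sum of log N(x | mu, sigma^2)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # ML variance uses 1/n, not 1/(n-1)
    return mu, var

def log_likelihood(data, mu, var):
    """Log-likelihood of the data under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

mu, var = gaussian_mle([1.0, 2.0, 3.0, 4.0])  # mu = 2.5, var = 1.25
```

Any other (mu, var) pair gives a strictly lower log-likelihood, which is the "optimization problem" view on the slide.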
    7. Problem
       • Estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions
       • Gaussian distribution
       • Maximum likelihood: with a Gaussian (P = N), solve either by brute force or by a numerical method (Mitchell, 1997)
    8. Problems in ML estimation
       • The observation X is often not complete
       • A latent (hidden) variable Z exists
       • Hard to explore the whole parameter space
       Expectation-Maximization algorithm
       • Objective: find the ML estimate over the latent distribution P(Z | X, θ)
       • Steps
         0. Init – choose a random θold
         1. E-step – compute the expectation over P(Z | X, θold)
         2. M-step – find the θnew that maximizes the likelihood
         3. Go to step 1 after updating θold ← θnew
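The E/M loop above can be sketched for the simplest interesting case, a two-component 1-D Gaussian mixture with fixed, equal variances and weights (a deliberate simplification of the slide's setting; the data and initial guesses are illustrative):

```python
import math

def em_gmm_1d(data, mu1, mu2, sigma=1.0, iters=20):
    """EM for a two-component 1-D Gaussian mixture; only the means
    are re-estimated (variances and mixing weights held fixed)."""
    def lik(x, mu):
        # unnormalized Gaussian likelihood (shared constants cancel in the ratio)
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

    for _ in range(iters):
        # E-step: responsibility r_i = P(z_i = 1 | x_i, theta_old)
        r = [lik(x, mu1) / (lik(x, mu1) + lik(x, mu2)) for x in data]
        # M-step: responsibility-weighted means maximize the expected log-likelihood
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
        # then loop: theta_old <- theta_new
    return mu1, mu2

# Two clusters near 0 and 5; EM recovers both means from a rough initial guess
mu1, mu2 = em_gmm_1d([0.0, 0.2, -0.1, 5.0, 5.1, 4.9], mu1=0.5, mu2=4.0)
```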
    9. Definition
       • Grouping unlabeled data into clusters, for the purpose of inferring hidden structures or information
       Dissimilarity measures
       • Distance: Euclidean (L2), Manhattan (L1), …
       • Angle: inner product, …
       • Non-metric: rank, intensity, …
       Types of clustering
       • Hierarchical: agglomerative or divisive
       • Partitioning: K-means, VQ, MDS, … (Matlab help page)
    10. Find K partitions with the total intra-cluster variance minimized
       • Iterative method
         • Initialization: randomized yi
         • Assignment of x (yi fixed)
         • Update of yi (x fixed)
       • Problem? ⇒ can become trapped in local minima (MacKay, 2003)
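The assignment/update iteration above is Lloyd's K-means algorithm; a minimal sketch on 1-D points (the data and seed are illustrative, and, as the slide warns, a bad initialization can leave it in a local minimum):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on 1-D points: alternate the assignment step
    (centers y_i fixed) and the update step (assignments fixed)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment: nearest center by squared distance
            j = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[j].append(p)
        for i, c in enumerate(clusters):  # update: center <- cluster mean
            if c:
                centers[i] = sum(c) / len(c)
    return sorted(centers)

# Two well-separated groups around 1 and 10
centers = kmeans([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
```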
    11. Deterministically avoid local minima
       • No stochastic process (random walk)
       • Trace the global solution by changing the level of randomness
       Statistical mechanics
       • Gibbs distribution
       • Helmholtz free energy F = D − TS
         • Average energy D = ⟨Ex⟩
         • Entropy S = −Σ P(Ex) ln P(Ex)
         • F = −T ln Z
       • In DA, we minimize F (Maxima and Minima, Wikipedia)
    12. Analogy to the physical annealing process
       • Control energy (randomness) by temperature (high → low)
       • Starting at high temperature (T = ∞)
         • Soft (or fuzzy) association probability
         • Smooth cost function with one global minimum
       • Lowering the temperature (T → 0)
         • Hard association
         • The full complexity is revealed; clusters emerge
       • Minimization of F, using E(x, yj) = ‖x − yj‖², performed iteratively
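The temperature-controlled soft association above comes from the Gibbs distribution; a small sketch (illustrative data; fixed centers, showing only the association step, not the full annealing loop):

```python
import math

def soft_assign(x, centers, T):
    """Gibbs association probabilities at temperature T:
    P(j | x) proportional to exp(-||x - y_j||^2 / T).
    High T: near-uniform (fuzzy); T -> 0: hard nearest-center assignment."""
    w = [math.exp(-((x - y) ** 2) / T) for y in centers]
    z = sum(w)  # partition function Z
    return [wi / z for wi in w]

hot = soft_assign(0.4, [0.0, 1.0], T=100.0)  # nearly uniform: fuzzy association
cold = soft_assign(0.4, [0.0, 1.0], T=0.01)  # nearly hard: nearest center wins
```

Lowering T gradually while re-estimating the centers against these probabilities is what traces the smooth global minimum into the full clustered solution.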
    13. Definition
       • The process of transforming high-dimensional data into low-dimensional data, to improve accuracy or understanding, or to remove noise
       Curse of dimensionality
       • Complexity grows exponentially in volume as extra dimensions are added (Koppen, 2000)
       Types
       • Feature selection: choose representatives (e.g., filters, …)
       • Feature extraction: map to a lower dimension (e.g., PCA, MDS, …)
    14. Finding a map of the principal components (PCs) of the data into an orthogonal space, y = Wx where W ∈ R^(h×d) (h ≪ d)
       • PCs – the variables with the largest variances
         • Orthogonality
         • Linearity – optimal least mean-square error
       • Limitations?
         • Strict linearity
         • Assumes a specific distribution
         • Large-variance assumption
       (Figure: PC1 and PC2 axes in the x1–x2 plane)
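The "largest variance" direction can be found without a linear-algebra library by power iteration on the covariance matrix; a 2-D sketch (the data set is illustrative, and only the first PC is computed):

```python
def pca_first_component(data, iters=100):
    """First principal component of 2-D data via power iteration on the
    sample covariance matrix: the unit direction of largest variance."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # entries of the 2x2 covariance matrix C
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 0.0
    for _ in range(iters):  # power iteration: v <- Cv / ||Cv||
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = (wx * wx + wy * wy) ** 0.5
        vx, vy = wx / norm, wy / norm
    return vx, vy

# Points spread along the line y = x, so PC1 is close to (1/sqrt(2), 1/sqrt(2))
v = pca_first_component([(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)])
```

Projecting each centered point onto this direction is the h = 1 case of y = Wx.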
    15. Like PCA, reduces dimension by y = Rx, where R is a random matrix with i.i.d. columns and R ∈ R^(p×d) (p ≪ d)
       • Johnson–Lindenstrauss lemma: when projecting onto a randomly selected subspace, distances are approximately preserved
       • Generating R
         • Hard to obtain an orthogonalized R
         • Gaussian R
         • Simple approach: choose rij ∈ {+√3, 0, −√3} with probabilities 1/6, 4/6, 1/6, respectively
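The "simple approach" above (Achlioptas-style sparse entries) is easy to generate directly; a sketch with illustrative dimensions d = 100, p = 20:

```python
import random

def sparse_random_matrix(d, p, seed=0):
    """p x d random projection matrix with i.i.d. entries
    +sqrt(3), 0, -sqrt(3) chosen with probabilities 1/6, 4/6, 1/6."""
    rng = random.Random(seed)
    s = 3 ** 0.5
    def entry():
        u = rng.random()
        return s if u < 1 / 6 else (-s if u > 5 / 6 else 0.0)
    return [[entry() for _ in range(d)] for _ in range(p)]

def project(R, x):
    """y = Rx, scaled by 1/sqrt(p) so squared distances are preserved
    in expectation (the Johnson-Lindenstrauss setting)."""
    p = len(R)
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) / p ** 0.5 for row in R]

R = sparse_random_matrix(d=100, p=20)
y = project(R, [1.0] * 100)  # a 100-dim point mapped to 20 dims
```

Because 4/6 of the entries are zero, the projection costs roughly a third of a dense Gaussian R while giving the same distance-preservation guarantee.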
    16. Dimension reduction preserving the distance proximities observed in the original data set
       • Loss functions
         • Inner product
         • Distance
         • Squared distance
       • Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ
         • From Δ, find the inner-product matrix B (double centering)
         • From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
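The double-centering step above has a direct formula, B = −½ J D² J with J = I − 11ᵀ/n; a sketch on a tiny hand-checkable distance matrix (three points on a line at 0, 1, 3):

```python
def double_center(D):
    """Classical MDS double centering: B = -1/2 * J D^2 J, where
    J = I - (1/n) * 11^T. B is the Gram (inner-product) matrix of the
    centered configuration, so B = X'X'^T can be factored for coordinates."""
    n = len(D)
    D2 = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(D2[i]) / n for i in range(n)]                  # row means of D^2
    col = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]  # column means
    tot = sum(row) / n                                        # grand mean
    return [[-0.5 * (D2[i][j] - row[i] - col[j] + tot) for j in range(n)]
            for i in range(n)]

# Pairwise distances of 1-D points 0, 1, 3; centered coords are -4/3, -1/3, 5/3
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
B = double_center(D)  # B[i][j] equals x_i * x_j for those centered coordinates
```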
    17. SMACOF: minimizing STRESS
       • Majorization – for a complex f(x), find an auxiliary simple g(x, y) such that f(x) ≤ g(x, y) for all x, with equality at x = y
       • Majorization for STRESS
       • Minimize tr(XᵀB(Y)Y), known as the Guttman transform (Cox, 2001)
    18. Competitive, unsupervised learning process for clustering and visualization
       • Result: similar data move closer together in the model space (input → model)
       Learning
       • Choose the model vector mj most similar to xi
       • Update the winner and its neighbors by mk = mk + α(t) h(t)(xi − mk)
         • α(t): learning rate
         • h(t): neighborhood size
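The winner-plus-neighbors update above can be sketched for a 1-D self-organizing map on scalar inputs (a deliberately tiny version: the decay schedules for α(t) and the neighborhood radius are illustrative choices, not from the slides):

```python
def train_som_1d(data, n_units=4, epochs=30):
    """1-D SOM: find the best-matching unit for each input, then pull the
    winner and its immediate map neighbors toward the input, with the
    learning rate and neighborhood size shrinking over time."""
    models = [i / (n_units - 1) for i in range(n_units)]  # model vectors m_k
    for t in range(epochs):
        alpha = 0.5 * (1 - t / epochs)         # decaying learning rate a(t)
        radius = 1 if t < epochs // 2 else 0   # shrinking neighborhood h(t)
        for x in data:
            winner = min(range(n_units), key=lambda k: (x - models[k]) ** 2)
            lo, hi = max(0, winner - radius), min(n_units, winner + radius + 1)
            for k in range(lo, hi):
                models[k] += alpha * (x - models[k])  # m_k += a(t)(x_i - m_k)
    return models

# Inputs clustered at 0.1 and 0.9: the map units split between the clusters
models = train_som_1d([0.1, 0.1, 0.9, 0.9])
```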
    19. Definition
       • A procedure that divides data into a given set of categories, based on a training set, in a supervised way
       Generalization vs. specialization
       • Hard to achieve both
       • Avoid overfitting (overtraining)
         • Early stopping
         • Holdout validation
         • K-fold cross-validation
         • Leave-one-out cross-validation
       (Figure: training vs. validation error, with underfitting and overfitting regions. Overfitting, Wikipedia)
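Of the validation schemes listed, K-fold cross-validation is the easiest to show concretely; a sketch of the index bookkeeping (interleaved folds are one simple choice; shuffling first is also common):

```python
def k_fold_splits(n, k):
    """Index splits for k-fold cross-validation: each fold serves exactly
    once as the held-out validation set, and the other k-1 folds form
    the training set. (Leave-one-out is the special case k = n.)"""
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((sorted(train), sorted(val)))
    return splits

splits = k_fold_splits(n=6, k=3)
# Every index appears in exactly one validation fold across the 3 splits
```

Averaging the validation error over all k splits gives a less noisy estimate of generalization than a single holdout set.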
    20. Perceptron: a computational unit with a binary threshold (a weighted sum followed by an activation function)
       • Abilities
         • Linearly separable decision surfaces
         • Represents Boolean functions (AND, OR, NOT)
         • A network (multilayer) of perceptrons ⇒ various network architectures and capabilities (Jain, 1996)
    21. Learning weights – random initialization and iterative updating
       • Error-correction training rules
         • Difference between training data and output: E(t, o)
         • Gradient descent (batch learning): with E = Σi Ei
         • Stochastic approach (on-line learning): update the gradient for each result
       • Various error functions
         • Add a weight-regularization term (Σ wi²) to avoid overfitting
         • Add a momentum term (Δwi(n−1)) to speed up convergence
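The on-line error-correction rule above, applied to the threshold perceptron of the previous slide, is enough to learn a linearly separable function such as Boolean AND (a minimal sketch; the learning rate and epoch count are illustrative):

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """Perceptron training rule: for each example, produce the binary
    threshold output o, then update w by lr * (t - o) * x whenever the
    output disagrees with the target t (on-line learning)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, t in data:  # one example at a time
            o = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = t - o  # error-correction term E(t, o)
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

# Boolean AND is linearly separable, so the perceptron learns it exactly
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

XOR, by contrast, has no separating line, which is exactly why the multilayer networks on the previous slide are needed.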
    22. Q: How do we draw the optimal linear separating hyperplane? ⇒ A: by maximizing the margin
       • Margin maximization
         • The distance between H+1 and H−1 is 2/‖w‖
         • Thus, ‖w‖ should be minimized
    23. Constrained optimization problem
       • Given a training set {xi, yi} (yi ∈ {+1, −1}), minimize ½‖w‖² subject to yi(w·xi + b) ≥ 1
       • Lagrangian equation with saddle points
         • Minimized w.r.t. the primal variables w and b
         • Maximized w.r.t. the dual variables αi (all αi ≥ 0)
         • An xi with αi > 0 (not αi = 0) is called a support vector (SV)
    24. Soft margin (non-separable case)
       • Slack variables ξi, with the additional constraint αi ≤ C in the optimization
       Non-linear SVM
       • Map the non-linear input into a feature space (input space → feature space)
       • Kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩
       • Kernel classifier built from the support vectors si
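The kernel classifier above evaluates f(x) = sign(Σi αi yi k(si, x) + b) using only the support vectors; a sketch with the Gaussian (RBF) kernel, where the support vectors, multipliers, and bias are hand-picked illustrative values rather than the output of a real SVM training run:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2), an inner
    product <Phi(x), Phi(y)> in an implicit feature space: a linear
    separator there is non-linear in the input space."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_classifier(x, svs, alphas, labels, b, kernel=rbf_kernel):
    """Decision function over the support vectors s_i:
    sign(sum_i alpha_i * y_i * k(s_i, x) + b)."""
    s = sum(a * y * kernel(sv, x) for sv, a, y in zip(svs, alphas, labels)) + b
    return 1 if s >= 0 else -1

k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0))  # identical points give k = 1.0
svs, alphas, labels = [(0.0, 0.0), (2.0, 2.0)], [1.0, 1.0], [-1, 1]
label_a = kernel_classifier((0.0, 0.1), svs, alphas, labels, b=0.0)
label_b = kernel_classifier((1.9, 2.0), svs, alphas, labels, b=0.0)
```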
    25. Memory architecture
       Decomposition strategy
       • Task – e.g., Word, IE, …
       • Data – scientific problems
       • Pipelining – task + data
       Shared memory
       • Symmetric Multiprocessor (SMP)
       • OpenMP, POSIX threads (pthreads), MPI
       • Easy to manage but expensive
       Distributed memory
       • Commodity, off-the-shelf processors
       • MPI
       • Cost-effective but hard to maintain (Barney, 2007)
    26. Shrinking
       • Recall: only the support vectors (αi > 0) are used in the SVM optimization
       • Predict whether each data point is an SV or a non-SV
       • Remove the non-SVs from the problem space
       Parallel SVM
       • Partition the problem
       • Merge data hierarchically
       • Each unit finds its support vectors
       • Loop until convergence (Graf, 2005)