  • During the past decade, a heterogeneous spectrum of data describing the genome has become available: sequence data (similarities between proteins and genes) and mRNA expression levels associated with a gene under different experimental conditions.
  • Lec1-Into

    1. 1. CSE 591: Machine learning and Applications Jieping Ye Department of Computer Science & Engineering Arizona State University
    2. 2. Brief Introduction <ul><li>Dr. Jieping Ye </li></ul><ul><li>Assistant Professor at CSE Dept. </li></ul><ul><li>Affiliated with the Center for Evolutionary Functional Genomics at the Biodesign Institute </li></ul><ul><li>Research interests: machine learning, data mining and their applications to bioinformatics </li></ul><ul><ul><li>Dimensionality reduction </li></ul></ul><ul><ul><li>Semi-supervised learning </li></ul></ul><ul><ul><li>Kernel learning </li></ul></ul><ul><ul><li>Biological image analysis </li></ul></ul>
    3. 3. Outline of lecture <ul><li>Course information </li></ul><ul><li>Project </li></ul><ul><li>Introduction to ML </li></ul><ul><li>Course schedule </li></ul><ul><li>Survey </li></ul>
    4. 4. Course Information <ul><li>Instructor: Dr. Jieping Ye </li></ul><ul><li>Office: BY 568 </li></ul><ul><li>Phone: 727-7451 </li></ul><ul><li>Email: [email_address] </li></ul><ul><li>Web : http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/ </li></ul><ul><li>Time: TTh 4:40 pm – 5:55 pm </li></ul><ul><li>Office hours: TTh 10:00 am – 11:45 am </li></ul><ul><li>Location: BYAC 270 </li></ul><ul><li>TA: Jianhui Chen </li></ul><ul><ul><li>Office hours: 3:30 pm – 4:30 pm, Th </li></ul></ul>
    5. 5. Course information (Cont’d) <ul><li>Prerequisite: Basics of linear algebra, algorithm design and analysis. </li></ul><ul><li>Course textbook: No textbook is required. (Papers and other materials are available at the class web page) </li></ul><ul><li>Objective : An in-depth understanding of some of the important machine learning methods and their applications in bioinformatics and other domains. </li></ul><ul><li>Topics : Clustering, regression, classification, semi-supervised learning, feature reduction, manifold learning, ranking, and kernel learning. </li></ul>
    6. 6. Reference books <ul><li>Pattern Classification. Duda, et al., 2000. </li></ul><ul><li>The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001. </li></ul><ul><li>Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004. </li></ul><ul><li>Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004. </li></ul><ul><li>Introduction to Data Mining. Tan, et al., 2005. </li></ul>
    7. 7. Grading <ul><li>Homework (3): 30% </li></ul><ul><li>Project : 40%. Two to three students form a group to carry out a small research project. </li></ul><ul><ul><li>A survey of the state of the art in an area related to this course </li></ul></ul><ul><ul><li>Machine learning techniques for specific applications </li></ul></ul><ul><ul><li>A comparative study of several well-known algorithms. </li></ul></ul><ul><ul><li>Design of a novel algorithm related to this course. </li></ul></ul><ul><li>Exam (1): 20%. There will be one open-book exam on 3/22/07 . </li></ul><ul><li>Class participation : 10%. Students are required to attend the lecture and participate in the class discussion. </li></ul><ul><li>A: 90–100, A-: 85–89, B+: 80–84, B: 70–79, C: 60–70 </li></ul>
    8. 8. Project <ul><li>Project proposal is due on 2/08/07 </li></ul><ul><ul><li>One half to one page </li></ul></ul><ul><ul><ul><li>Topics, references, and plan </li></ul></ul></ul><ul><li>The intermediate project report is due on 4/05/07 </li></ul><ul><ul><li>Five to ten pages </li></ul></ul><ul><li>The final project report is due on 4/26/07 </li></ul><ul><ul><li>Fifteen to twenty pages </li></ul></ul><ul><li>Project presentation </li></ul><ul><ul><li>About 5 minutes </li></ul></ul>
    9. 9. Programming languages <ul><li>Matlab </li></ul><ul><ul><li>Tutorials </li></ul></ul><ul><ul><ul><li>http://www.math.ufl.edu/help/matlab-tutorial/ </li></ul></ul></ul><ul><ul><ul><li>http://www.math.mtu.edu/~msgocken/intro/node1.html </li></ul></ul></ul><ul><li>R (Statistics) </li></ul><ul><ul><li>http://www.r-project.org/ </li></ul></ul><ul><li>Or other languages </li></ul>
    10. 10. What is machine learning? <ul><li>Machine learning is the study of computer systems that improve their performance through experience. </li></ul><ul><ul><li>Learn existing and known structures and rules. </li></ul></ul><ul><ul><li>Discover new findings and structures. </li></ul></ul><ul><ul><ul><li>Face recognition </li></ul></ul></ul><ul><ul><ul><li>Bioinformatics </li></ul></ul></ul><ul><li>Supervised learning vs. unsupervised learning </li></ul><ul><li>Semi-supervised learning </li></ul>
    11. 11. Machine learning versus data mining <ul><li>A lot of common topics </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Many others </li></ul></ul><ul><li>Different focuses </li></ul><ul><ul><li>ML focuses more on theory (statistics) </li></ul></ul><ul><ul><li>DM focuses more on applications </li></ul></ul>
    12. 12. Clustering <ul><li>Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups </li></ul>[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
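The clustering goal on this slide can be sketched with k-means (Lloyd's algorithm), one common way to keep intra-cluster distances small. The slides do not name a specific algorithm, so the choice of k-means, the function name `kmeans`, the toy points, and the hand-picked starting centers are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Move each center to the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

On data this cleanly separated, the first three points should share one label and the last three the other; real data usually needs several random restarts, which this sketch omits.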
    13. 13. Applications of Cluster Analysis <ul><li>Understanding </li></ul><ul><ul><li>Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations </li></ul></ul><ul><li>Summarization </li></ul><ul><ul><li>Reduce the size of large data sets </li></ul></ul>[Figure: clustering precipitation in Australia]
    14. 14. Classification: Definition <ul><li>Given a collection of records ( training set ) </li></ul><ul><ul><li>Each record contains a set of attributes , one of the attributes is the class . </li></ul></ul><ul><li>Find a model for class attribute as a function of the values of other attributes. </li></ul><ul><li>Goal: previously unseen records should be assigned a class as accurately as possible. </li></ul><ul><ul><li>A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. </li></ul></ul>
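A minimal sketch of the train/test workflow defined above, assuming a 1-nearest-neighbor classifier as the model (the slide does not prescribe one). The records, the split, and the helper names `predict_1nn` and `accuracy` are hypothetical.

```python
from math import dist

def predict_1nn(train, x):
    """Return the class of the training record whose attributes are nearest to x."""
    _, label = min(train, key=lambda rec: dist(rec[0], x))
    return label

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical records: (attributes, class).
records = [((1.0, 1.0), "a"), ((1.1, 0.9), "a"), ((0.9, 1.2), "a"),
           ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.1), "b")]
train = records[0:2] + records[3:5]   # used to build the model
test = [records[2], records[5]]       # held out to estimate accuracy
acc = accuracy(train, test)
```

The test records never influence the model, so `acc` estimates how well previously unseen records would be classified.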
    15. 15. Classification Example [Figure: a training set with two categorical attributes, one continuous attribute, and a class column is used to learn a classifier; the resulting model is applied to a test set]
    16. 16. Classification: Application <ul><li>Fraud Detection </li></ul><ul><ul><li>Goal: Predict fraudulent cases in credit card transactions. </li></ul></ul><ul><ul><li>Approach: </li></ul></ul><ul><ul><ul><li>Use credit card transactions and the information on its account-holder as attributes. </li></ul></ul></ul><ul><ul><ul><ul><li>When does a customer buy, what does he buy, how often he pays on time, etc </li></ul></ul></ul></ul><ul><ul><ul><li>Label past transactions as fraud or fair transactions. This forms the class attribute. </li></ul></ul></ul><ul><ul><ul><li>Learn a model for the class of the transactions. </li></ul></ul></ul><ul><ul><ul><li>Use this model to detect fraud by observing credit card transactions on an account. </li></ul></ul></ul>
    17. 17. Character Recognition <ul><li>Given a digit representation. </li></ul><ul><ul><li>What is its class? </li></ul></ul><ul><li>AT&T has used </li></ul><ul><ul><li>Neural Networks </li></ul></ul><ul><ul><li>Support Vector Machines </li></ul></ul><ul><li>Error rates ~1.4% </li></ul><ul><li>Inputs are 28x28 greyscale images. </li></ul>
    18. 18. Other applications <ul><li>Face recognition </li></ul><ul><li>Protein function prediction </li></ul><ul><li>Cancer detection </li></ul><ul><li>Document categorization </li></ul>
    19. 19. Data representation <ul><li>Traditional algorithms work on vectors. </li></ul><ul><li>Images can be represented as matrices or vectors. </li></ul><ul><li>Abstract data </li></ul><ul><ul><li>Graphs </li></ul></ul><ul><ul><li>Sequences </li></ul></ul><ul><ul><li>3D structures </li></ul></ul>
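To illustrate how an image stored as a matrix becomes a vector that traditional algorithms can consume, here is a small sketch. The helper `flatten` and the pixel values are hypothetical; row-major ordering is one possible convention, not one the slides specify.

```python
def flatten(image):
    """Concatenate the rows of a matrix into one vector (row-major order)."""
    return [pixel for row in image for pixel in row]

# Hypothetical 2x3 greyscale image.
image = [[0, 128, 255],
         [64, 32, 16]]
vector = flatten(image)

# A blank 28x28 image, the input size mentioned for character recognition.
digit = [[0] * 28 for _ in range(28)]
digit_vector = flatten(digit)   # one entry per pixel
```

A 28x28 image therefore becomes a point in a 784-dimensional space, which is one reason dimensionality matters for image data.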
    20. 20. Kernel Methods: Basic ideas [Figure: data mapped from the original space to the feature space]
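One way to make the original-space versus feature-space idea concrete is the degree-2 polynomial kernel in two dimensions, where the inner product in feature space can be computed without ever building the feature vectors. This particular kernel and the names `phi` and `kernel` are assumptions chosen for illustration; the slide does not single out a kernel.

```python
from math import sqrt

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """k(x, y) = (x . y)^2, computed entirely in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
in_feature_space = dot(phi(x), phi(y))   # map first, then take the inner product
in_original_space = kernel(x, y)         # same value, no explicit mapping needed
```

Both quantities agree up to floating-point rounding, which is the point of the kernel trick: an algorithm that only needs inner products can work in the feature space at the cost of evaluating `kernel`.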
    21. 21. Applications in bioinformatics <ul><li>Protein sequence </li></ul><ul><li>Protein structure </li></ul>
    22. 22. Data integration [Figure: genome-wide data combines mRNA expression data, protein-protein interaction data, hydrophobicity data, and sequence data (gene, protein)]
    23. 23. Curse of dimensionality <ul><li>Large sample size is required for high-dimensional data. </li></ul><ul><li>Query accuracy and efficiency degrade rapidly as the dimension increases. </li></ul><ul><li>Strategies </li></ul><ul><ul><li>Feature reduction </li></ul></ul><ul><ul><li>Feature selection </li></ul></ul><ul><ul><li>Manifold learning </li></ul></ul><ul><ul><li>Kernel learning </li></ul></ul>
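A rough numerical illustration of why high-dimensional data needs so many samples: the fraction of the cube [-1, 1]^d that lies within distance 1 of its center shrinks rapidly as d grows, so uniformly spread points become sparse around any query. The function name `ball_fraction` and the chosen dimensions are illustrative assumptions; the volume formula for the unit d-ball is standard.

```python
from math import pi, gamma

def ball_fraction(d):
    """Volume of the unit d-ball divided by the volume of the cube [-1, 1]^d."""
    ball = pi ** (d / 2) / gamma(d / 2 + 1)
    return ball / 2 ** d

# Fraction of the cube within distance 1 of the center, for several dimensions.
fractions = {d: ball_fraction(d) for d in (1, 2, 3, 10, 20)}
```

In two dimensions the fraction is pi/4; by ten dimensions it is well under one percent, which is one way to see why the strategies listed above try to reduce the effective dimension.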
    24. 24. Manifold learning <ul><li>A manifold is a topological space which is locally Euclidean. </li></ul>
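The phrase "locally Euclidean" can be illustrated with the unit circle, a one-dimensional manifold embedded in the plane. The sketch below compares the straight-line distance between two points with the distance measured along the circle; the circle example and the names `chord` and `arc` are assumptions for illustration only.

```python
from math import sin, pi

def chord(theta):
    """Euclidean distance between two unit-circle points separated by angle theta."""
    return 2 * sin(theta / 2)

def arc(theta):
    """Distance measured along the circle itself (geodesic distance)."""
    return theta

near = chord(0.01) / arc(0.01)   # nearby points: the two distances nearly agree
far = chord(pi) / arc(pi)        # opposite points: ratio is 2 / pi
```

For nearby points the ratio is almost 1, so a small piece of the circle behaves like a straight line; for distant points the two distances diverge, which is why manifold learning methods typically rely on local neighborhoods.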
    25. 25. Intuition: how does your brain store these pictures?
    26. 26. Model selection <ul><li>Choose the best model from a set of different models to fit to the data </li></ul><ul><li>Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) </li></ul><ul><ul><li>Models are specified by certain parameters. </li></ul></ul><ul><ul><ul><li>How to choose the best parameters? </li></ul></ul></ul><ul><ul><ul><li>Cross-validation (leave one out, k-fold CV) </li></ul></ul></ul>
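Cross-validation for parameter choice can be sketched as follows. The example selects the number of neighbors for a k-nearest-neighbor classifier, which stands in for the SVM and LDA parameters mentioned on the slide; the data, the candidate values, and all function names are hypothetical.

```python
from math import dist
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; fold i takes indices i, i+k, i+2k, ..."""
    return [list(range(i, n, k)) for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Majority class among the n_neighbors training records nearest to x."""
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, n_neighbors, k):
    """Average held-out accuracy of the classifier over k folds."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        test = [data[i] for i in fold]
        train = [rec for i, rec in enumerate(data) if i not in held_out]
        hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Hypothetical labelled records: (attributes, class).
data = [((0.0, 0.0), "a"), ((4.0, 4.0), "b"), ((0.5, 0.2), "a"), ((4.2, 3.9), "b"),
        ((0.1, 0.6), "a"), ((3.8, 4.3), "b"), ((0.4, 0.4), "a"), ((4.1, 4.1), "b")]
candidates = [1, 3, 5]
scores = {n: cv_accuracy(data, n, k=4) for n in candidates}
best = max(candidates, key=lambda n: scores[n])   # first candidate with the top score
```

Setting `k=len(data)` turns the same routine into leave-one-out cross-validation. With only eight records, five neighbors always outvote the held-out class, so the larger setting scores worst here; on real data the ranking of candidates is not known in advance, which is the reason to cross-validate.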
    27. 27. Machine learning applications <ul><li>Bioinformatics : Huge amounts of biological data from the human genome project and human proteomics initiative. </li></ul><ul><ul><li>Goal: Understanding of biological systems at the molecular level from diverse sources of biological data. </li></ul></ul><ul><ul><li>Challenge: Scalability, multiple sources, abstract data. </li></ul></ul><ul><ul><li>Applications: Microarray data analysis, Protein classification, Mass spectrometry data analysis, Protein-protein interaction. </li></ul></ul><ul><li>Others : Computer vision, information retrieval, image processing, text mining, web mining, etc. </li></ul>
    28. 28. Course schedule
    29. 29. Survey <ul><li>Why are you taking this course? </li></ul><ul><li>What would you like to gain from this course? </li></ul><ul><li>What topics are you most interested in learning about from this course? </li></ul><ul><li>Any other suggestions? </li></ul>
    30. 30. Next class <ul><li>Topics </li></ul><ul><ul><li>Basics of linear algebra </li></ul></ul><ul><ul><li>Basics of probability </li></ul></ul><ul><li>Readings (available at the class webpage) </li></ul><ul><ul><li>Mini tutorial on the Singular Value Decomposition </li></ul></ul>