Data and Text Mining for Computational Biology



  1. 1. Data and Text Mining for Computational Biology Introduction
  2. 2. Course information <ul><li>CS 6365 </li></ul><ul><li>Data and Text Mining for Computational Biology </li></ul><ul><li>Meets Monday and Wednesday 4:00-5:15 pm at ECSS 2.203 </li></ul>
  3. 3. Instructor <ul><li>Vasileios Hatzivassiloglou </li></ul><ul><li>Associate Professor, Computer Science </li></ul><ul><li>Founding Professor, Bioengineering </li></ul><ul><li>Research focus: Discover knowledge from massive amounts of raw data </li></ul><ul><ul><li>data not the same as information </li></ul></ul><ul><ul><li>information overload </li></ul></ul>
  4. 4. Research Interests <ul><li>Text analysis, machine learning, intelligent information retrieval, summarization, question answering, bioinformatics, medical informatics </li></ul>
  5. 5. Contact information <ul><li>Office hours: Monday and Wednesday 6:00-7:00pm and by appointment </li></ul><ul><li>Office location: ECSS 3.406 </li></ul><ul><li>[email_address] </li></ul><ul><li>(972) 883-4342 </li></ul><ul><li>Teaching Assistant: TBA </li></ul>
  6. 6. Course goals <ul><li>Introduce the field of bioinformatics </li></ul><ul><li>Discuss primary techniques used for data mining </li></ul><ul><li>Introduce text mining and additional issues it brings to data mining methods </li></ul><ul><li>Use examples from computational biology </li></ul>
  7. 7. Intended audience <ul><li>For both computer scientists and biologists </li></ul><ul><li>Not an easy task to balance the two </li></ul><ul><li>Focus on data and text mining algorithms and applications </li></ul><ul><ul><li>Coverage of machine learning background </li></ul></ul><ul><ul><li>No extensive algorithmic analysis / computational complexity </li></ul></ul><ul><ul><li>Medium level of programming </li></ul></ul>
  8. 8. Prerequisites <ul><li>Officially CS 6325 – Introduction to Bioinformatics </li></ul><ul><li>Waived for this offering of the course </li></ul><ul><li>You should know </li></ul><ul><ul><li>Basic data structures (multidimensional arrays, hash tables, binary trees) </li></ul></ul><ul><ul><li>One high-level programming language, well enough to adapt to a new one as needed </li></ul></ul><ul><ul><li>How to install and use external software packages </li></ul></ul>
  9. 9. You need not know <ul><li>Molecular biology </li></ul><ul><li>Machine learning </li></ul><ul><li>Data mining (in general) </li></ul><ul><li>Text analysis / natural language processing </li></ul><ul><li>Information retrieval </li></ul><ul><li>Artificial intelligence </li></ul>
  10. 10. Course level <ul><li>Introductory graduate course (MS or first-year PhD) </li></ul><ul><li>Maturity in programming and data structures at the level of a Computer Science senior </li></ul><ul><li>Ability to access (and interest in accessing) the primary literature in a guided fashion </li></ul>
  11. 11. Course structure <ul><li>6 lectures on biological background and bioinformatics in general </li></ul><ul><li>6 lectures on data similarity </li></ul><ul><li>8 lectures on data mining methods </li></ul><ul><li>2 lectures on text mining and knowledge mining methods </li></ul><ul><li>student presentations of research projects (2 sessions) </li></ul>
  12. 12. Expected work load <ul><li>Three to four homework sets (approximately one for each block of lectures) </li></ul><ul><li>Two weeks to turn in each homework set </li></ul><ul><li>Mid-term exam in early October </li></ul><ul><li>Students select project topic in late October </li></ul><ul><li>Student presentations of projects in the first week of December </li></ul><ul><li>Final exam </li></ul>
  13. 13. Course project <ul><li>Students work on a project in teams of two or three </li></ul><ul><li>Project is chosen by the students with the advice and consent of the instructor </li></ul><ul><li>Project investigation/implementation should require, per student, approximately twice the work of a regular homework set </li></ul>
  14. 14. Programming <ul><li>Each student selects their own programming language (must be available at UTD and accessible to TA) </li></ul><ul><li>Examples: C, C++, Java, Perl, Python </li></ul><ul><li>Can also use a package/programming environment specifically tailored to bioinformatics </li></ul>
  15. 15. One likely package <ul><li>R </li></ul><ul><li>R is the free alternative to S-Plus; both implement the S language, which was developed at AT&T Bell Laboratories </li></ul><ul><li>S-Plus is the extensible, programmable alternative to statistical packages like SAS and SPSS </li></ul><ul><li>If you know C, you will be right at home with R </li></ul>
  16. 16. Another likely package <ul><li>BioPerl </li></ul><ul><li>A collection of library modules in Perl written by and for bioinformaticians </li></ul><ul><li>Perl has built-in support for hashes as a basic data structure, string matching, and regular expressions </li></ul><ul><li>Perl is weak at object-oriented programming and run-time efficiency </li></ul><ul><li>Easy to learn </li></ul>
  17. 17. Grading <ul><li>Class participation: 20% </li></ul><ul><li>Homework assignments: 30% (total) </li></ul><ul><li>Midterm: 10% </li></ul><ul><li>Research project: 20% </li></ul><ul><li>Final exam: 20% </li></ul>
  18. 18. Textbooks <ul><li>No good integrated textbook on data mining from a computational biology perspective </li></ul><ul><li>We will use one textbook covering bioinformatics algorithms and another on data mining in general, plus additional chapters from other books and research articles </li></ul><ul><li>Copies of chapters / research articles will be provided </li></ul>
  19. 19. Recommended textbook #1 <ul><li>“ An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)”, by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004. </li></ul><ul><li>ISBN 0262101068 </li></ul><ul><li>448 pages </li></ul><ul><li>Available on for $49, Barnes and Noble for $49 </li></ul>
  20. 20. Recommended textbook #2 <ul><li>“ Data Mining : Concepts and Techniques” by Jiawei Han and Micheline Kamber, Elsevier, second edition, 2006. </li></ul><ul><li>ISBN 1558609016 </li></ul><ul><li>800 pages </li></ul><ul><li>Available on for $55, Barnes and Noble for $55 </li></ul>
  21. 21. Supplementary textbooks <ul><li>“Bioinformatics: The Machine Learning Approach” by Pierre Baldi and Søren Brunak, 2nd edition, 2001. </li></ul><ul><li>“Data Mining: Multimedia, Soft Computing, and Bioinformatics” by Sushmita Mitra and Tinku Acharya, 2003. </li></ul><ul><li>Both of the above are available as full-text eBooks via . </li></ul>
  22. 22. Background reading <ul><li>Biology: “Molecular Biology of the Cell” by Bruce Alberts et al., 4th edition, 2002. </li></ul><ul><li>Machine learning: “Machine Learning” by Tom Mitchell, 1997. </li></ul>
  23. 23. Background reading (II) <ul><li>Statistics: “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2001. </li></ul><ul><li>Data structures and algorithms: “Introduction to Algorithms” by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, 2nd edition, 2001. </li></ul>
  24. 24. So what is it all about? <ul><li>Three parts: </li></ul><ul><ul><li>Bioinformatics / computational biology </li></ul></ul><ul><ul><li>Data mining </li></ul></ul><ul><ul><li>Text mining </li></ul></ul>
  25. 25. Bioinformatics <ul><li>A fast developing discipline </li></ul><ul><li>We will discuss </li></ul><ul><ul><li>basic concepts of molecular biology </li></ul></ul><ul><ul><li>databases of biological data </li></ul></ul><ul><ul><li>structure and function of DNA, RNA, proteins </li></ul></ul><ul><ul><li>sequence searching (BLAST) </li></ul></ul><ul><ul><li>sequence similarity and comparison </li></ul></ul><ul><ul><li>protein structure (2D and 3D) </li></ul></ul><ul><ul><li>protein motifs and patterns </li></ul></ul><ul><ul><li>microarrays </li></ul></ul><ul><ul><li>phylogenetics </li></ul></ul>
  26. 26. Data mining <ul><li>Given a large amount of data of known types, extract useful information </li></ul><ul><li>We will discuss </li></ul><ul><ul><li>data cleanup and outliers </li></ul></ul><ul><ul><li>model construction </li></ul></ul><ul><ul><li>data and dimensionality reduction </li></ul></ul><ul><ul><li>classification </li></ul></ul><ul><ul><li>prediction / probability estimation </li></ul></ul><ul><ul><li>clustering </li></ul></ul><ul><ul><li>measuring performance </li></ul></ul>
  27. 27. Text mining <ul><li>Not only do we have a large amount of raw data, but we also don’t know what each item means </li></ul><ul><li>We will discuss: </li></ul><ul><ul><li>tokenization and basics of text processing </li></ul></ul><ul><ul><li>recognition of terms and entities </li></ul></ul><ul><ul><li>classification </li></ul></ul><ul><ul><li>dictionary creation </li></ul></ul><ul><ul><li>relationship learning and extraction </li></ul></ul><ul><ul><li>document level clustering and information retrieval </li></ul></ul>
  28. 28. We will be talking about ... <ul><li>What biology is </li></ul><ul><li>From organisms to cells and their contents </li></ul><ul><li>Basic building blocks: </li></ul><ul><ul><li>proteins, DNA, RNA </li></ul></ul><ul><li>Basic cell processes: </li></ul><ul><ul><li>replication, transcription, translation, regulation </li></ul></ul><ul><li>Goals of biology </li></ul>
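
For readers coming from computer science, the basic cell processes named on the last slide can be pictured as simple string operations. The following is only a minimal sketch, not course material: the function names and the deliberately partial codon table are illustrative assumptions, the input is assumed to be the coding strand of DNA, and real biology involves far more (regulation, splicing, reading frames) than the sketch shows.

```python
# Illustrative sketch only. Names and the partial codon table are assumptions
# made for this example; a complete genetic code has 64 codons.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Small subset of the standard genetic code (mRNA codon -> amino acid letter).
CODON_TABLE = {
    "AUG": "M",                                      # methionine, usual start codon
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def reverse_complement(dna):
    """Replication, simplified: the complementary strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Transcription, simplified: the mRNA matches the coding strand with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translation, simplified: read codons from position 0 until a stop codon.

    Codons missing from the partial table are reported as 'X'.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGTTTGGCGCATAA"
    mrna = transcribe(dna)
    print(reverse_complement(dna))
    print(mrna)
    print(translate(mrna))
```

Regulation, the fourth process on the slide, has no counterpart here; it governs when and how much of each gene is transcribed and does not reduce to a string transformation.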