Data and Text Mining for Computational Biology
Transcript

  • 1. Data and Text Mining for Computational Biology Introduction
  • 2. Course information
    • CS 6365
    • Data and Text Mining for Computational Biology
    • Meets Monday and Wednesday 4:00-5:15 pm at ECSS 2.203
  • 3. Instructor
    • Vasileios Hatzivassiloglou
    • Associate Professor, Computer Science
    • Founding Professor, Bioengineering
    • Research focus: Discover knowledge from massive amounts of raw data
      • data not the same as information
      • information overload
  • 4. Research Interests
    • Text analysis, machine learning, intelligent information retrieval, summarization, question answering, bioinformatics, medical informatics
  • 5. Contact information
    • Office hours: Monday and Wednesday 6:00-7:00pm and by appointment
    • Office location: ECSS 3.406
    • [email_address]
    • (972) 883-4342
    • Teaching Assistant: TBA
  • 6. Course goals
    • Introduce the field of bioinformatics
    • Discuss primary techniques used for data mining
    • Introduce text mining and additional issues it brings to data mining methods
    • Use examples from computational biology
  • 7. Intended audience
    • For both computer scientists and biologists
    • Not an easy task to balance the two
    • Focus on data and text mining algorithms and applications
      • Coverage of machine learning background
      • No extensive algorithmic analysis / computational complexity
      • Medium level of programming
  • 8. Prerequisites
    • Officially CS 6325 – Introduction to Bioinformatics
    • Waived for this offering of the course
    • You should know
      • Basic data structures (multidimensional arrays, hash tables, binary trees)
      • One high-level programming language, and how to adapt to a new one as needed
      • How to install and use external software packages
  • 9. You need not know
    • Molecular biology
    • Machine learning
    • Data mining (in general)
    • Text analysis / natural language processing
    • Information retrieval
    • Artificial intelligence
  • 10. Course level
    • Introductory graduate course (MS or first-year PhD)
    • Maturity in programming and data structures at the level of a Computer Science senior
    • Ability to access (and interest in accessing) the primary literature in a guided fashion
  • 11. Course structure
    • 6 lectures on biological background and bioinformatics in general
    • 6 lectures on data similarity
    • 8 lectures on data mining methods
    • 2 lectures on text mining and knowledge mining methods
    • student presentations of research projects (2 sessions)
  • 12. Expected work load
    • Three to four homework sets (approximately one for each block of lectures)
    • Two weeks to turn in each homework set
    • Mid-term exam in early October
    • Students select project topic in late October
    • Student presentations of projects in the first week of December
    • Final exam
  • 13. Course project
    • Students work on a project in teams of two or three
    • Project is chosen by the students with the advice and consent of the instructor
    • Project investigation/implementation should require, per student, approximately twice the work of a regular homework set
  • 14. Programming
    • Each student selects their own programming language (must be available at UTD and accessible to TA)
    • Examples: C, C++, Java, Perl, Python
    • Can also use a package/programming environment specifically tailored to bioinformatics
  • 15. One likely package
    • R (http://www.r-project.org/)
    • R is the free alternative to S-Plus, the commercial implementation of the S language developed at AT&T Bell Labs
    • S-Plus is the extensible, programmable alternative to statistical packages like SAS and SPSS
    • If you know C, you will be right at home with R
  • 16. Another likely package
    • BioPerl (http://bio.perl.org/)
    • A collection of library modules in Perl written by and for bioinformaticians
    • Perl supports high-level operations such as hashes as a basic data structure, string matching, and regular expressions
    • Perl is really bad at OOP and efficiency
    • Easy to learn
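
The high-level operations named on this slide (hashes, string matching, regular expressions) are not specific to Perl. As a rough illustration only, here is what they might look like in Python, another language on the course's list. The sequence, the function names, and the TATA-like pattern are all made up for this sketch and do not come from the course or from BioPerl.

```python
import re
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k using a hash table."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_motif(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping regex match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Hypothetical DNA fragment, invented for illustration.
dna = "ATGCGATATAAGGCTATAAA"

kmers = count_kmers(dna, 3)            # hash: 3-mer -> number of occurrences
hits = find_motif(dna, r"TATA[AT]A")   # regex: a simple TATA-like pattern

print(kmers.most_common(3))
print(hits)
```

The dictionary-backed `Counter` plays the role of a Perl hash, and `re.finditer` covers the string-matching and regular-expression side.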
  • 17. Grading
    • Class participation: 20%
    • Homework assignments: 30% (total)
    • Midterm: 10%
    • Research project: 20%
    • Final exam: 20%
  • 18. Textbooks
    • No good integrated textbook on data mining from a computational biology perspective
    • We will use one textbook covering bioinformatics algorithms and another on data mining in general, plus additional chapters from other books and research articles
    • Copies of chapters / research articles will be provided
  • 19. Recommended textbook #1
    • “An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)” by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004.
    • ISBN 0262101068
    • 448 pages
    • Available on Amazon.com for $49, Barnes and Noble for $49
  • 20. Recommended textbook #2
    • “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber, Elsevier, second edition, 2006.
    • ISBN 1558609016
    • 800 pages
    • Available on Amazon.com for $55, Barnes and Noble for $55
  • 21. Supplementary textbooks
    • “Bioinformatics: The Machine Learning Approach” by Pierre Baldi and Soren Brunak, 2nd edition, 2001.
    • “Data Mining: Multimedia, Soft Computing, and Bioinformatics” by Sushmita Mitra and Tinku Acharya, 2003.
    • Both of the above are available as full-text eBooks via http://library.utdallas.edu.
  • 22. Background reading
    • Biology: “Molecular Biology of the Cell” by Bruce Alberts et al., 4th edition, 2002.
    • Machine learning: “Machine Learning” by Tom Mitchell, 1997.
  • 23. Background reading (II)
    • Statistics: “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2001.
    • Data structures and algorithms: “Introduction to Algorithms” by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, 2nd edition, 2001.
  • 24. So what is it all about?
    • Three parts:
      • Bioinformatics / computational biology
      • Data mining
      • Text mining
  • 25. Bioinformatics
    • A fast developing discipline
    • We will discuss
      • basic concepts of molecular biology
      • databases of biological data
      • structure and function of DNA, RNA, proteins
      • sequence searching (BLAST)
      • sequence similarity and comparison
      • protein structure (2D and 3D)
      • protein motifs and patterns
      • microarrays
      • phylogenetics
  • 26. Data mining
    • Given a large amount of data of known types, extract useful information
    • We will discuss
      • data cleanup and outliers
      • model construction
      • data and dimensionality reduction
      • classification
      • prediction / probability estimation
      • clustering
      • measuring performance
  • 27. Text mining
    • Not only do we have a large amount of raw data, but we also don’t know what each item means
    • We will discuss:
      • tokenization and basics of text processing
      • recognition of terms and entities
      • classification
      • dictionary creation
      • relationship learning and extraction
      • document level clustering and information retrieval
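
To make the first two topics on this slide concrete, here is a minimal sketch of tokenization followed by dictionary-based entity recognition. It is an illustration under stated assumptions, not the method taught in the course: the tokenizer rule, the tiny dictionary, its labels, and the example sentence are all hypothetical.

```python
import re


def tokenize(text):
    """Split text into word-like tokens, keeping internal hyphens and digits."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def tag_entities(tokens, dictionary):
    """Return (token, label) pairs for tokens found in the dictionary.

    Lookup is case-insensitive and matches whole tokens only.
    """
    return [(t, dictionary[t.lower()]) for t in tokens if t.lower() in dictionary]


# Hypothetical mini-dictionary; a real one would come from a curated resource.
ENTITY_DICTIONARY = {"p53": "PROTEIN", "brca1": "GENE", "mdm2": "PROTEIN"}

sentence = "MDM2 binds p53 and inhibits p53-mediated transcription."
tokens = tokenize(sentence)
entities = tag_entities(tokens, ENTITY_DICTIONARY)

print(tokens)
print(entities)
```

Because the lookup matches whole tokens, the mention inside "p53-mediated" is missed. Gaps of this kind are one reason term and entity recognition is treated as a topic of its own rather than as a simple lookup.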
  • 28. We will be talking about ...
    • What biology is
    • From organisms to cells and their contents
    • Basic building blocks:
      • proteins, DNA, RNA
    • Basic cell processes:
      • replication, transcription, translation, regulation
    • Goals of biology
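
Two of the cell processes listed above, transcription and translation, can be sketched in a few lines. This is a deliberately simplified model: it assumes the input is the coding strand of DNA (so transcription reduces to replacing T with U), it uses the standard genetic code, and it ignores regulation, splicing, and start-codon search. The function names and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 12-base coding sequence: start codon, two codons, stop codon.
dna = "ATGGCCTGGTAA"
mrna = transcribe(dna)
protein = translate(mrna)

print(mrna)
print(protein)
```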