notes #1

  1. 1. Introduction to class 1/4/2010 TCSS588A Isabelle Bichindaritz
  2. 2. Outline <ul><li>Introduction to class </li></ul><ul><li>Introduction to machine learning / data mining </li></ul><ul><li>Introduction to the Life Sciences </li></ul><ul><li>Example and importance of microarray data </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  3. 3. Introduction to Class <ul><li>This class focuses on learning how to apply data mining to biological and medical fields to solve some of their problems. </li></ul><ul><li>Does not require prior knowledge in the application areas. </li></ul><ul><li>Does not require prior knowledge in machine learning and/or data mining. </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  4. 4. Introduction to Class <ul><li>Data mining topics and the tools used for each: </li></ul><ul><ul><li>Statistical data analysis and inference – SPSS, R </li></ul></ul><ul><ul><li>Clustering – SPSS, GenePattern </li></ul></ul><ul><ul><li>Machine learning – RapidMiner </li></ul></ul><ul><ul><li>Classification – RapidMiner, R </li></ul></ul><ul><li>Requirement: use biological datasets and/or medical datasets. </li></ul><ul><li>The Seattle area has many renowned research institutes. </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  5. 5. 1/4/2010 TCSS588A Isabelle Bichindaritz Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  6. 6. The Human Genome Project <ul><li>The Human Genome Project </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  7. 7. Data Mining Motivation: “Necessity is the Mother of Invention” <ul><li>Data explosion problem </li></ul><ul><ul><li>Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories </li></ul></ul><ul><li>We are drowning in data, but starving for knowledge! </li></ul><ul><li>Solution: Data warehousing and data mining </li></ul><ul><ul><li>Data warehousing and on-line analytical processing </li></ul></ul><ul><ul><li>Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  8. 8. What Is Data Mining? <ul><li>Data mining (knowledge discovery in databases): </li></ul><ul><ul><li>Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases </li></ul></ul><ul><li>Alternative names and their “inside stories”: </li></ul><ul><ul><li>Data mining: a misnomer? </li></ul></ul><ul><ul><li>Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. </li></ul></ul><ul><li>What is not data mining? </li></ul><ul><ul><li>(Deductive) query processing. </li></ul></ul><ul><ul><li>Expert systems or small ML/statistical programs are often a part of data mining </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  9. 9. What Is Data Mining? <ul><li>Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. </li></ul><ul><li>Machine learning and knowledge discovery are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery. </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  10. 10. Data Mining: A KDD Process <ul><ul><li>Data mining: the core of knowledge discovery process. </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz Data Cleaning Data Integration Databases Data Warehouse Knowledge Task-relevant Data Selection Data Mining Pattern Evaluation
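The stages in the KDD diagram above (cleaning and integration, selection of task-relevant data, mining, pattern evaluation) can be sketched as a chain of small functions. This is only an illustrative sketch: the function names, the toy records from two hypothetical labs, and the "average expression per tissue" pattern are assumptions made for the example, not part of the course material.

```python
def integrate(*sources):
    """Data integration: merge several sources into one record list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, key, value):
    """Data mining (toy version): average of `value` per `key` group."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep only patterns above a chosen threshold."""
    return {k: v for k, v in patterns.items() if v >= threshold}

# Hypothetical toy data from two sources.
lab_a = [{"id": 1, "tissue": "tumor", "expr": 8.0},
         {"id": 2, "tissue": "normal", "expr": 2.0}]
lab_b = [{"id": 3, "tissue": "tumor", "expr": 6.0},
         {"id": 4, "tissue": "normal", "expr": None}]

warehouse = clean(integrate(lab_a, lab_b))
task_data = select(warehouse, ["tissue", "expr"])
knowledge = evaluate(mine(task_data, "tissue", "expr"), threshold=5.0)
```

Each stage consumes the output of the previous one, which mirrors the left-to-right flow of the diagram; in practice each stage is far more involved than a one-line filter.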
  11. 11. Machine Learning Functionalities (1) <ul><li>Concept description: Characterization and discrimination </li></ul><ul><ul><li>Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions </li></ul></ul><ul><li>Association (correlation and causality) </li></ul><ul><ul><li>Multi-dimensional vs. single-dimensional association </li></ul></ul><ul><ul><li>age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%] </li></ul></ul><ul><ul><li>contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%] </li></ul></ul><ul><ul><li>Diaper → Beer [support = 0.5%, confidence = 75%] </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
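The bracketed numbers attached to each association rule are its support and confidence. A minimal sketch of how they are computed for a rule such as Diaper → Beer, assuming transactions are represented as Python sets (the four toy baskets below are made up for illustration):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical market baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})  # 2 of 3 diaper baskets
```

Support measures how often the whole rule applies in the data; confidence measures how often the right-hand side holds when the left-hand side does.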
  12. 12. Machine Learning Functionalities (2) <ul><li>Classification and Prediction </li></ul><ul><ul><li>Finding models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><li>E.g., classify countries based on climate, or classify cars based on gas mileage </li></ul></ul><ul><ul><li>Presentation: decision-tree, classification rule, neural network </li></ul></ul><ul><ul><li>Prediction: Predict some unknown or missing numerical values </li></ul></ul><ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
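The clustering principle above (high similarity within a cluster, low similarity between clusters) can be illustrated with a very small k-means loop on one-dimensional data. This is a sketch under simplifying assumptions: the values, the two starting centers, and the fixed iteration count are arbitrary choices for the example, and k-means is only one of many clustering methods.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical measurements with two obvious groups.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No class labels are supplied; the two groups emerge only from the distances between the values.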
  13. 13. Machine Learning Functionalities (3) <ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: a data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis </li></ul></ul><ul><li>Trend and evolution analysis </li></ul><ul><ul><li>Trend and deviation: regression analysis </li></ul></ul><ul><ul><li>Sequential pattern mining, periodicity analysis </li></ul></ul><ul><ul><li>Similarity-based analysis </li></ul></ul><ul><li>Other pattern-directed or statistical analyses </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
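One simple way to flag objects that "do not comply with the general behavior of the data" is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes a cutoff of 2 standard deviations and made-up readings; both are illustrative choices, and real outlier analysis often uses more robust statistics.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = zscore_outliers(readings)
```

Whether the flagged value is noise to discard or the interesting event (as in fraud detection) depends on the application.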
  14. 14. Are All the “Discovered” Patterns Interesting? <ul><li>A data mining or machine learning system/query may generate thousands of patterns, but not all of them are interesting. </li></ul><ul><ul><li>Suggested approach: Human-centered, query-based, focused mining </li></ul></ul><ul><li>Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm </li></ul><ul><li>Objective vs. subjective interestingness measures: </li></ul><ul><ul><li>Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. </li></ul></ul><ul><ul><li>Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  15. 15. Can We Find All and Only Interesting Patterns? <ul><li>Find all the interesting patterns: Completeness </li></ul><ul><ul><li>Can a data mining or machine learning system find all the interesting patterns? </li></ul></ul><ul><ul><li>Association vs. classification vs. clustering </li></ul></ul><ul><li>Search for only interesting patterns: Optimization </li></ul><ul><ul><li>Can a data mining or machine learning system find only the interesting patterns? </li></ul></ul><ul><ul><li>Approaches </li></ul></ul><ul><ul><ul><li>First generate all the patterns and then filter out the uninteresting ones. </li></ul></ul></ul><ul><ul><ul><li>Generate only the interesting patterns: mining query optimization </li></ul></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
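The second approach (generate only the interesting patterns) is what level-wise frequent itemset mining does: a candidate itemset is built only from smaller itemsets that are already known to be frequent. The sketch below follows the Apriori idea under the assumption that "interesting" simply means "support at or above a minimum"; the transactions and threshold are invented for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built only from
    frequent itemsets of size k-1, so most patterns are never generated."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sup(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = sup(s)
        k += 1
        frequent_prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent_prev
                             for s in combinations(c, k - 1))]
        current = [c for c in candidates if sup(c) >= min_support]
    return result

# Hypothetical transactions.
transactions = [frozenset("ABC"), frozenset("AB"), frozenset("AC"),
                frozenset("BC"), frozenset("ABC")]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

With these transactions every single item and every pair is frequent, while the triple {A, B, C} appears in only 2 of 5 transactions and is rejected. The first approach on the slide would instead enumerate every possible itemset and filter afterwards, which grows exponentially with the number of items.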
  16. 16. Data Mining: Confluence of Multiple Disciplines 1/4/2010 TCSS588A Isabelle Bichindaritz Data Mining Database Technology Statistics Other Disciplines Information Science Machine Learning Visualization
  17. 17. Data Mining: Classification Schemes <ul><li>General functionality </li></ul><ul><ul><li>Descriptive data mining </li></ul></ul><ul><ul><li>Predictive data mining </li></ul></ul><ul><li>Different views, different classifications </li></ul><ul><ul><li>Kinds of databases to be mined </li></ul></ul><ul><ul><li>Kinds of knowledge to be discovered </li></ul></ul><ul><ul><li>Kinds of techniques utilized </li></ul></ul><ul><ul><li>Kinds of applications adapted </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  18. 18. Architecture of a Typical Data Mining System 1/4/2010 TCSS588A Isabelle Bichindaritz Data Warehouse Data cleaning & data integration Filtering Databases Database or data warehouse server Data mining engine Pattern evaluation Graphical user interface Knowledge-base
  19. 19. Introduction to the Life Sciences <ul><li>What is human DNA? </li></ul><ul><ul><li>DNA stands for DeoxyriboNucleic Acid </li></ul></ul><ul><ul><li>DNA stores the genetic material, organized into chromosomes, in each cell nucleus </li></ul></ul><ul><ul><li>DNA is transcribed into RNA in the nucleus, and the RNA then moves out to the cytoplasm ( transcription ) </li></ul></ul><ul><ul><li>RNA stands for RiboNucleic Acid </li></ul></ul><ul><ul><li>RNA is translated into proteins by a cytoplasmic structure called a ribosome ( translation ) </li></ul></ul><ul><ul><li>DNA → RNA → proteins </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
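The DNA → RNA → protein flow can be mimicked on strings. The sketch below makes two simplifying assumptions that should be kept in mind: the input is taken to be the coding strand (so transcription is just T replaced by U), and the codon table is a deliberately partial excerpt of the standard genetic code, with unknown codons shown as "?".

```python
# Partial excerpt of the standard genetic code (assumption: enough for the demo only).
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(dna_coding_strand):
    """Transcription, simplified: the mRNA matches the coding strand with U for T."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translation: read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")
protein = translate(mrna)
```

Here the 12-base DNA fragment yields four codons: three amino acids followed by a stop.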
  20. 20. Introduction to the Life Sciences 1/4/2010 TCSS588A Isabelle Bichindaritz DNA mRNA rRNA tRNA transcription Ribosome Protein translation
  21. 21. Introduction to the Life Sciences <ul><li>Gene expression products are the molecular compounds produced from genes (e.g., RNA). Genes are expressed by being transcribed into RNA, and this transcript may then be translated into protein. </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  22. 22. Introduction to the Life Sciences <ul><li>DNA and RNA are composed of </li></ul><ul><ul><li>Nucleotide bases </li></ul></ul><ul><ul><ul><li>Pyrimidines </li></ul></ul></ul><ul><ul><ul><ul><li>Cytosine (C) (DNA & RNA) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Thymine (T) (DNA) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Uracil (U) (RNA) </li></ul></ul></ul></ul><ul><ul><ul><li>Purines </li></ul></ul></ul><ul><ul><ul><ul><li>Adenine (A) (DNA & RNA) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Guanine (G) (DNA & RNA) </li></ul></ul></ul></ul><ul><ul><li>Sugars (ribose for RNA, deoxyribose for DNA) </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  23. 23. Introduction to the Life Sciences <ul><li>A succession of nucleotides forms a single strand of DNA </li></ul><ul><li>Two strands of DNA pair up in the 3-D shape of a double helix, where bases are paired (bp = base pair) </li></ul><ul><li>Pairing of the bases (A=T, G≡C) provides the hydrogen bonds that hold the double helix together. </li></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
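Because A always pairs with T and G with C, one strand fully determines the other. A small sketch of that rule on strings follows; the function names are illustrative, and the bond counts used are the standard two hydrogen bonds per A-T pair and three per G-C pair.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Return the partner strand, written 5' to 3' like the input."""
    return strand.upper().translate(COMPLEMENT)[::-1]

def hydrogen_bonds(strand):
    """Hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    strand = strand.upper()
    at_pairs = strand.count("A") + strand.count("T")
    gc_pairs = strand.count("G") + strand.count("C")
    return 2 * at_pairs + 3 * gc_pairs

partner = reverse_complement("ATGC")
bonds = hydrogen_bonds("ATGC")
```

The partner strand is reversed because the two strands of the helix run in opposite directions.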
  24. 24. Introduction to the Life Sciences 1/4/2010 TCSS588A Isabelle Bichindaritz
  25. 25. 1/4/2010 TCSS588A Isabelle Bichindaritz Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  26. 26. Introduction to the Life Sciences <ul><li>Genes </li></ul><ul><ul><li>A gene is a part of the genome that can be expressed (transcribed, and often translated) </li></ul></ul><ul><ul><li>A gene may encode a protein or RNA sequence </li></ul></ul><ul><ul><li>Genes are separated by non-coding regions </li></ul></ul><ul><ul><li>Genes are concentrated in certain regions of the genome rich in G and C </li></ul></ul><ul><ul><li>Regions rich in A and T contain few genes </li></ul></ul><ul><ul><li>Between the two, CpG islands (stretches with a high frequency of CG dinucleotides) separate coding regions from non-coding ones </li></ul></ul><ul><ul><li>Non-coding regions (e.g., introns) can be parts of genes </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
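Since genes tend to sit in G+C-rich regions, a first look at a sequence often includes its GC content along a sliding window. The sketch below is an assumption-laden toy: the sequence, window size, and step are made up, and real analyses use much longer windows.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window, step):
    """GC content of each window, as (start position, fraction) pairs."""
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

# Hypothetical sequence: an AT-rich half followed by a GC-rich half.
profile = gc_windows("ATATATATGCGCGCGC", window=8, step=8)
```

A jump in the windowed GC fraction, as between the two halves here, is the kind of signal that hints at a change in gene density.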
  27. 27. Introduction to the Life Sciences <ul><li>Genomes, diversity, size, structure </li></ul><ul><ul><li>Profound diversity among the genomes of living organisms. </li></ul></ul><ul><ul><li>DNA (cells), DNA or RNA (phage, virus) </li></ul></ul><ul><ul><li>Direction: from 5’ to 3’ of molecule (double-stranded DNA), or both directions (single-stranded) </li></ul></ul><ul><ul><li>Genome organized or not in chromosomes </li></ul></ul><ul><ul><li>Human genome: 23 pairs of chromosomes (22 autosome pairs plus the sex chromosomes), about 3 billion base pairs, and roughly 30,000 genes by the 2001 estimate (later estimates are closer to 20,000 protein-coding genes) </li></ul></ul><ul><ul><li>Other species’ genomes vary in size and number of genes </li></ul></ul><ul><ul><li>The human genome has only about twice as many genes as a simple worm </li></ul></ul><ul><ul><li>GenBank database </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  28. 28. Introduction to the Life Sciences <ul><li>Proteomes </li></ul><ul><ul><li>The proteome is the set of proteins that can be expressed from a genome </li></ul></ul><ul><ul><li>Determination of: </li></ul></ul><ul><ul><ul><li>Sequence of encoding genes </li></ul></ul></ul><ul><ul><ul><li>Location of the genes </li></ul></ul></ul><ul><ul><ul><li>Function of protein encoding genes </li></ul></ul></ul><ul><ul><ul><li>Different biochemical states (phosphorylation, glycosylation, co-enzymes…) </li></ul></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  29. 29. Introduction to the Life Sciences <ul><li>Gene ontologies </li></ul><ul><ul><li>Gene ontology consortium </li></ul></ul><ul><ul><ul><li>Dynamic controlled vocabulary to describe </li></ul></ul></ul><ul><ul><ul><ul><li>Molecular function (Ex: DNA polymerase, …) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Biological process (Ex: DNA synthesis, respiration, …) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Cellular component (Ex: nucleus, ribosome, …) </li></ul></ul></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  30. 30. Principles of Bioinformatics <ul><li>Biological information </li></ul><ul><ul><li>Molecules at the basis of life can be represented as digital symbol strings (DNA, RNA, …) </li></ul></ul><ul><ul><li>Digital symbols (monomers) constitute an alphabet </li></ul></ul><ul><ul><li>Unique representation </li></ul></ul><ul><ul><li>Importance of probabilistic models </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
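Treating a molecule as a string over a small alphabet makes probabilistic modelling straightforward. As a minimal sketch, the code below estimates an independent per-symbol (zero-order) model from a training string and scores another string under it. The training sequence is made up, and real models (Markov chains, hidden Markov models) also capture dependence between neighbouring symbols.

```python
import math
from collections import Counter

def symbol_frequencies(seq):
    """Estimate the probability of each alphabet symbol from its frequency."""
    counts = Counter(seq)
    return {symbol: count / len(seq) for symbol, count in counts.items()}

def log_likelihood(seq, model):
    """Log-probability of `seq` if every symbol is drawn independently.
    Raises KeyError for a symbol the model has never seen."""
    return sum(math.log(model[symbol]) for symbol in seq)

# Hypothetical training sequence over the DNA alphabet.
model = symbol_frequencies("AACG")
score = log_likelihood("AC", model)
```

Comparing such scores under two different models (for example coding versus non-coding) is the basic idea behind many sequence classifiers.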
  31. 31. Principles of Bioinformatics <ul><li>Database annotation quality </li></ul><ul><ul><li>In addition to natural noise, data are distorted by people’s annotations (curation of the data) </li></ul></ul><ul><ul><li>Resulting error is very significant </li></ul></ul><ul><ul><li>Reasons: </li></ul></ul><ul><ul><ul><li>Storage of positions in a sequence, not content </li></ul></ul></ul><ul><ul><ul><li>Difficulty of storing content </li></ul></ul></ul><ul><ul><li>Need to check the data </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  32. 32. Principles of Bioinformatics <ul><li>Database redundancy </li></ul><ul><ul><li>Different representations: RNA, cDNA (the corresponding complementary DNA) </li></ul></ul><ul><ul><li>Different methods: single-pass sequence, multi-fold repetition of a sequence </li></ul></ul><ul><ul><li>Different fragments: pre-mRNA can lead to several levels of splicing in cDNA, alternative splicing </li></ul></ul><ul><ul><li>Redundancy is a source of error: </li></ul></ul><ul><ul><ul><li>Bias of over-represented fragments for closely related segments </li></ul></ul></ul><ul><ul><ul><li>Bias of over-represented fragments for correlations </li></ul></ul></ul><ul><ul><ul><li>Prediction accuracy is overestimated if input and output are related </li></ul></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
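One common way to limit the bias from over-represented sequences is to keep only one representative of each group of near-identical entries. The sketch below is a simplified illustration: it assumes equal-length, already aligned sequences and an arbitrary 80% identity cutoff, whereas real redundancy reduction relies on proper sequence alignment.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def remove_redundant(sequences, threshold):
    """Greedy filter: keep a sequence only if it is less than `threshold`
    identical to every sequence already kept."""
    kept = []
    for seq in sequences:
        if all(identity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept

# Hypothetical database entries; the second and fourth nearly duplicate the first.
entries = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "ACGTACGT"]
non_redundant = remove_redundant(entries, threshold=0.8)
```

Applying such a filter before splitting data into training and test sets helps avoid the overestimated prediction performance mentioned on the slide.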
  33. 33. Principles of Bioinformatics <ul><li>Database redundancy </li></ul><ul><ul><li>Better to clean the data first </li></ul></ul><ul><ul><li>Data mining cleaning methods apply </li></ul></ul><ul><ul><li>Difficulty of differentiating between truly analogous sequences and related ones </li></ul></ul><ul><ul><li>A sequence profile describes amino acid variation in a family of sequences </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
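A sequence profile can be pictured as a table giving, for each aligned position, how often each amino acid occurs in the family. The sketch below assumes the sequences are already aligned and of equal length; the three-residue "family" is invented for illustration.

```python
from collections import Counter

def sequence_profile(aligned):
    """Per-position amino acid frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        profile.append({aa: c / len(aligned) for aa, c in counts.items()})
    return profile

# Hypothetical aligned protein family.
family = ["MKV", "MRV", "MKI"]
profile = sequence_profile(family)
```

Position 0 is fully conserved (always M), while positions 1 and 2 show the variation the slide refers to.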
  34. 34. Principles of Bioinformatics <ul><li>Main bioinformatics questions </li></ul><ul><ul><li>Determine the exact transition between coding and non coding regions of genes </li></ul></ul><ul><ul><li>Find genes in prokaryotes and eukaryotes </li></ul></ul><ul><ul><li>Determine transcription initiation and termination </li></ul></ul><ul><ul><li>Sequence clustering and cluster topology </li></ul></ul><ul><ul><li>Protein structure prediction </li></ul></ul><ul><ul><li>Protein function prediction </li></ul></ul><ul><ul><li>Protein family classification </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz
  35. 35. Principles of Bioinformatics <ul><li>Question </li></ul><ul><ul><li>Propose questions pertinent for bioinformatics </li></ul></ul><ul><ul><li>Propose questions pertinent for medical informatics </li></ul></ul>1/4/2010 TCSS588A Isabelle Bichindaritz