notes #1

Transcript

  • 1. Introduction to class 1/4/2010 TCSS588A Isabelle Bichindaritz
  • 2. Outline
    • Introduction to class
    • Introduction to machine learning / data mining
    • Introduction to the Life Sciences
    • Example and importance of microarray data
  • 3. Introduction to Class
    • This class focuses on learning how to apply data mining to biological and medical fields to solve some of their problems.
    • Does not require prior knowledge in the application areas.
    • Does not require prior knowledge in machine learning and/or data mining.
  • 4. Introduction to Class
    • Data mining specialized in
      • Statistical data analysis and inference – SPSS, R language
      • Clustering – SPSS, GenePattern
      • Machine learning – RapidMiner
      • Classification – RapidMiner, R language
    • Requirement: use biological datasets and/or medical datasets.
    • The Seattle area has many renowned research institutes.
  • 5. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  • 6. The Human Genome Project
  • 7. Data Mining Motivation: “Necessity is the Mother of Invention”
    • Data explosion problem
      • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
    • We are drowning in data, but starving for knowledge!
    • Solution: Data warehousing and data mining
      • Data warehousing and on-line analytical processing
      • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 8. What Is Data Mining?
    • Data mining (knowledge discovery in databases):
      • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
    • Alternative names and their “inside stories”:
      • Data mining: a misnomer?
      • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
    • What is not data mining?
      • (Deductive) query processing.
      • Expert systems or small ML/statistical programs are often a part of data mining
  • 9. What Is Data Mining?
    • Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.
    • Machine learning and knowledge discovery are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.
  • 10. Data Mining: A KDD Process
    • Data mining: the core of the knowledge discovery process.
    [Figure: KDD process diagram with the labels Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]
  • 11. Machine Learning Functionalities (1)
    • Concept description: Characterization and discrimination
      • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
    • Association (correlation and causality)
      • Multi-dimensional vs. single-dimensional association
      • age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
      • contains(T, “computer”) → contains(T, “software”) [1%, 75%]
      • Diaper → Beer [0.5%, 75%]
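The bracketed numbers on the association rules above are support and confidence. A minimal sketch of how they could be computed, assuming each transaction is a set of item names; the `baskets` data and the function names are invented for illustration and do not come from the slides:

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate of P(consequent | antecedent) = support(A and C) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(transactions, a)
    if sup_a == 0:
        return 0.0
    return support(transactions, both) / sup_a


# Hypothetical toy baskets, for illustration only.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

print(support(baskets, {"diaper", "beer"}))        # share of baskets with both items
print(confidence(baskets, {"diaper"}, {"beer"}))   # share of diaper baskets that also have beer
```

Here the rule Diaper → Beer holds in 2 of 4 baskets (support 50%) and in 2 of the 3 baskets that contain diapers (confidence about 67%).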
  • 12. Machine Learning Functionalities (2)
    • Classification and Prediction
      • Finding models (functions) that describe and distinguish classes or concepts for future prediction
      • E.g., classify countries based on climate, or classify cars based on gas mileage
      • Presentation: decision-tree, classification rule, neural network
      • Prediction: Predict some unknown or missing numerical values
    • Cluster analysis
      • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
      • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
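The clustering principle above (high similarity within a cluster, low similarity between clusters) can be illustrated with k-means, one common clustering method that the slides do not name specifically. This is a rough sketch under two assumptions: similarity is Euclidean distance, and the first `k` points are used as initial centroids so the run is reproducible. The `houses` coordinates are invented.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster index for each point (Lloyd's algorithm)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(houses, k=2))
```

No class labels are supplied; the two groups emerge from the distances alone, which is what distinguishes clustering from classification.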
  • 13. Machine Learning Functionalities (3)
    • Outlier analysis
      • Outlier: a data object that does not comply with the general behavior of the data
      • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
    • Trend and evolution analysis
      • Trend and deviation: regression analysis
      • Sequential pattern mining, periodicity analysis
      • Similarity-based analysis
    • Other pattern-directed or statistical analyses
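One simple way to make "does not comply with the general behavior of the data" concrete is a z-score test. The sketch below is only one of many possible outlier definitions; the threshold of 2 standard deviations and the `readings` values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_outliers(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one obvious exception.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(zscore_outliers(readings))
```

Whether the flagged value is noise to discard or the rare event of interest (as in fraud detection) depends on the application, not on the test itself.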
  • 14. Are All the “Discovered” Patterns Interesting?
    • A data mining or machine learning system/query may generate thousands of patterns; not all of them are interesting.
      • Suggested approach: Human-centered, query-based, focused mining
    • Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
    • Objective vs. subjective interestingness measures:
      • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
      • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
  • 15. Can We Find All and Only Interesting Patterns?
    • Find all the interesting patterns: Completeness
      • Can a data mining or machine learning system find all the interesting patterns?
      • Association vs. classification vs. clustering
    • Search for only interesting patterns: Optimization
      • Can a data mining or machine learning system find only the interesting patterns?
      • Approaches
        • First generate all the patterns and then filter out the uninteresting ones.
        • Generate only the interesting patterns — mining query optimization
  • 16. Data Mining: Confluence of Multiple Disciplines
    [Figure: Data Mining shown with the contributing disciplines Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
  • 17. Data Mining: Classification Schemes
    • General functionality
      • Descriptive data mining
      • Predictive data mining
    • Different views, different classifications
      • Kinds of databases to be mined
      • Kinds of knowledge to be discovered
      • Kinds of techniques utilized
      • Kinds of applications adapted
  • 18. Architecture of a Typical Data Mining System
    [Figure: architecture diagram with the labels Graphical user interface, Pattern evaluation, Data mining engine, Knowledge-base, Database or data warehouse server, Data cleaning & data integration, Filtering, Databases, Data Warehouse]
  • 19. Introduction to the Life Sciences
    • What is human DNA?
      • DNA stands for DeoxyriboNucleic Acid
      • DNA stores the genetic material, organized into chromosomes, in each cell nucleus
      • DNA is transcribed into RNA in the nucleus (transcription); the messenger RNA then leaves the nucleus
      • RNA stands for RiboNucleic Acid
      • RNA is translated into proteins by a structure in the cytoplasm called a ribosome (translation)
      • DNA → RNA → proteins
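The DNA → RNA → protein flow above can be sketched in a few lines. This is a simplified illustration, not a model of the cellular machinery: it assumes the input is the coding (sense) strand, uses the standard genetic code, starts at the first AUG, and ignores splicing. The function names and the example sequence are invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # raises KeyError on non-ACGU symbols
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTTAA"          # hypothetical toy gene
mrna = transcribe(dna)
print(mrna)                   # AUGGCCUUUUAA
print(translate(mrna))        # MAF (Met, Ala, Phe, then stop)
```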
  • 20. Introduction to the Life Sciences
    [Figure: diagram with the labels DNA, transcription, mRNA, rRNA, tRNA, Ribosome, translation, Protein]
  • 21. Introduction to the Life Sciences
    • Gene expression products are the molecules produced from genes (e.g., RNA). Genes are expressed by being transcribed into RNA, and this transcript may then be translated into protein.
  • 22. Introduction to the Life Sciences
    • DNA and RNA are composed of
      • Nucleotides, each made of a nitrogenous base, a sugar, and a phosphate group
        • Pyrimidine bases
          • Cytosine (C) (DNA & RNA)
          • Thymine (T) (DNA)
          • Uracil (U) (RNA)
        • Purine bases
          • Adenine (A) (DNA & RNA)
          • Guanine (G) (DNA & RNA)
      • Sugars (ribose for RNA, deoxyribose for DNA)
  • 23. Introduction to the Life Sciences
    • A succession of nucleotides forms a single strand of DNA
    • Two strands of DNA pair themselves in the 3-D shape of a double helix, where bases are paired (bp = base pair)
    • Pairing of the bases (A=T with two hydrogen bonds, G≡C with three) provides the hydrogen bonds that hold the double helix together.
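Because A pairs with T and G pairs with C, one strand determines the other. A small sketch of that rule, assuming sequences are plain strings over A, C, G, T written 5' to 3'; the function names are illustrative:

```python
# Translation table implementing the pairing rule A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base partner of a strand, in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """The opposite strand read 5' to 3' (the two strands run antiparallel)."""
    return complement(strand)[::-1]


def hydrogen_bonds(strand: str) -> int:
    """Hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    s = strand.upper()
    return 2 * (s.count("A") + s.count("T")) + 3 * (s.count("G") + s.count("C"))


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
print(hydrogen_bonds("ATGC"))      # 10
```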
  • 24. Introduction to the Life Sciences
  • 25. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  • 26. Introduction to the Life Sciences
    • Genes
      • A gene is a part of the genome that can be transcribed
      • A gene may encode a protein or an RNA sequence
      • Genes are separated by non-coding regions
      • Genes are concentrated in certain regions of the genome rich in G and C
      • Regions rich in A and T tend to contain few genes
      • Between the two, CpG islands (stretches with a high density of CG dinucleotides) often mark the boundary between coding and non-coding regions
      • Non-coding regions (introns) can be parts of genes
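The GC-rich versus AT-rich distinction above is easy to measure on a sequence. The sketch below computes GC content in fixed-size windows and a CpG observed/expected ratio, a ratio commonly used when screening for CpG islands. Window size, step, and the toy sequence are assumptions for illustration; real analyses use windows of hundreds of bases.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def gc_windows(seq: str, size: int, step: int) -> List[float]:
    """GC content of each window of `size` bases, moving `step` bases at a time."""
    return [gc_content(seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]


def cpg_observed_expected(seq: str) -> float:
    """Observed CG dinucleotides divided by the count expected from C and G frequencies."""
    s = seq.upper()
    c, g = s.count("C"), s.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = s.count("CG")      # "CG" cannot overlap itself, so count() is exact
    expected = c * g / len(s)
    return observed / expected


toy = "ATATATATCGCGCGCG"          # hypothetical: AT-rich half, then GC-rich half
print(gc_windows(toy, size=8, step=8))        # [0.0, 1.0]
print(cpg_observed_expected("CGCGCGCG"))      # 2.0
```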
  • 27. Introduction to the Life Sciences
    • Genomes, diversity, size, structure
      • Profound diversity among the genomes of living organisms.
      • DNA (cells), DNA or RNA (phage, virus)
      • Direction: from 5’ to 3’ of molecule (double stranded DNA), or both directions (single stranded)
      • Genome organized or not in chromosomes
      • Human genome: 23 pairs of chromosomes (22 autosomal pairs plus the sex chromosomes), about 3 billion base pairs, roughly 30,000 genes by the 2001 estimate (later estimates are closer to 20,000 protein-coding genes)
      • Genomes of other species vary in size and number of genes
      • The human genome has only about twice as many genes as a primitive worm
      • GenBank database
  • 28. Introduction to the Life Sciences
    • Proteomes
      • The proteome is the set of proteins that can be expressed from a genome
      • Determination of:
        • Sequence of encoding genes
        • Location of the genes
        • Function of protein encoding genes
        • Different biochemical states (phosphorylation, glycosylation, co-enzymes…)
  • 29. Introduction to the Life Sciences
    • Gene ontologies
      • Gene ontology consortium
        • Dynamic controlled vocabulary to describe
          • Molecular function (Ex: DNA polymerase, …)
          • Biological process (Ex: DNA synthesis, respiration, …)
          • Cellular component (Ex: nucleus, ribosome, …)
  • 30. Principles of Bioinformatics
    • Biological information
      • Molecules at the basis of life can be represented as digital symbol strings (DNA, RNA, …)
      • Digital symbols (monomers) constitute an alphabet
      • Unique representation
      • Importance of probabilistic models
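Treating a molecule as a string over a small alphabet makes the simplest probabilistic model easy to state: each symbol is drawn independently with a fixed probability. The sketch below estimates those probabilities from a training string and scores another string under the model. The independence assumption, the function names, and the sequences are illustrative choices, not something the slides specify.

```python
from collections import Counter
from math import log2
from typing import Dict


def symbol_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each symbol in the sequence."""
    counts = Counter(seq)
    return {symbol: n / len(seq) for symbol, n in counts.items()}


def log_likelihood(seq: str, freqs: Dict[str, float]) -> float:
    """Log2 probability of `seq` if every symbol is drawn independently from `freqs`."""
    # Raises KeyError for a symbol the model has never seen.
    return sum(log2(freqs[symbol]) for symbol in seq)


trained = symbol_frequencies("AACG")              # A: 0.5, C: 0.25, G: 0.25
uniform = {base: 0.25 for base in "ACGT"}

print(log_likelihood("AC", trained))              # -3.0
print(log_likelihood("AC", uniform))              # -4.0
```

The higher score under `trained` says "AC" is more probable under the A-rich model than under a uniform one; richer models such as Markov chains apply the same idea with context.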
  • 31. Principles of Bioinformatics
    • Database annotation quality
      • In addition to natural noise, data are distorted by people’s annotations (curation of the data)
      • Resulting error is very significant
      • Reasons:
        • Storage of positions in a sequence, not content
        • Difficulty of storing content
      • Need to check the data
  • 32. Principles of Bioinformatics
    • Database redundancy
      • Different representations: RNA, cDNA (complementary DNA)
      • Different methods: single-pass sequence, multi-fold repetition of a sequence
      • Different fragments: pre-mRNA can lead to several levels of splicing in cDNA, alternative splicing
      • Redundancy is a source of error:
        • Bias from over-represented fragments for closely related segments
        • Bias from over-represented fragments for correlations
        • Prediction performance is overestimated if input and output are related
  • 33. Principles of Bioinformatics
    • Database redundancy
      • Better to clean the data first
      • Data mining cleaning methods apply
      • Difficulty of differentiating between truly analogous sequences and related ones
      • Sequence profile describes amino acid variation in a family of sequences
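A sequence profile, as mentioned above, can be represented as one frequency table per alignment column. A minimal sketch, assuming the family is already aligned to equal length with no gaps; the four-residue `family` is invented for illustration:

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_profile(aligned: Sequence[str]) -> List[Dict[str, float]]:
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(aligned)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*aligned)]


def consensus(profile: List[Dict[str, float]]) -> str:
    """Most frequent amino acid in each column."""
    return "".join(max(column, key=column.get) for column in profile)


# Hypothetical aligned protein fragments from one family.
family = ["MKVL", "MKIL", "MRVL", "MKVF"]
profile = build_profile(family)

print(profile[0])           # {'M': 1.0}: fully conserved position
print(profile[1])           # {'K': 0.75, 'R': 0.25}: variable position
print(consensus(profile))   # MKVL
```

Columns with a single residue are conserved across the family, while mixed columns show where variation is tolerated.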
  • 34. Principles of Bioinformatics
    • Main bioinformatics questions
      • Determine the exact transition between coding and non coding regions of genes
      • Find genes in prokaryotes and eukaryotes
      • Determine transcription initiation and termination
      • Sequence clustering and cluster topology
      • Protein structure prediction
      • Protein function prediction
      • Protein family classification
  • 35. Principles of Bioinformatics
    • Question
      • Propose questions pertinent for bioinformatics
      • Propose questions pertinent for medical informatics