  • From Data Mining to Knowledge Fitting Joost N. Kok, Leiden Institute of Advanced Computer Science
  • Information Ladder
    • Data
    • Information
    • Knowledge
    • Understanding
    • Insight
    • Wisdom
    Monday, May 10, 2010
  • Data Mining definitions
    • Secondary analysis of data
    • Induction of understandable useful models and patterns from data
    • Algorithms for large quantities of data
    • Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
    • Key properties of the patterns: useful; novel and surprising; comprehensible; valid (accurate)
  • Data Mining
    • Data Mining = Data Search using a Knowledge Bias
  • Data Mining
    • Data Mining is comparable to Statistics (and is often based on it), but goes further:
      • statistics aims more at validating given hypotheses
      • in data mining, often millions of potential patterns are generated and tested, in the hope of finding some that are potentially useful
  • Typical Data Mining Results
    • Forecasting what may happen in the future
    • Classifying people or things into groups by recognizing patterns
    • Clustering people or things into groups based on their attributes
    • Associating what events are likely to occur together
    • Sequencing what events are likely to lead to later events
  • Different types of problems
    • “Data mining” problems / tasks often fall into one of the following categories:
      • Classification
      • Regression
      • Clustering
      • Discovering associations
      • Probabilistic modelling
  • From “Querying” to “Mining”
    • Standard database technology solves such questions:
      • Are there any occurrences of GAAT in this string?
      • How many occurrences of AAT are there in this string?
    • Data mining technology can sometimes solve such questions (computations may be (too) heavy):
      • Which substrings of length 4 occur at least 2 times?
      • Which substrings (of any length) occur significantly more often in the white string than in the black string?
    • Science fiction:
      • Why is the virus to the left resistant to my drug, and the one to the right not?
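
The difference between the querying and mining questions on this slide can be illustrated with a minimal Python sketch. The example string and the function name `frequent_substrings` are assumptions made for illustration and do not come from the talk.

```python
from collections import Counter


def frequent_substrings(text: str, length: int, min_count: int) -> dict[str, int]:
    """Return every substring of the given length that occurs at least min_count times."""
    # Slide a window of the requested length over the text and count each window.
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    return {sub: n for sub, n in counts.items() if n >= min_count}


dna = "GAATTCGAATGAAT"  # hypothetical example string

# Querying: a fixed pattern is given, and the answer is a lookup.
print("GAAT" in dna)      # are there any occurrences of GAAT?
print(dna.count("AAT"))   # how many (non-overlapping) occurrences of AAT?

# Mining: the patterns themselves are unknown and have to be enumerated.
print(frequent_substrings(dna, 4, 2))
```

The mining question has to consider every candidate substring, which is why the cost grows quickly once the length is no longer fixed.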
  • Scientific Data Lifecycle
  • (Diagram: Databases, Ontologies, Integration, Disambiguation, Data, Knowledge, Discovery tools, KDD, Data mining, Statistics, Knowledge Fitting)
  • Building Blocks
  • Link Integration (diagram: four sources)
  • Federated Database (diagram: a distributor and four sources)
  • Data Warehousing (diagram: four sources and a warehouse)
  • Scripting Languages
    • A scripting language is a programming language that controls software applications.
    • Examples: Python, Perl
    • Standards for uniform access to databases
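
The slide does not name a particular standard. As one possible illustration, Python's DB-API 2.0 (PEP 249) gives different database drivers the same connect/cursor/execute interface. The sketch below uses the standard-library `sqlite3` driver with an in-memory database; the table and values are hypothetical.

```python
import sqlite3

# Any DB-API 2.0 driver exposes the same connect / cursor / execute / fetch calls.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table of genes with a score each.
cursor.execute("CREATE TABLE genes (id TEXT, score REAL)")
cursor.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("gene1", 2.5), ("gene2", -0.7)],
)

# Parameterized query; the "?" placeholder style is the one sqlite3 uses.
cursor.execute("SELECT id FROM genes WHERE score > ? ORDER BY id", (0,))
rows = cursor.fetchall()
connection.close()

print(rows)
```

Other drivers may use a different placeholder style, but the sequence of calls stays the same.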
  • Ontologies
    • Ontology is about the description of things and their relationships.
    • Ontologies are taxonomies that define concepts and relationships among them.
    • The subclass / is-a relationship is most predominant in ontologies
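
Because is-a is transitive, a term implicitly belongs to every ancestor class as well. The following minimal sketch walks a toy hierarchy; the terms and the `ancestors` helper are hypothetical and are not taken from any real ontology.

```python
# Hypothetical toy ontology: each term maps to its direct is-a parents.
IS_A = {
    "poodle": ["dog"],
    "dog": ["mammal"],
    "mammal": ["animal"],
    "animal": [],
}


def ancestors(term: str, is_a: dict[str, list[str]]) -> set[str]:
    """Return all terms reachable from `term` by following is-a links."""
    seen: set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(ancestors("poodle", IS_A))
```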
  • OWL = Web Ontology Language
  • Building Blocks
  • Service Orientation
    • SOA = Service-Oriented Architecture
    • SOA: Distributed Software Architecture that allows for building applications through individual component composition
  • Visualisation
    • Intelligent Data Analysis
      • Intelligent = Methods
      • Intelligent = Human Interaction
    • First step:
      • visualisation of the data
  • DNA Visualisation
    • Long patterns over small alphabets are hard to find …
    • ababababababababababababababababababababababa . . .
    • (ab), repeated
    • abbbababaaababbabbbababaaababbabbbababaaababb . . .
    • (abbbababaaababb), repeated
    • abaaaababbbbabaaaababbbbabaaaababbbbabaaaabab . . .
    • (abaaaa · babbbb), repeated
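
For exact repeats such as the three strings above, the repeating unit can be found mechanically. The sketch below is a naive search for the shortest period. It is an illustration only, not the method used in the talk, and the function name is an assumption.

```python
def smallest_repeating_unit(text: str) -> str:
    """Return the shortest prefix whose repetition reproduces `text`.

    The last repetition may be cut off, as in the slide examples.
    """
    n = len(text)
    for period in range(1, n + 1):
        # `period` is valid if every character equals the one `period` positions earlier.
        if all(text[i] == text[i - period] for i in range(period, n)):
            return text[:period]
    return text


for example in (
    "ababababababababababababababababababababababa",
    "abbbababaaababbabbbababaaababbabbbababaaababb",
    "abaaaababbbbabaaaababbbbabaaaababbbbabaaaabab",
):
    print(smallest_repeating_unit(example))
```

This brute-force check only handles exact repeats in short strings. Approximate repeats in long DNA sequences are harder to find, which is the motivation for the visual approach on the following slides.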
  • DNA Visualisation
    • Associate each nucleotide with a dimension
    • Four nucleotides => four dimensions
    • Build a structure in four dimensions
    • Project to three dimensions
  • DNA Visualisation
    • We expect to see the following things in the projection:
      • A non-predictable walk for information rich parts of the DNA
      • A true random walk for random parts
      • Lines (or approximate lines) for repeating parts of the DNA
      • Large identical substrings in the DNA can easily be detected
  • DNA Visualisation
    • Select four three-dimensional vectors.
      • The vectors should be of comparable length
      • The four vectors should add up to 0
      • Every subset of three vectors should be independent.
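
A minimal sketch of such a walk, assuming the four vectors are the vertices of a regular tetrahedron: they have equal length, add up to 0, and any three of them are linearly independent. The talk does not state which vectors were actually used, and the nucleotide-to-vector assignment here is arbitrary.

```python
# Assumed step vectors: vertices of a regular tetrahedron centred on the origin.
STEP = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def dna_walk(sequence: str) -> list[tuple[int, int, int]]:
    """Return the 3-D positions visited when taking one step per nucleotide."""
    position = (0, 0, 0)
    path = [position]
    for nucleotide in sequence:
        dx, dy, dz = STEP[nucleotide]
        position = (position[0] + dx, position[1] + dy, position[2] + dz)
        path.append(position)
    return path


# A repeating part adds the same displacement every period, so the walk follows a line.
print(dna_walk("ACACAC")[-1])
# A stretch that uses all four nucleotides equally often returns to its starting point.
print(dna_walk("ACGT")[-1])
```

Plotting the returned path with any 3-D plotting tool gives the kind of projection described on the previous slides.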
  • DNA Visualisation
  • The first 160,000 nucleotides of the human Y-chromosome
  • Nucleotides 40,000–100,000 of human chromosome 1
  • DNA Visualisation
    • Simple, large and extremely large (approximate) repeats can easily be detected
    • Demo
    • http:// /
  • Data Mining
  • Subgroup Discovery
    • How to find comprehensible subgroups in large amounts of data?
    • As an example: subtypes in complex diseases
    • Different types of input
  • Classification versus Subgroup Discovery (figure: scatter of “+” examples)
  • Classification vs Subgroup Discovery
    • Classification
      • predictive induction
      • constructing sets of classification rules
      • aimed at learning a model for classification or prediction
      • rules are dependent
    • Subgroup Discovery
      • descriptive induction
      • constructing individual subgroup-describing rules
      • aimed at finding interesting patterns in target class examples
  • Towards Knowledge Fitting
    • Trends:
      • A lot of valuable data is no longer shared, for reasons such as privacy concerns and the difficulty and expense of collecting it.
      • The amount of publicly available knowledge increases daily.
      • Patterns and models need to be complemented with knowledge that convinces the user.
  • Knowledge Fitting = Knowledge Mining using a Data Bias
    • Data Mining = Data Search using a Knowledge Bias
  • Subgroup Mining Scenario
    • Prepare the data
    • Model the subgroups
    • Characterize and compare the subgroups
    • Evaluate the subgroups
    • Package available in R
  • Group Modeling
    • Model based cluster analysis.
    • The data is modeled by a mixture of Gaussians.
    • Many models, many BIC scores.
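
For reference, the Bayesian Information Criterion (BIC) used to compare candidate mixture models is commonly written as below, where $\hat{L}$ is the maximized likelihood of a model, $k$ its number of free parameters, and $n$ the number of observations. This is the lower-is-better convention; some software reports the negated quantity, and the slides do not say which convention was used.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
```

Each combination of number of Gaussians and covariance structure yields one score, which explains "many models, many BIC scores".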
  • Group Characteristics
  • Subgroup Evaluation
    • We report in tables statistical results and generalization estimates
  • Gene Expression Data
    • Genomics: the study of genes and their function
    • MicroArray Data
      • a very large number of attributes (genes) relative to the number of examples (observations)
      • typical values : 7000-16000 attributes, 50-150 examples
  • Gene Expression Data: few cases, many features (figure: cases #1, #2, … #100)
  • Ranking of differentially expressed genes
    • The genes are ordered in a ranked list, according to their differential expression between the classes.
    • The challenge is to extract meaning from this list, to describe subgroups.
    • Conjunctions of ontology terms are used as a vocabulary for describing sets of genes.
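
To make the idea of a description concrete: a conjunction of ontology terms selects exactly those genes that are annotated with every term in it. The sketch below uses made-up gene IDs and term names; the `genes_matching` helper is hypothetical.

```python
# Hypothetical annotations, listed in rank order: gene ID -> set of ontology terms.
ANNOTATIONS = {
    "gene1": {"termA", "termB"},
    "gene2": {"termA"},
    "gene3": {"termA", "termB", "termC"},
    "gene4": {"termC"},
}


def genes_matching(conjunction: set[str], annotations: dict[str, set[str]]) -> list[str]:
    """Return the genes annotated with every term in the conjunction, in rank order."""
    return [gene for gene, terms in annotations.items() if conjunction <= terms]


print(genes_matching({"termA", "termB"}, ANNOTATIONS))
```

Adding a term to the conjunction can only shrink the matching gene set, so longer descriptions are more specific.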
  • Subgroup Discovery
    • Discovery of gene subgroups which
      • are “higher” in the ranked list
      • can be compactly summarized using
        • knowledge (GO, ENTREZ, KEGG)
        • interactions between genes
  • Enrichment Score
  • Descriptions
    • FANTOM = Frequent pAtterN Tree-based Ontology Miner
    • FANTOM is a knowledge fitting tool that uncovers “interesting” descriptions of gene sets
      • Interesting: high Gene Set Enrichment Score
      • Search for patterns is exhaustive
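
The slides do not define the Gene Set Enrichment Score. The sketch below assumes an unweighted, Kolmogorov-Smirnov-style running sum of the kind used in gene set enrichment analysis; the score FANTOM actually computes may differ. The function name and the gene IDs are hypothetical.

```python
def enrichment_score(ranked_genes: list[str], gene_set: set[str]) -> float:
    """Maximum deviation from zero of a running sum over the ranked list.

    The sum goes up at genes in the set and down at all other genes,
    scaled so that it returns to zero at the end of the list.
    """
    hits = sum(1 for gene in ranked_genes if gene in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0  # the score is undefined here; return 0 by convention

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4"]  # hypothetical ranked list, most differential first
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g3", "g4"}))  # set concentrated at the bottom
```

A gene set concentrated near the top of the list yields a large positive score, which matches the requirement that subgroups be "higher" in the ranked list.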
  • Inputs
    • FANTOM takes as inputs:
      • A ranked list of genes (default ID is from ENTREZ), together with a score.
      • Ontologies (default are GO and KEGG)
      • Mappings (to map ENTREZ or another ID to the ontologies)
      • Interaction data (if available)
      • Cutoffs
        • minimum GSES
        • minimum number of genes participating in a rule
  • Typical Statistics
    • Experiment comparing two different mouse hearts:
      • Generated rule options: 200k-2m
      • Actual rules: 10-40k
      • Rules after pruning: 5-500
      • Runtime: 5 minutes - 4 hours
  • Knowledge Fitting = Knowledge Mining using a Data Bias
    • Data Mining = Data Search using a Knowledge Bias
  • Intelligent Bridges Movies
  • Cyttron
    • The Cyttron consortium aims at developing a "super microscope", imaging the living cell with atomic resolution.
    • Images gathered with X-ray diffraction, electron microscopy, and other sources will be combined through advanced software solutions.
  • Leiden Institute of Advanced Computer Science
    • The Computer Science Institute of Leiden University
  • Research Clusters
    • Algorithms
    • Foundations of Software Technology
    • Computer Systems
    • Imagery and Media
    • Technology and Innovation Management
  • Acknowledgements
    • Jeroen Laros (LIACS)
    • Jeroen de Bruin (LIACS)
    • Fabrice Colas (LIACS)
    • Nada Lavrac (JSI)
    • Igor Trajkovski (JSI)
    • Jan Bot (TU Delft)
    • Ingrid Meulenbelt (LUMC)
    • Eline Slagboom (LUMC)
    • Peter-Bram 't Hoen (LUMC)
    • Tineke van Veen (LUMC)
    • Stephanie van Rooden (LUMC)
  • Algorithms Cluster @ LIACS