Evolution of Knowledge Discovery and Management. Dr. A. Fazel Famili, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada. [email_address] October 28th 2066
Outline: Background; The Knowledge Discovery Process; Motivations for Knowledge Discovery; Applications and some lessons learned; The real evolution; Summary.
Knowledge Discovery: The process of discovering useful and previously unknown knowledge from historical or real-time data. Input data: sequence data (e.g. C T A GG C T CC A G C T), time series, parametric data, sensor data, documents/images, experiment data. The data mining process: Data Extraction and Selection; Data Pre-processing; Data Analysis (e.g. Pattern Recognition); Post-processing. Discovered Knowledge: informative attributes; thresholds; relationships; strength of discovery. This is what I need!
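The four stages named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the z-score normalisation, the 1.5 threshold and the toy sensor readings are all hypothetical and are not taken from the talk or from any particular tool.

```python
from statistics import mean, stdev

def extract_and_select(records, attribute):
    """Data extraction and selection: keep only records that carry the attribute."""
    return [r[attribute] for r in records if r.get(attribute) is not None]

def preprocess(values):
    """Data pre-processing: z-score normalisation (an assumed choice)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def analyse(z_scores, threshold=1.5):
    """Data analysis: flag positions whose |z| exceeds a hypothetical threshold."""
    return [i for i, z in enumerate(z_scores) if abs(z) > threshold]

def postprocess(flagged, total):
    """Post-processing: summarise the finding in a reusable form."""
    return {"flagged_indices": flagged, "fraction_flagged": len(flagged) / total}

def discover(records, attribute, threshold=1.5):
    values = extract_and_select(records, attribute)
    flagged = analyse(preprocess(values), threshold)
    return postprocess(flagged, len(values))

# Toy sensor readings (made up); indices refer to the selected values,
# i.e. after the record with a missing reading has been dropped.
readings = [{"temp": 20.1}, {"temp": 19.8}, {"temp": None},
            {"temp": 20.3}, {"temp": 35.0}, {"temp": 20.0}]
result = discover(readings, "temp")
print(result)
```

Keeping each stage as a separate function mirrors the slide's point that analysis is only one step among several.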
Roots of Knowledge Discovery: parallel algorithms; machine learning; high performance computing; database and data warehousing; data visualization; applied statistics.
Motivation for Knowledge Discovery: analysis capability (software/hardware); understanding the value of data; data production/storage.
Knowledge Discovery Efforts: algorithm development; algorithm enhancements/extensions; benchmarking; development of KD tool boxes; real-world applications; Knowledge Discovery systems/software (generic, domain specific); batch processing vs on-line applications.
Typical Applications of Data Mining. Sales/Marketing (supermarkets): provide better customer service; improve cross-selling opportunities (beer and nappies); increase direct mail response rates. Customer Retention (banks): identify patterns of defection; predict likely defections. Risk Assessment and Fraud: identify inappropriate or unusual behaviours. Bioinformatics (exploratory research): gene identification, gene response analysis, disease modeling. Management and operation of complex systems/equipment: aerospace, e.g. identification and prediction of operation problems; process control, e.g. yield management.
The real challenge: Bioinformatics - Genomics. With the completion of the Human Genome Map, there are > 30,000 genes in humans and ~3 billion base pairs of sequence (ACGT) to deal with, and many thousands more in other species. How do they behave under different conditions? Identify gene functions and protein-protein interactions. Discover gene responses to various conditions (e.g. environment, life). Technology advancements: high-throughput biological experiments, genomics, proteomics, etc.
The real challenge (cont'd). Huge influx of data produced in biotech and health care (e.g. >300,000 biochips by OCI alone, >500,000 from Affymetrix, plus Agilent, GE, etc.). Many efforts on building tumor banks, etc. Patient data becoming available in all forms for: accurate diagnosis; better treatments; intelligent drug discovery and target validation. Electronic documents containing reports and results of research. Many more species are unknown.
 
Data in Genomics and Proteomics. Genomics: microarrays. Proteomics: MS; 2D/3D gels; protein arrays. Sequence data. Data: quantitative; qualitative; complex; multi-layered; incomplete; informative.
Biological Data Analysis: Microarray Data; Normalization; Data Pre-processing (Understanding the data); Pattern Searching (supervised methods, unsupervised methods); Interesting Results (differentially expressed genes, models); Validation; Documentation; Knowledge Discovery.
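As a rough illustration of the normalization and "differentially expressed genes" steps, the sketch below computes log2 ratios, median-centres them, and applies a fold-change cut-off. The median-centring method, the cut-off of 1.0 (a two-fold change), the function names and the signal values are all assumptions made for this example; the talk does not say which normalization or selection method was used.

```python
import math
from statistics import median

def log2_ratios(test, control):
    """Per-gene log2(test / control) from raw signal intensities."""
    return {g: math.log2(test[g] / control[g]) for g in test}

def median_center(ratios):
    """Global normalisation: subtract the median log ratio from every gene."""
    m = median(ratios.values())
    return {g: r - m for g, r in ratios.items()}

def differentially_expressed(ratios, min_abs_log2=1.0):
    """Genes whose normalised |log2 ratio| reaches the (assumed) cut-off."""
    return sorted(g for g, r in ratios.items() if abs(r) >= min_abs_log2)

# Made-up intensities; the test channel is roughly twice as bright overall,
# which the median-centring step is meant to remove.
test_signal = {"g1": 800.0, "g2": 210.0, "g3": 190.0, "g4": 50.0, "g5": 400.0}
control_signal = {"g1": 100.0, "g2": 100.0, "g3": 100.0, "g4": 100.0, "g5": 200.0}

normalised = median_center(log2_ratios(test_signal, control_signal))
hits = differentially_expressed(normalised)
print(hits)
```

Without the centring step, g2 and g5 would also pass the cut-off simply because of the overall brightness difference between the two channels.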
Contributions and applications: functional genomics (gene function identification); gene response analysis; comparative genomics; disease modeling; integrated genomics and proteomics; potential for pharmacogenomics and toxicogenomics.
Comparative Genomics Comparative genomics is the study of relationships between the genomes of different species
Comparative Genomics (workflow): Control / Test; Full samples; Hybridize and wash; Microarray Data; Identify Patterns of Gene Expression.
Looking at some case studies: Bioinformatics
Disease Modeling: Leukemia case study. There are two subtypes of acute leukemia based on their origins, either lymphoid (ALL) or myeloid (AML). Data: expression levels of 6817 genes from AML (acute myeloid leukemia) and ALL (acute lymphoblastic leukemia) patients. Training: 38 bone marrow samples (27 ALL, 11 AML). Independent: 34 samples (20 ALL, 14 AML; bone marrow and peripheral blood; different sources of patients). Objectives: identify the most informative genes; build models that can best explain instances.
[Venn diagram: genes identified by Golub et al. vs. BioMiner. Golub et al. only: 37; Golub et al. & BioMiner: 13; BioMiner only: 37.]
Understanding the data. [Figure: visualization of the training data with patients P17 and P20 marked; source file VR\ALLG\lam_train_lab_t_20020709_173359.html]
Predictor Model and informative Genes. Clear distinction between two classes: ALL and AML. Attributes are 50 genes identified by "discover and mask". Top 50 genes highlight 37 new targets not previously reported. BioMiner singled out X95735 (Zyxin, cell adhesion) as the most informative gene. The predictor model is: if X95735 <= 938 then ALL; else AML. Cross-validation (training set): 31/38 (81.58%). Test on new cases (independent set): 31/34 (91.18%).
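The single-gene predictor quoted on this slide is simple enough to write out directly. The sketch below encodes the rule and rechecks the two reported accuracies from the counts given; the function names are hypothetical, and the expression values in the usage lines are made up for illustration.

```python
def predict_leukemia_subtype(x95735_expression, threshold=938):
    """Rule from the slide: X95735 <= 938 -> ALL, otherwise AML."""
    return "ALL" if x95735_expression <= threshold else "AML"

def accuracy_percent(correct, total):
    """Share of correctly classified samples, as a percentage."""
    return 100.0 * correct / total

# Illustrative expression values (not from the data set).
print(predict_leukemia_subtype(412))    # at or below the threshold
print(predict_leukemia_subtype(2150))   # above the threshold

# Recompute the rates reported on the slide from the quoted counts.
cv_rate = round(accuracy_percent(31, 38), 2)     # training set, cross-validation
test_rate = round(accuracy_percent(31, 34), 2)   # independent set
print(cv_rate, test_rate)
```

A one-threshold rule like this is easy for a domain expert to inspect, which is part of its appeal compared with a model over all 50 genes.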
Effects of anomalies on modeling. Four experiments: (i) all 38 patients, (ii) remove patient 20, (iii) remove patient 17, and (iv) remove both patients 20 and 17. Result: no improvement in the accuracy of the prediction models. Variation in the models and their prediction rates might reflect knowledge lost by removing patients 20 and 17, which have many more anomalies.
Gene Identification in Alzheimer Disease. BioMiner, applied to 19.2K arrays from healthy & AD patients, identified 67 genes in 3 categories: 17 genes known to be related to AD, 20 genes not reported before, and 30 genes with unknown functions. Using an Integrated Data Mining System with strong future potential. Reported in the Journal of AI in Medicine, July 2004 (Elsevier).
Gene Identification in HCV (CHEO). BioMiner, applied to 15.6K arrays from normal & infected mice, identified the 7 most informative genes, highly associated with HCV (Hepatitis C Virus). Using an Integrated Data Mining System with strong future potential. Reported in the IEA/AIE Conference (May 2004). Next: CHEO will provide data from human arrays of HIV+ patients (NRC-Spain project).
Case study: multiple experiments. Purpose: study similarities and differences in macrophage (white blood cell, important for our immune system) activation by LPS and three types of Ganoderma. Experiment: used inflammatory macrophages of HeN mice; 48 h; 4 treatments: LPS and three Ganoderma (Myriam, China, Nan); mouse 15K arrays from OCI; Normaliser 3 (Brandon Smith). Experiments: Myriam 12693281, Nan 12693388 and China were rejected.
Data Analysis process: Biological problem; Microarray Experiments; Methods (CM, SAM, RP); Results (G1, G2, G3); Final Results; Biological/Literature Validation. CM: Cluster Mapping; SAM: Significance Analysis of Microarrays; RP: Rank Products.
Methods: SAM, RP and CM. SAM (Significance Analysis of Microarrays, V. Tusher et al., 2000): assigns a score to each gene on the basis of the change in gene expression relative to the standard deviation of repeated experiments. RP (Rank Products, R. Breitling et al., 2004): based on biological reasoning; ranks genes by the product of their ranks across all experiments. CM (Cluster Mapping, Famili et al., 2004): identifies clusters of genes with common properties across multiple experiments (E1, E2, ..., En); uses centroid data to derive new features and search for patterns, trends, etc.
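The two published scoring ideas can be sketched in a few lines. This is a simplified illustration, not the reference implementations: the SAM-style score here is a one-class statistic with a hand-picked constant `s0` (the published method chooses it from the data and adds permutation-based significance estimates), and the rank product is the geometric mean of per-replicate ranks with no tie handling or significance testing. Gene names and log ratios are made up, and Cluster Mapping is not sketched because the slide does not give enough detail.

```python
import math
from statistics import mean, stdev

def sam_like_score(log_ratios, s0=0.1):
    """Mean change divided by its standard error plus a small constant s0 (assumed value)."""
    n = len(log_ratios)
    return mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n) + s0)

def rank_products(log_ratio_table):
    """Geometric mean of each gene's rank across replicates (rank 1 = most up-regulated)."""
    genes = list(log_ratio_table)
    n_reps = len(next(iter(log_ratio_table.values())))
    products = {g: 1.0 for g in genes}
    for rep in range(n_reps):
        ordered = sorted(genes, key=lambda g: log_ratio_table[g][rep], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            products[g] *= rank
    return {g: products[g] ** (1.0 / n_reps) for g in genes}

# Toy data: three replicate experiments, log ratios per gene (illustrative only).
toy = {
    "geneA": [2.1, 1.9, 2.3],   # consistently up-regulated
    "geneB": [0.1, -0.2, 0.0],  # unchanged
    "geneC": [1.0, 2.0, 0.2],   # up-regulated but noisy
}
rp = rank_products(toy)
scores = {g: sam_like_score(v) for g, v in toy.items()}
print(sorted(rp, key=rp.get))                       # smallest rank product first
print(sorted(scores, key=scores.get, reverse=True)) # largest score first
```

On this toy input both scores put the consistent gene first and the unchanged gene last, but they weigh noise differently, which is one reason to compare the gene lists produced by several methods.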
Results: Significant genes discovered by different methods. Group 1: genes common to Nan and Myriam. Group 2: genes common to Nan, Myriam and LPS. Group 3: genes common to Nan and Myriam, but not LPS.
CM&SAM&RP (coverage rate): Group 1: 18/46 known genes = 39%; Group 2: 22/48 known genes = 45.8%; Group 3: 2/9 known genes = 22%.
SAM&RP, no CM (coverage rate): Group 1: 4/14 known genes = 28.6%; Group 2: 12/58 known genes = 20.7%; Group 3: ?/4 known genes = ?.
SAM&RP (total coverage rate): Group 1: 22/60 known genes = 36.7%; Group 2: 34/106 known genes = 32%; Group 3: ?.
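The coverage rates can be rechecked from the quoted counts, reading each fraction as known genes over genes discovered (an interpretation of the slide, not something it states). The dictionary layout and function names below are hypothetical, and the Group 3 entries marked "?" are left out rather than guessed. Recomputed this way, 12/58 evaluates to about 20.7% and 22/60 to about 36.7%.

```python
# (known genes, genes discovered) per method and group, as quoted on the slide.
counts = {
    "CM&SAM&RP": {"Group 1": (18, 46), "Group 2": (22, 48), "Group 3": (2, 9)},
    "SAM&RP (no CM)": {"Group 1": (4, 14), "Group 2": (12, 58)},
}

def coverage(known, found):
    """Coverage rate in percent, rounded to one decimal place."""
    return round(100.0 * known / found, 1)

def total_counts(group):
    """Sum known and discovered genes for a group over the methods that report it."""
    known = sum(rows[group][0] for rows in counts.values() if group in rows)
    found = sum(rows[group][1] for rows in counts.values() if group in rows)
    return known, found

for method, rows in counts.items():
    for group, (known, found) in rows.items():
        print(method, group, f"{known}/{found}", coverage(known, found))

for group in ("Group 1", "Group 2"):
    known, found = total_counts(group)
    print("SAM&RP total", group, f"{known}/{found}", coverage(known, found))
```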
Some lessons learned: understanding the domain/problem is extremely important; continuous interaction with domain experts; proper data selection, data reduction and feature selection strategies; data re-representation (e.g. normalization, constructive induction) is commonly required; efficient data mining methods/processes/strategies are essential to knowledge discovery; and finally, integration, structuring and dissemination of new knowledge in an easily usable structure.
Short Summary and the evolution. In the past: lack of large volumes of data; sometimes had to simulate data; had to convince owners of data to collaborate (e.g. by demonstration). Now, there is no shortage of: problems/research topics to work on, particularly complex ones; real-world data.
Short Summary (cont'd). We have seen many successful applications of Knowledge Discovery methods (aerospace, genomics and proteomics, drug discovery, manufacturing, finance/banking, ...). Key areas of KD that are evolving: automated data analysis; integration of systems, tools, database access, etc.; intelligent applications; handling various forms of data (e.g. text, parametric data, images).
A simple comparison... Expert Systems: introduced in the 80's and 90's; lots of promise; everyone jumped in; few results; many companies/tools disappeared; left a bad impression. Knowledge Discovery: several academic contributions; valuable applications with excellent results; potential for more research; R&D will continue for years to come; some ups and downs.
Knowledge Management: What is it? Why do we need it? How can we manage knowledge?
What is Knowledge Management? It consists of a range of practices and techniques used by organizations to identify, represent and distribute knowledge, know-how, expertise, intellectual capital and other forms of knowledge for leverage, reuse and transfer of knowledge and learning across an organization. Prime motivation for many researchers in KD: what do we do with the vast amount of discovered knowledge?
Why do we need Knowledge Management? To facilitate organizational operation/learning; achieve shorter new product development cycles; facilitate and manage organisational innovation; leverage the expertise of people across the organization; maintain consistency in good practices.
How can we manage it? Use available technologies; develop new frameworks and infrastructures, including new tools (many tools already exist); everyone's participation is needed; the culture must be understood. One approach: Decision Support Systems.
BioIntelligence Framework
Current Directions - Evolution of Knowledge Discovery (specific examples): automated knowledge discovery; integrated knowledge discovery (e.g. genomics & proteomics, or heterogeneous knowledge discovery). Novel applications in bioinformatics: time-series genomics; phylogenetics; gene identification and disease modeling; personalized medicine; data tracking of patients and evaluation of drugs.
Resources. Journals: IEEE, ACM, IDA, KDD and many more. Books. Search in Google. Sites: Kd-nuggets (http://www.kdnuggets.com/), http://www.the-data-mine.com/. Events: conferences (KDD, ACM, IEEE, IDA, PKDD, ML, etc.); tutorials/training sessions/workshops.
Thank you!
Additional slides
