Evolution of Knowledge Discovery and Management


Fazel Famili's presentation at InSciT2006.

Published in: Technology, Education
  • Evolution of Knowledge Discovery and Management

    1. 1. Evolution of Knowledge Discovery and Management
        Dr. A. Fazel Famili
        National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
        [email_address]
        October 28th, 2006
    2. 2. Outline
        • Background
        • The Knowledge Discovery Process
        • Motivations for Knowledge Discovery
        • Applications and some lessons learned
        • The real evolution
        • Summary
    3. 3. The data mining process
        Knowledge Discovery: the process of discovering useful and previously unknown knowledge from historical or real-time data.
        Input data:
        • Parametric data
        • Sensor data
        • Documents/images
        • Experiment data
        • Sequence data: C T A GG C T CC A G C T
        • Time series
        Process steps:
        • Data extraction and selection
        • Data pre-processing
        • Data analysis (e.g. pattern recognition)
        • Post-processing
        Discovered knowledge ("This is what I need!"):
        • Informative attributes
        • Thresholds
        • Relationships
        • Strength of discovery
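The four process steps named on this slide can be sketched as a small pipeline. This is only an illustrative toy, not the presenter's system: the function names, the `temp` attribute, the sample records, and the "above the mean" analysis rule are all assumptions made for the example.

```python
from statistics import mean

def extract_and_select(records, attribute):
    """Data extraction and selection: keep one attribute of interest."""
    return [r[attribute] for r in records if attribute in r]

def preprocess(values):
    """Data pre-processing: drop missing readings."""
    return [v for v in values if v is not None]

def analyse(values):
    """Data analysis (toy rule): use the mean as a threshold and flag values above it."""
    threshold = mean(values)
    return threshold, [v for v in values if v > threshold]

def postprocess(threshold, flagged):
    """Post-processing: turn the raw finding into a readable statement."""
    return f"{len(flagged)} reading(s) above threshold {threshold:.1f}"

# Hypothetical sensor records, including a missing value and an unrelated attribute.
records = [{"temp": 10.0}, {"temp": None}, {"temp": 30.0}, {"pressure": 1.0}, {"temp": 20.0}]

selected = extract_and_select(records, "temp")
clean = preprocess(selected)
threshold, flagged = analyse(clean)
print(postprocess(threshold, flagged))
```

Each stage takes only the previous stage's output, which mirrors the step-by-step flow on the slide and makes it easy to swap in a real analysis method.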
    4. 4. Roots of Knowledge Discovery
        Fields feeding into Knowledge Discovery:
        • Parallel algorithms
        • Machine Learning
        • High performance computing
        • Visualisation
        • Database and Data Warehousing
        • Data Visualization
        • Applied Statistics
    5. 5. Motivation
        Drivers of Knowledge Discovery:
        • Analysis capability (software/hardware)
        • Understanding the value of data
        • Data production/storage
    6. 6. Knowledge Discovery Efforts
        • Algorithm development
        • Algorithm enhancements/extensions
        • Benchmarking
        • Development of KD tool boxes
        • Real world applications
        • Knowledge Discovery systems/software
            • Generic
            • Domain specific
        • Batch processing vs on-line applications
    7. 7. Typical Applications of Data Mining
        • Sales/Marketing - Supermarkets
            • Provide better customer service
            • Improve cross-selling opportunities (beer and nappies)
            • Increase direct mail response rates
        • Customer Retention - Banks
            • Identify patterns of defection
            • Predict likely defections
        • Risk Assessment and Fraud
            • Identify inappropriate or unusual behaviours
        • Bioinformatics - Exploratory research
            • Gene identification/gene response analysis/disease modeling
        • Management and operation of complex systems/equipment
            • Aerospace, e.g. identification and prediction of operation problems
            • Process control, e.g. yield management
    8. 8. The real challenge: Bioinformatics - Genomics
        • With the completion of the Human Genome Map, more than 30,000 genes in humans
        • About 3 billion base pairs of sequences (ACGT) to deal with, and …
        • Many thousands more in other species, …
        • How do they behave under different conditions?
            • Identify gene functions and protein-protein interactions
            • Discover gene responses to various conditions (e.g. environment, life)
        • Technology advancements, high throughput biological experiments, genomics, proteomics, etc.
    9. 9. The real challenge (cont’d)
        • Huge influx of data produced in biotech and health care (e.g. >300,000 biochips by OCI alone, >500,000 from Affymetrix, plus Agilent, GE, etc.)
            • Many efforts on building tumor banks, etc.
        • Patient data becoming available in all forms for:
            • Accurate diagnosis
            • Better treatments
            • Intelligent drug discovery and target validation
        • Electronic documents containing reports and results of research
        • Many more species are unknown
    10. 11. Data in Genomics and Proteomics
        Sources:
        • Genomics: microarrays
        • Proteomics: MS, 2D/3D gels, protein arrays
        • Sequence data
        The data are:
        • Quantitative
        • Qualitative
        • Complex
        • Multi-layered
        • Incomplete
        • Informative
    11. 12. Biological Data Analysis
        Microarray data:
        • Data pre-processing (understanding the data)
        • Normalization
        • Pattern searching
            • Supervised methods
            • Unsupervised methods
        • Interesting results
            • Differentially expressed genes
            • Models
        • Validation
        • Documentation
        Outcome: Knowledge Discovery
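The normalization and "differentially expressed genes" steps above can be illustrated with a minimal sketch. The slide does not say which normalization was used, so median-centring of log2 ratios and a two-fold cutoff are assumptions chosen for illustration, and the gene names and intensities are invented.

```python
import math
from statistics import median

def log2_ratios(test, control):
    """Per-gene log2(test / control) intensity ratio."""
    return {gene: math.log2(test[gene] / control[gene]) for gene in test}

def median_centre(ratios):
    """Normalization (assumed method): shift ratios so their median is zero."""
    m = median(ratios.values())
    return {gene: r - m for gene, r in ratios.items()}

def differentially_expressed(ratios, cutoff=1.0):
    """Genes whose normalized |log2 ratio| meets the cutoff (1.0 = two-fold)."""
    return sorted(gene for gene, r in ratios.items() if abs(r) >= cutoff)

# Hypothetical intensities for four genes on one array.
control = {"g1": 100.0, "g2": 100.0, "g3": 100.0, "g4": 100.0}
test = {"g1": 800.0, "g2": 200.0, "g3": 100.0, "g4": 50.0}

normalized = median_centre(log2_ratios(test, control))
print(differentially_expressed(normalized))
```

In practice the cutoff would be replaced by a statistical test across replicate arrays; the fixed threshold here only keeps the example short.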
    12. 13. Contributions and applications
        • Functional genomics (gene function identification)
        • Gene response analysis
        • Comparative genomics
        • Disease modeling
        • Integrated genomics and proteomics
        • Potential for pharmacogenomics and toxicogenomics
    13. 14. Comparative Genomics
        • Comparative genomics is the study of relationships between the genomes of different species
    14. 15. Comparative Genomics
        • Comparative genomics is the study of relationships between the genomes of different species
        [Figure: control and test samples (full samples) -> hybridize and wash -> microarray data -> identify patterns of gene expression]
    15. 16. Looking at some case studies: Bioinformatics
    16. 17. Disease Modeling: Leukemia case study
        • There are two subtypes of acute leukemia, distinguished by origin: lymphoid (ALL) or myeloid (AML).
        • Expression levels of 6817 genes of AML (acute myeloid) and ALL (acute lymphoblastic) patients:
            • Training: 38 bone marrow samples (27 ALL, 11 AML)
            • Independent: 34 samples (20 ALL, 14 AML) from bone marrow and peripheral blood, drawn from different sources of patients
        • Objectives:
            • Identify the most informative genes
            • Build models that can best explain instances
    17. 18. [Figure: comparison of the genes selected by Golub et al. and by BioMiner; counts shown: 13, 37, 37]
    18. 19. Understanding the data
        [Figure: data visualisation with patients P17 and P20 marked; source file VRALLGlam_train_lab_t_20020709_173359.html]
    19. 20. Predictor Model and Informative Genes
        • Clear distinction between two classes: ALL and AML
        • Attributes are 50 genes identified by “discover and mask”
        • Top 50 genes highlight 37 new targets not previously reported
        • BioMiner singled out X95735 (Zyxin, cell adhesion) as the most informative gene
        • The predictor model is:
            If X95735 <= 938 then ALL; else AML
            • Cross-validation (training set): 31/38 (81.58%)
            • Test on new cases (independent set): 31/34 (91.18%)
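The single-gene rule and the two accuracy figures on this slide can be reproduced in a few lines. The threshold 938 and the counts 31/38 and 31/34 come from the slide; the function names and the sample expression values are hypothetical and used only to exercise the rule.

```python
THRESHOLD = 938  # from the slide: X95735 <= 938 -> ALL, otherwise AML

def predict_subtype(x95735_expression: float) -> str:
    """Apply the one-gene rule reported for X95735 (Zyxin)."""
    return "ALL" if x95735_expression <= THRESHOLD else "AML"

def accuracy_percent(correct: int, total: int) -> float:
    """Share of correctly classified samples, as a percentage."""
    return 100.0 * correct / total

# Hypothetical expression values, one on each side of the threshold.
for value in (500.0, 2500.0):
    print(value, "->", predict_subtype(value))

print(f"Cross-validation: {accuracy_percent(31, 38):.2f}%")  # 81.58%
print(f"Independent set:  {accuracy_percent(31, 34):.2f}%")  # 91.18%
```

A rule this small is easy to audit, which is part of why a single highly informative gene is attractive as a predictor.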
    20. 21. Effects of anomalies on modeling
        • Four experiments: (i) all 38 patients, (ii) remove patient 20, (iii) remove patient 17, and (iv) remove both patients 20 and 17.
        • No improvement in the accuracy of the prediction models.
        • Variation in the models and their prediction rates might reflect knowledge lost by removing patients 20 and 17, which have many more anomalies than the others.
    21. 22. Gene Identification in Alzheimer's Disease
        BioMiner, applied to 19.2K arrays from healthy & AD patients, identified 67 genes in 3 categories:
        • 17 genes known to be related to AD
        • 20 genes not reported before
        • 30 genes with unknown functions
        Notes:
        • Using an Integrated Data Mining System with strong future potential
        • Reported in the Journal of AI in Medicine, July 2004 (Elsevier)
    22. 23. Gene Identification in HCV (CHEO)
        BioMiner, applied to 15.6K arrays from normal & infected mice, identified the 7 most informative genes:
        • 7 genes, highly associated with HCV (Hepatitis C Virus)
        • Using an Integrated Data Mining System with strong future potential
        • Reported in the IEA/AIE Conference (May 2004)
        • Next: CHEO will provide data from human arrays of HIV+ patients (NRC-Spain project)
    23. 24. Case study: multiple experiments
        • Purpose: Study similarities and differences in macrophage (white blood cell, important for our immune system) activation by LPS and three types of Ganoderma
        • Experiment: Used inflammatory macrophages of HeN mice
            • 48 h, 4 treatments: LPS and three Ganoderma (Myriam, China, Nan)
            • Mouse 15K arrays from OCI
            • Normaliser 3 (Brandon Smith)
        • Data: Experiments: Myriam 12693281, Nan 12693388 and China were rejected
    24. 25. Data Analysis process
        [Figure: biological problem -> microarray experiments -> methods (CM, SAM, RP) -> results (G1, G2, G3) -> final results -> biological/literature validation]
        • CM: Cluster Mapping
        • SAM: Significance Analysis of Microarrays
        • RP: Rank Products
    25. 26. Methods: SAM, RP and CM
        • SAM (Significance Analysis of Microarrays, V. Tusher et al - 2000)
            • Assigns a score to each gene on the basis of the change in gene expression relative to the standard deviation of repeated experiments.
        • RP (Rank Products, R. Breitling et al - 2004)
            • Based on biological reasoning; ranks genes by the product of their ranks across all experiments.
        • CM (Cluster Mapping, Famili et al, 2004)
            • Identify clusters of genes with common properties across multiple experiments
            • Use centroid data to derive new features and search for patterns, trends, etc.
        [Figure: experiments E1, E2, ..., En]
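To make the SAM and RP scoring ideas concrete, here is a simplified sketch of both. It is not the published software: the SAM-style score uses a fixed, assumed constant `s0` (SAM itself estimates it from the data and adds permutation-based significance), the rank product is taken as the geometric mean of ranks, and all gene names and values are invented.

```python
import math
from statistics import mean, variance

def sam_style_score(treated, control, s0=0.1):
    """Difference in means relative to the pooled standard error plus s0."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(treated) - mean(control)) / (s + s0)

def rank_products(fold_changes):
    """Geometric mean of each gene's rank across replicates (rank 1 = largest fold change)."""
    genes = list(fold_changes)
    n_reps = len(fold_changes[genes[0]])
    log_sum = {gene: 0.0 for gene in genes}
    for k in range(n_reps):
        ordered = sorted(genes, key=lambda g: fold_changes[g][k], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_sum[gene] += math.log(rank)
    return {gene: math.exp(total / n_reps) for gene, total in log_sum.items()}

# Hypothetical log fold changes for three genes in two replicate experiments.
fold_changes = {"geneA": [2.0, 1.8], "geneB": [0.1, 0.3], "geneC": [-1.0, -0.9]}
rp = rank_products(fold_changes)
print(sorted(rp, key=rp.get))  # smallest rank product first

# Hypothetical expression values for one gene under two conditions.
print(round(sam_style_score([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]), 2))
```

A small rank product means a gene is consistently near the top in every replicate, while a large SAM-style score means the change is big compared with the replicate-to-replicate noise.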
    26. 27. Results: Significant genes discovered by different methods
        Group 1: genes common to Nan and Myriam
        Group 2: genes common to Nan, Myriam and LPS
        Group 3: genes common to Nan and Myriam, but not LPS

        Coverage rate (known genes)    Group 1          Group 2           Group 3
        CM & SAM & RP                  18/46 = 39%      22/48 = 45.8%     2/9 = 22%
        SAM & RP (no CM)               4/14 = 28.6%     12/58 = 20.7%     ?/4 = ?
        SAM & RP (total)               22/60 = 36.6%    34/106 = 32%      ?
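The coverage rates can be rechecked directly from the known-gene counts on this slide. The counts below are taken from the slide; the dictionary layout and function names are assumptions for the example, and the Group 3 entry without CM is left out because its count is not given.

```python
# (known genes, genes found) per method and group, as listed on the slide.
counts = {
    "CM&SAM&RP": {"Group 1": (18, 46), "Group 2": (22, 48), "Group 3": (2, 9)},
    "SAM&RP (no CM)": {"Group 1": (4, 14), "Group 2": (12, 58)},
}

def coverage_rate(known: int, total: int) -> float:
    """Percentage of discovered genes that were already known."""
    return 100.0 * known / total

def combined(group: str) -> tuple:
    """Add up known and total counts for a group across both rows."""
    known = sum(row[group][0] for row in counts.values() if group in row)
    total = sum(row[group][1] for row in counts.values() if group in row)
    return known, total

for method, groups in counts.items():
    for group, (known, total) in groups.items():
        print(f"{method:15s} {group}: {known}/{total} = {coverage_rate(known, total):.1f}%")

for group in ("Group 1", "Group 2"):
    known, total = combined(group)
    print(f"SAM&RP total    {group}: {known}/{total} = {coverage_rate(known, total):.1f}%")
```

The combined counts match the totals row of the slide (22/60 and 34/106), which confirms that the total is the sum of the two rows above it.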
    27. 28. Some lessons learned
        • Understanding the domain/problem is extremely important
        • Continuous interaction with domain experts
        • Proper data selection, data reduction and feature selection strategies
        • Data re-representation (e.g. normalization, constructive induction) is commonly required
        • Efficient data mining methods/processes/strategies are essential to knowledge discovery
        • And finally:
            • Integration, structuring and dissemination of new knowledge in an easily usable structure …
    28. 29. Short Summary and the evolution
        • In the past:
            • Lack of large volumes of data
            • Sometimes had to simulate data
            • Had to convince owners of data to collaborate (e.g. demonstrate)
        • Now, there is no shortage of:
            • Problems/research topics to work on, particularly complex ones
            • Real-world data
    29. 30. Short Summary (Cont’d)
        • We have seen many successful applications of Knowledge Discovery methods (Aerospace, Genomics and Proteomics, Drug Discovery, Manufacturing, Finance/Banking, …)
        • Key areas of KD that are evolving:
            • Automated data analysis
            • Integration of systems, tools, database access, etc.
            • Intelligent applications
            • Handling various forms of data (e.g. text, parametric data, images, etc.)
    30. 31. A simple comparison . . .
        • Expert Systems
            • Introduced in the 80’s and 90’s
            • Lots of promise
            • Everyone jumped in
            • Few results
            • Many companies/tools disappeared
            • Left a bad impression
        • Knowledge Discovery
            • Several academic contributions
            • Valuable applications with excellent results
            • Potential for more research
            • R&D will continue for years to come
            • Some ups and downs
    31. 32. Knowledge Management
        • What is it?
        • Why do we need it?
        • How can we manage knowledge?
    32. 33. What is Knowledge Management?
        • Consists of a range of practices and techniques used by organizations to identify, represent and distribute knowledge, know-how, expertise, intellectual capital and other forms of knowledge for leverage, reuse and transfer of knowledge and learning across an organization.
        • Prime motivation for many researchers in KD…
            • What do we do with the vast amount of discovered knowledge?
    33. 34. Why do we need Knowledge Management?
        • Facilitating organizational operation/learning
        • Achieving shorter new product development cycles
        • Facilitating and managing organisational innovation
        • Leveraging the expertise of people across the organization
        • Ensuring consistency in good practices
    34. 35. How can we manage it?
        • Use of available technologies
        • Developing new frameworks and infrastructures, including new tools (many tools already exist)
        • Needs everyone’s participation
        • Requires understanding the culture
        • One approach: Decision Support Systems
    35. 36. BioIntelligence Framework
    36. 37. Current Directions: Evolution of Knowledge Discovery (specific examples)
        • Automated knowledge discovery
        • Integrated knowledge discovery (e.g. genomics & proteomics, etc., or heterogeneous knowledge discovery)
        • Novel applications in bioinformatics:
            • Time-series genomics
            • Phylogenetics
            • Gene identification and disease modeling
            • Personalized medicine
        • Data tracking of patients and evaluation of drugs
    37. 38. Resources
        • Journals:
            • IEEE, ACM, IDA, KDD and many more …
        • Books:
            • Search in Google
        • Sites:
            • KDnuggets ( http://www.kdnuggets.com/ )
            • http://www.the-data-mine.com/
        • Events:
            • Conferences (KDD, ACM, IEEE, IDA, PKDD, ML, etc.)
            • Tutorials/training sessions/workshops
    38. 39. Thank you!
    39. 40. Additional slides