Evolution of Knowledge Discovery and Management

Dr. A. Fazel Famili
National Research Council of Canada
Ottawa, ON K1A 0R6, Canada
[email_address]
October 28th, 2006

Outline
- Background
- The Knowledge Discovery Process
- Motivations for Knowledge Discovery
- Applications and some lessons learned
- The real evolution
- Summary

The data mining process

Knowledge Discovery: the process of discovering useful and previously unknown knowledge from historical or real-time data.

Input data:
- Sequence data (e.g. C T A GG C T CC A G C T)
- Time series
- Parametric data
- Sensor data
- Documents/images
- Experiment data

Process steps:
- Data Extraction and Selection
- Data Pre-processing
- Data Analysis (e.g. Pattern Recognition)
- Post-processing

Discovered knowledge:
- Informative attributes
- Thresholds
- Relationships
- Strength of discovery

"This is what I need!"
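The four process steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names (`extract`, `preprocess`, `analyse`, `postprocess`, `discover`), the record layout, and the z-score outlier test used as the "analysis" step are all hypothetical and are not taken from the talk.

```python
from statistics import mean, stdev

def extract(records, field):
    """Data extraction and selection: keep one attribute, skip records lacking it."""
    return [r[field] for r in records if field in r]

def preprocess(values):
    """Data pre-processing: drop missing readings (None)."""
    return [v for v in values if v is not None]

def analyse(values, z_cut=2.0):
    """Data analysis (assumed here: a simple z-score outlier test)."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_cut]

def postprocess(flagged):
    """Post-processing: summarise the findings for the user."""
    return {"n_flagged": len(flagged), "values": sorted(flagged)}

def discover(records, field):
    """Chain the four steps in the order given on the slide."""
    return postprocess(analyse(preprocess(extract(records, field))))
```

Keeping each step as its own function mirrors the slide: any one stage (for example the analysis method) can be swapped without touching the others.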

Roots of Knowledge Discovery
- Parallel algorithms
- Machine Learning
- High performance computing
- Visualisation
- Database and Data Warehousing
- Data Visualization
- Applied Statistics

Motivation for Knowledge Discovery
- Analysis capability (software/hardware)
- Understanding the value of data
- Data production/storage

Knowledge Discovery Efforts
- Algorithm development
- Algorithm enhancements/extensions
- Benchmarking
- Development of KD tool boxes
- Real world applications
- Knowledge Discovery systems/software
  - Generic
  - Domain specific
- Batch processing vs on-line applications

Typical Applications of Data Mining
- Sales/Marketing (supermarkets)
  - Provide better customer service
  - Improve cross-selling opportunities (beer and nappies)
  - Increase direct mail response rates
- Customer Retention (banks)
  - Identify patterns of defection
  - Predict likely defections
- Risk Assessment and Fraud
  - Identify inappropriate or unusual behaviours
- Bioinformatics (exploratory research)
  - Gene identification, gene response analysis, disease modeling
- Management and operation of complex systems/equipment
  - Aerospace, e.g. identification and prediction of operation problems
  - Process control, e.g. yield management

The real challenge: Bioinformatics - Genomics
- With the completion of the Human Genome Map, there are more than 30,000 human genes and about 3 billion base pairs of sequence (ACGT) to deal with
- Many thousands more in other species
- How do they behave under different conditions?
  - Identify gene functions and protein-protein interactions
  - Discover gene responses to various conditions (e.g. environment, life)
- Technology advancements: high-throughput biological experiments, genomics, proteomics, etc.

The real challenge (cont'd)
- Huge influx of data produced in biotech and health care (e.g. >300,000 biochips by OCI alone, >500,000 from Affymetrix, plus Agilent, GE, etc.)
- Many efforts on building tumor banks, etc.
- Patient data becoming available in all forms for:
  - Accurate diagnosis
  - Better treatments
  - Intelligent drug discovery and target validation
- Electronic documents containing reports and results of research
- Many more species are unknown

Data in Genomics and Proteomics
- Genomics: microarrays
- Proteomics: MS, 2D/3D gels, protein arrays
- Sequence data
- The data are: quantitative, qualitative, complex, multi-layered, incomplete, informative

Biological Data Analysis
- Microarray Data
- Normalization
- Data Pre-processing (understanding the data)
- Pattern Searching: supervised methods, unsupervised methods
- Interesting Results: differentially expressed genes, models
- Validation, Documentation
- Knowledge Discovery

Contributions and applications
- Functional genomics (gene function identification)
- Gene response analysis
- Comparative genomics
- Disease modeling
- Integrated genomics and proteomics
- Potential for pharmacogenomics and toxicogenomics

Comparative Genomics
Comparative genomics is the study of relationships between the genomes of different species.
- Control and test samples (full samples)
- Hybridize and wash
- Microarray data
- Identify patterns of gene expression

Looking at some case studies: Bioinformatics

Disease Modeling: Leukemia case study
There are two subtypes of acute leukemia, distinguished by origin: lymphoid (ALL) or myeloid (AML).
Expression levels of 6817 genes from AML (acute myeloid leukemia) and ALL (acute lymphoblastic leukemia) patients:
- Training: 38 bone marrow samples (27 ALL, 11 AML)
- Independent: 34 samples (20 ALL, 14 AML) from bone marrow and peripheral blood, from different sources of patients
Objectives:
- Identify the most informative genes
- Build models that can best explain instances

[Figure: overlap between the genes of Golub et al. and those of BioMiner. Labels: Golub et al. 37, Golub et al. & BioMiner 13, BioMiner 37.]

Understanding the data
[Figure: view of the training samples with patients P17 and P20 labelled. Source file: VR\ALLG\lam_train_lab_t_20020709_173359.html]

Predictor Model and Informative Genes
- Clear distinction between two classes: ALL and AML
- Attributes are 50 genes identified by “discover and mask”
- Top 50 genes highlight 37 new targets not previously reported
- BioMiner singled out X95735 (Zyxin, cell adhesion) as the most informative gene
- The predictor model is: If X95735 <= 938 then ALL; else AML
- Cross-validation (training set): 31/38 (81.58%)
- Test on new cases (independent set): 31/34 (91.18%)
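The one-gene predictor above is simple enough to write down directly. The sketch below applies the rule as stated on the slide (threshold 938 on X95735) and recomputes the two reported accuracies from their counts. The function names `predict_subtype` and `accuracy_percent` are made up for this illustration and are not part of BioMiner.

```python
THRESHOLD = 938  # expression-level cut-off for gene X95735, as given on the slide

def predict_subtype(x95735_level):
    """Apply the rule: at or below the threshold means ALL, otherwise AML."""
    return "ALL" if x95735_level <= THRESHOLD else "AML"

def accuracy_percent(n_correct, n_total):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * n_correct / n_total, 2)

cross_validation = accuracy_percent(31, 38)  # training set
independent_test = accuracy_percent(31, 34)  # independent set
```

A single threshold on one gene is a decision stump; its appeal here is that a biologist can read and check the model at a glance.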

Effects of anomalies on modeling
Four experiments: (i) all 38 patients, (ii) remove patient 20, (iii) remove patient 17, and (iv) remove both patients 20 and 17.
- No improvement in the accuracy of the prediction models.
- Variation in the models and their prediction rates might reflect knowledge lost by removing patients 20 and 17, which have many more anomalies.

Gene Identification in Alzheimer's Disease
- Data: 19.2K arrays from healthy & AD patients
- BioMiner identified 67 genes in 3 categories:
  - 17 genes known to be related to AD
  - 20 genes not reported before
  - 30 genes with unknown functions
- Using an Integrated Data Mining System with strong future potential
- Reported in the Journal of AI in Medicine, July 2004 (Elsevier)

Gene Identification in HCV (CHEO)
- Data: 15.6K arrays from normal & infected mice
- BioMiner identified the 7 most informative genes, highly associated with HCV (Hepatitis C Virus)
- Using an Integrated Data Mining System with strong future potential
- Reported in the IEA/AIE Conference (May 2004)
- Next: CHEO will provide data from human arrays of HIV+ patients (NRC-Spain project)

Case study – multiple experiments
Purpose: study similarities and differences in macrophage (white blood cell, important for our immune system) activation by LPS and three types of Ganoderma.
Experiment:
- Used inflammatory macrophage of HeN mice
- 48 h, 4 treatments: LPS, three Ganoderma (myriam, china, Nan)
- Mouse 15 K arrays from OCI
- Normaliser 3 (Brandon Smith)
Data:
- Experiments Myriam 12693281, Nan 12693388 and China were rejected

Data Analysis process
- Biological problem
- Microarray Experiments
- Methods: CM, SAM, RP
- Results: G1, G2, G3
- Final Results
- Biological/Literature Validation
CM - Cluster Mapping; SAM - Significance Analysis of Microarrays; RP - Rank Products

Methods: SAM, RP and CM
- SAM (Significance Analysis of Microarrays, V. Tusher et al., 2001): assigns a score to each gene on the basis of the change in gene expression relative to the standard deviation of repeated experiments.
- RP (Rank Products, R. Breitling et al., 2004): based on biological reasoning and on the product of each gene's ranks across all experiments.
- CM (Cluster Mapping, Famili et al., 2004): identify clusters of genes with common properties across multiple experiments (E1, E2, ..., En); use centroid data to derive new features and search for patterns, trends, etc.
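To make the first two methods concrete, here is a minimal sketch of the per-gene statistics they rest on. It is a simplification and not the published implementations: `sam_score` takes the small positive constant `s0` as a fixed argument (SAM estimates it from the data), and both functions omit the permutation step used to assess significance. The function names and the input layout are assumptions made for this example. CM is not sketched because the slide does not give enough detail.

```python
import math

def sam_score(test, control, s0=0.0):
    """SAM-style relative difference for one gene.

    Difference of group means divided by (pooled standard error + s0).
    Raises ZeroDivisionError if both groups are constant and s0 is 0.
    """
    n1, n2 = len(test), len(control)
    m1, m2 = sum(test) / n1, sum(control) / n2
    ss = sum((x - m1) ** 2 for x in test) + sum((x - m2) ** 2 for x in control)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def rank_products(log_ratios):
    """Rank product per gene.

    log_ratios maps gene -> list of log fold changes, one per replicate.
    Within each replicate, rank 1 goes to the most up-regulated gene;
    the rank product is the geometric mean of a gene's ranks.
    """
    genes = list(log_ratios)
    k = len(log_ratios[genes[0]])
    ranks = {g: [] for g in genes}
    for i in range(k):
        ordered = sorted(genes, key=lambda g: log_ratios[g][i], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            ranks[g].append(rank)
    return {g: math.prod(r) ** (1.0 / k) for g, r in ranks.items()}
```

A gene that is consistently near the top of every replicate gets a rank product close to 1, which is why small values indicate strong, reproducible up-regulation.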

Results: Significant genes discovered by different methods
Groups:
1. Group 1: genes common to Nan and Myriam
2. Group 2: genes common to Nan, Myriam and LPS
3. Group 3: genes common to Nan and Myriam, but not LPS

Coverage rate (known genes):
Method                Group 1          Group 2           Group 3
CM & SAM & RP         18/46 = 39%      22/48 = 45.8%     2/9 = 22%
SAM & RP (no CM)      4/14 = 28.6%     12/58 = 20.7%     ?/4 = ?
SAM & RP (total)      22/60 = 36.7%    34/106 = 32%      ?
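The coverage rates in the results above are plain ratios, and the "total" row is the sum of the two rows before it. The short check below recomputes them from the counts on the slide; recomputed this way, 12/58 comes to 20.7% and 22/60 to 36.7%. The helper names `coverage` and `combine` are hypothetical.

```python
def coverage(known, total):
    """Coverage rate as a percentage, rounded to one decimal."""
    return round(100.0 * known / total, 1)

def combine(a, b):
    """Add two (known, total) count pairs, e.g. to form the 'total' row."""
    return (a[0] + b[0], a[1] + b[1])

# (known, total) counts for Groups 1 and 2, taken from the slide
with_cm = [(18, 46), (22, 48)]
without_cm = [(4, 14), (12, 58)]

totals = [combine(a, b) for a, b in zip(with_cm, without_cm)]
total_rates = [coverage(k, n) for k, n in totals]
```

Group 3 is left out because its SAM & RP count is marked "?" in the source.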

Some lessons learned
- Understanding the domain/problem is extremely important
- Continuous interaction with domain experts
- Proper data selection, data reduction and feature selection strategies
- Data re-representation (e.g. normalization, constructive induction) is commonly required
- Efficient data mining methods/processes/strategies are essential to knowledge discovery
- And finally: integration, structuring and dissemination of new knowledge in an easily usable structure

Short Summary and the evolution
In the past:
- Lack of large volumes of data
- Sometimes had to simulate data
- Had to convince owners of data to collaborate (e.g. demonstrate)
Now, there is no shortage of:
- Problems/research topics to work on, particularly complex ones
- Real-world data (lots are available)

Short Summary (cont'd)
We have seen many successful applications of Knowledge Discovery methods (Aerospace, Genomics and Proteomics, Drug Discovery, Manufacturing, Finance/Banking, ...)
Key areas of KD that are evolving:
- Automated data analysis
- Integration of systems, tools, database access, etc.
- Intelligent applications
- Handling various forms of data (e.g. text, parametric data, images, etc.)

A simple comparison...
Expert Systems:
- Introduced in the 80's and 90's
- Lots of promise
- Everyone jumped in
- Few results
- Many companies/tools disappeared
- Left a bad impression
Knowledge Discovery:
- Several academic contributions
- Valuable applications with excellent results
- Potential for more research
- R&D will continue for years to come
- Some ups and downs

Knowledge Management
- What is it?
- Why do we need it?
- How can we manage knowledge?

What is Knowledge Management?
Knowledge Management consists of a range of practices and techniques used by organizations to identify, represent and distribute knowledge, know-how, expertise, intellectual capital and other forms of knowledge for leverage, reuse and transfer of knowledge and learning across an organization.
Prime motivation for many researchers in KD:
- What do we do with the vast amount of discovered knowledge?

Why do we need Knowledge Management?
- Facilitate organizational operation/learning
- Achieve shorter new product development cycles
- Facilitate and manage organisational innovation
- Leverage the expertise of people across the organization
- Ensure consistency in good practices

How can we manage it?
- Use of available technologies
- Developing new frameworks and infrastructures, including new tools (many tools already exist)
- Needs everyone's participation
- Requires understanding the culture
- One approach: Decision Support Systems

Current Directions – Evolution of Knowledge Discovery (specific examples)
- Automated knowledge discovery
- Integrated knowledge discovery (e.g. genomics & proteomics, etc., or heterogeneous knowledge discovery)
- Novel applications in bioinformatics:
  - Time-series genomics
  - Phylogenetics
  - Gene identification and disease modeling
- Personalized medicine
  - Data tracking of patients and evaluation of drugs

Resources
- Journals: IEEE, ACM, IDA, KDD and many more
- Books
- Search in Google
- Sites:
  - Kd-nuggets (http://www.kdnuggets.com/)
  - http://www.the-data-mine.com/
- Events:
  - Conferences (KDD, ACM, IEEE, IDA, PKDD, ML, etc.)
  - Tutorials/training sessions/workshops
