Knowledge discovery thru data mining


Keynote address delivered on 23rd March 2011 at the Workshop on Data Mining and Computational Biology in Bioinformatics, sponsored by DBT India and organised by the Unit of Simulation and Informatics, IARI, New Delhi.
I do not claim any originality for either the slides or their content, and in fact acknowledge various web sources.

Published in: Education

  1. 1. Knowledge Discovery through Data Mining. C. Devakumar, Indian Council of Agricultural Research, New Delhi-110 012 [email_address]
  2. 2. Introduction <ul><li>Motivation: Why data mining? </li></ul><ul><li>What is data mining? </li></ul><ul><li>Data Mining: On what kind of data? </li></ul><ul><li>Data mining functionality </li></ul><ul><li>Classification of data mining systems </li></ul><ul><li>Major issues in data mining </li></ul>
  3. 3. Why Data Mining? <ul><li>The Explosive Growth of Data: from terabytes to petabytes </li></ul><ul><ul><li>Data collection and data availability </li></ul></ul><ul><ul><ul><li>Automated data collection tools, database systems, Web, computerized society </li></ul></ul></ul><ul><ul><li>Major sources of abundant data </li></ul></ul><ul><ul><ul><li>Business: Web, e-commerce, transactions, stocks, … </li></ul></ul></ul><ul><ul><ul><li>Science: Remote sensing, bioinformatics, scientific simulation, … </li></ul></ul></ul><ul><ul><ul><li>Society and everyone: news, digital cameras, YouTube </li></ul></ul></ul><ul><li>We are drowning in data, but starving for knowledge! </li></ul><ul><li>“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets </li></ul>
  4. 4. Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation, IMS and network DBMS </li></ul></ul><ul><li>1970s: </li></ul><ul><ul><li>Relational data model, relational DBMS implementation </li></ul></ul><ul><li>1980s: </li></ul><ul><ul><li>RDBMS, advanced data models (extended-relational, OO, deductive, etc.) </li></ul></ul><ul><ul><li>Application-oriented DBMS (spatial, scientific, engineering, etc.) </li></ul></ul><ul><li>1990s: </li></ul><ul><ul><li>Data mining, data warehousing, multimedia databases, and Web databases </li></ul></ul><ul><li>2000s </li></ul><ul><ul><li>Stream data management and mining </li></ul></ul><ul><ul><li>Data mining and its applications </li></ul></ul><ul><ul><li>Web technology (XML, data integration) and global information systems </li></ul></ul>
  5. 5. What is Data Mining? <ul><li>Many Definitions </li></ul><ul><ul><li>Non-trivial extraction of implicit, previously unknown and potentially useful information from data </li></ul></ul><ul><ul><li>Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns </li></ul></ul>
  6. 6. What is not Data Mining? <ul><li>Look up a phone number in a phone directory </li></ul><ul><li>Query a Web search engine for information </li></ul>
  7. 7. Origins of Data Mining <ul><li>Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems </li></ul><ul><li>Traditional techniques may be unsuitable due to </li></ul><ul><ul><li>Enormity of data </li></ul></ul><ul><ul><li>High dimensionality of data </li></ul></ul><ul><ul><li>Heterogeneous, distributed nature of data </li></ul></ul>[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
  8. 8. Data + Interestingness criteria = Hidden patterns
  9. 9. Data + Interestingness criteria = Hidden patterns [Callout: Type of Patterns]
  10. 10. Data + Interestingness criteria = Hidden patterns [Callouts: Type of data; Type of Interestingness criteria]
  11. 11. Type of Data <ul><li>Tabular (Ex: Transaction data) </li></ul><ul><ul><li>Relational </li></ul></ul><ul><ul><li>Multi-dimensional </li></ul></ul><ul><li>Spatial (Ex: Remote sensing data) </li></ul><ul><li>Temporal (Ex: Log information) </li></ul><ul><ul><li>Streaming (Ex: multimedia, network traffic) </li></ul></ul><ul><ul><li>Spatio-temporal (Ex: GIS) </li></ul></ul><ul><li>Tree (Ex: XML data) </li></ul><ul><li>Graphs (Ex: WWW, BioMolecular data) </li></ul><ul><li>Sequence (Ex: DNA, activity logs) </li></ul><ul><li>Text, Multimedia … </li></ul>
  12. 12. Type of Interestingness <ul><li>Frequency </li></ul><ul><li>Rarity </li></ul><ul><li>Correlation </li></ul><ul><li>Length of occurrence (for sequence and temporal data) </li></ul><ul><li>Consistency </li></ul><ul><li>Repeating / periodicity </li></ul><ul><li>“Abnormal” behavior </li></ul><ul><li>Other patterns of interestingness… </li></ul>
  13. 13. Statistics: Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
  14. 14. Data mining: Data → Mining Algorithm Based on Interestingness → Pattern (model, rule, hypothesis) discovery
  15. 15. Explores Your Data; Finds Patterns; Performs Predictions
  16. 16. [Figure: Business Insight versus Role of Software. Stages: Presentation, Exploration, Discovery. Role of software: Passive, Interactive, Proactive. Tools: Canned reporting, Ad-hoc reporting, OLAP, Data mining. Also labelled: Predictive Analysis]
  17. 17. Data Mining Tasks <ul><li>Prediction Methods </li></ul><ul><ul><li>Use some variables to predict unknown or future values of other variables. </li></ul></ul><ul><li>Description Methods </li></ul><ul><ul><li>Find human-interpretable patterns that describe the data. </li></ul></ul>
  18. 18. Data Mining Tasks... <ul><li>Classification [Predictive] </li></ul><ul><li>Regression [Predictive] </li></ul><ul><li>Clustering [Descriptive] </li></ul><ul><li>Association Rule Discovery [Descriptive] </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><ul><li>Semi-supervised Clustering </li></ul></ul><ul><ul><li>Semi-supervised Classification </li></ul></ul>
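As one illustration of a descriptive task, association rule discovery can be sketched in a few lines. The slides do not name any interestingness measures for rules; the sketch below assumes the two commonly used ones, support and confidence, and runs on a made-up set of market-basket transactions. The function names and thresholds are hypothetical and chosen only for this example.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data, not from the talk).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base


def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """List single-item rules lhs -> rhs that pass both (assumed) thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, c in combinations(items, 2):
        for lhs, rhs in ((a, c), (c, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            conf = confidence({lhs}, {rhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, s, conf))
    return rules


for lhs, rhs, s, conf in pair_rules(transactions):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={conf:.2f}")
```

This brute-force enumeration is only practical for a handful of items; real association miners prune the search space instead of testing every pair.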
  19. 19. Challenges of Data Mining <ul><li>Scalability </li></ul><ul><li>Dimensionality </li></ul><ul><li>Complex and Heterogeneous Data </li></ul><ul><li>Data Quality </li></ul><ul><li>Data Ownership and Distribution </li></ul><ul><li>Privacy Preservation </li></ul><ul><li>Streaming Data </li></ul>
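  20. 20. Data Mining and Business Intelligence <ul><li>Layers, listed from highest to lowest potential to support business decisions: </li></ul><ul><ul><li>Decision Making </li></ul></ul><ul><ul><li>Data Presentation: Visualization Techniques </li></ul></ul><ul><ul><li>Data Mining: Information Discovery </li></ul></ul><ul><ul><li>Data Exploration: Statistical Summary, Querying, and Reporting </li></ul></ul><ul><ul><li>Data Preprocessing/Integration, Data Warehouses </li></ul></ul><ul><ul><li>Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems </li></ul></ul><ul><li>Roles, in the same order: End User, Business Analyst, Data Analyst, DBA </li></ul>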
  21. 21. Data Mining: Confluence of Multiple Disciplines [Figure: Data Mining surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
  22. 22. Architecture: Typical Data Mining System [Figure: components, from the user down to the data: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories. The Knowledge-Base is shown as a separate component.]
  23. 23. Multi-Dimensional View of Data Mining <ul><li>Data to be mined </li></ul><ul><ul><li>Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW </li></ul></ul><ul><li>Knowledge to be mined </li></ul><ul><ul><li>Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. </li></ul></ul><ul><ul><li>Multiple/integrated functions and mining at multiple levels </li></ul></ul><ul><li>Techniques utilized </li></ul><ul><ul><li>Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. </li></ul></ul><ul><li>Applications adapted </li></ul><ul><ul><li>Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. </li></ul></ul>
  24. 24. Data Mining: Classification Schemes <ul><li>General functionality </li></ul><ul><ul><li>Descriptive data mining </li></ul></ul><ul><ul><li>Predictive data mining </li></ul></ul><ul><li>Different views lead to different classifications </li></ul><ul><ul><li>Data view: Kinds of data to be mined </li></ul></ul><ul><ul><li>Knowledge view: Kinds of knowledge to be discovered </li></ul></ul><ul><ul><li>Method view: Kinds of techniques utilized </li></ul></ul><ul><ul><li>Application view: Kinds of applications adapted </li></ul></ul>
  25. 25. Data Mining Functionalities <ul><li>Multidimensional concept description: Characterization and discrimination </li></ul><ul><ul><li>Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions </li></ul></ul><ul><li>Frequent patterns, association, correlation vs. causality </li></ul><ul><li>Classification and prediction </li></ul><ul><ul><li>Construct models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><ul><li>E.g., classify countries based on (climate), or classify cars based on (gas mileage) </li></ul></ul></ul><ul><ul><li>Predict some unknown or missing numerical values </li></ul></ul>
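The "classify cars based on gas mileage" example above can be made concrete with a very small classifier. The following is a minimal sketch, assuming a nearest-centroid rule (a technique chosen here for brevity, not one named in the slides) and invented training values; the class labels `economy` and `low-mileage` are hypothetical.

```python
from statistics import mean

# Hypothetical labelled training data: (miles per gallon, class label).
training = [
    (45.0, "economy"), (50.0, "economy"), (40.0, "economy"),
    (15.0, "low-mileage"), (18.0, "low-mileage"), (12.0, "low-mileage"),
]


def fit_centroids(samples):
    """Build the model: the mean mileage of each class."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(value, centroids):
    """Predict the class whose mean mileage is closest to `value`."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


centroids = fit_centroids(training)
print(classify(38.0, centroids))
print(classify(20.0, centroids))
```

The two steps mirror the slide: `fit_centroids` constructs a model that describes the classes, and `classify` uses it for future prediction.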
  26. 26. Data Mining Functionalities (2) <ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Maximizing intra-class similarity & minimizing inter-class similarity </li></ul></ul><ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: Data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>Noise or exception? Useful in fraud detection, rare events analysis </li></ul></ul><ul><li>Trend and evolution analysis </li></ul><ul><ul><li>Trend and deviation: e.g., regression analysis </li></ul></ul><ul><ul><li>Sequential pattern mining: e.g., digital camera → large SD memory </li></ul></ul><ul><ul><li>Periodicity analysis </li></ul></ul><ul><ul><li>Similarity-based analysis </li></ul></ul><ul><li>Other pattern-directed or statistical analyses </li></ul>
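Outlier analysis, as described above, flags objects that do not comply with the general behavior of the data. One simple way to sketch this, assuming a z-score rule (an illustrative choice, not a method given in the slides), is to flag values that lie more than a chosen number of standard deviations from the mean. The data and the threshold of 2.0 are made up for the example.

```python
from statistics import mean, pstdev

# Hypothetical measurements with one value that departs from the rest.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]


def find_outliers(values, threshold=2.0):
    """Return values whose absolute z-score exceeds `threshold`."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]


print(find_outliers(readings))
```

Whether a flagged value is noise or a genuine exception, such as a fraudulent transaction, still has to be decided with domain knowledge. The rule is also sensitive to the outliers themselves, because they inflate the mean and the standard deviation.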
  27. 27. Major Issues in Data Mining <ul><li>Mining methodology </li></ul><ul><ul><li>Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web </li></ul></ul><ul><ul><li>Performance: efficiency, effectiveness, and scalability </li></ul></ul><ul><ul><li>Pattern evaluation: the interestingness problem </li></ul></ul><ul><ul><li>Incorporation of background knowledge </li></ul></ul><ul><ul><li>Handling noise and incomplete data </li></ul></ul><ul><ul><li>Parallel, distributed and incremental mining methods </li></ul></ul><ul><ul><li>Integration of the discovered knowledge with existing one: knowledge fusion </li></ul></ul><ul><li>User interaction </li></ul><ul><ul><li>Data mining query languages and ad-hoc mining </li></ul></ul><ul><ul><li>Expression and visualization of data mining results </li></ul></ul><ul><ul><li>Interactive mining of knowledge at multiple levels of abstraction </li></ul></ul><ul><li>Applications and social impacts </li></ul><ul><ul><li>Domain-specific data mining & invisible data mining </li></ul></ul><ul><ul><li>Protection of data security, integrity, and privacy </li></ul></ul>
  28. 28. Bioinformatics, Computational Biology, Data Mining <ul><li>Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems. </li></ul><ul><li>Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g. </li></ul><ul><ul><li>Genomes (viruses, bacteria, fungi, plants, insects,…) </li></ul></ul><ul><ul><li>Proteins and Proteomes </li></ul></ul><ul><ul><li>Biological Sequences </li></ul></ul><ul><ul><li>Molecular Function and Structure </li></ul></ul><ul><li>Data Mining is searching for knowledge in data </li></ul><ul><ul><li>Knowledge mining from databases </li></ul></ul><ul><ul><li>Knowledge extraction </li></ul></ul><ul><ul><li>Data/pattern analysis </li></ul></ul><ul><ul><li>Data dredging </li></ul></ul><ul><ul><li>Knowledge Discovery in Databases (KDD) </li></ul></ul>
  29. 29. Problems in Bioinformatics Domain <ul><ul><li>Data production at the levels of molecules, cells, organs, organisms, populations </li></ul></ul><ul><ul><li>Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, … </li></ul></ul><ul><ul><li>Prediction of Molecular Function and Structure </li></ul></ul><ul><ul><li>Computational biology: synthesis (simulations) and analysis (machine learning) </li></ul></ul>
  30. 30. <ul><li>Tokenization splits a text document into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces. </li></ul><ul><li>Filtering methods remove words like articles, conjunctions, prepositions, etc. </li></ul><ul><li>Lemmatization methods try to map verb forms to the infinitive and nouns to their singular form. </li></ul><ul><li>Stemming methods attempt to build the basic forms of words, for example, by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes. </li></ul><ul><li>Additional linguistic preprocessing </li></ul><ul><ul><li>N-grams individualization: generic n-word sequences that do not necessarily correspond to an idiomatic use </li></ul></ul><ul><ul><li>Anaphora resolution identifies the relationship between a linguistic expression (anaphora) and its preceding phrase, thus determining the corresponding reference </li></ul></ul><ul><ul><li>Part-of-speech (POS) tagging determines the part-of-speech tag (noun, verb, adjective, etc.) for each term </li></ul></ul><ul><ul><li>Text chunking aims at grouping adjacent words in a sentence </li></ul></ul><ul><ul><li>Word Sense Disambiguation (WSD) tries to resolve the ambiguity in the meaning of single words or phrases </li></ul></ul><ul><ul><li>Parsing produces a full parse tree of a sentence (subject, object, etc.) </li></ul></ul>
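The first preprocessing steps on this slide (tokenization, filtering, stemming) and n-gram extraction can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the stop-word list is a tiny made-up sample, lower-casing is added for convenience, and `naive_stem` only strips the 's' and 'ing' endings mentioned above rather than implementing a real stemming algorithm.

```python
import re

# A tiny illustrative stop-word list; real systems use far larger ones.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}


def tokenize(text):
    """Lower-case the text and keep runs of letters/digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stop_words(tokens):
    """Remove articles, conjunctions, prepositions and similar words."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a trailing 'ing' or plural 's'; deliberately crude."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def ngrams(tokens, n):
    """All runs of n adjacent tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text):
    return [naive_stem(t) for t in filter_stop_words(tokenize(text))]


terms = preprocess("The proteins are binding to the receptors.")
print(terms)
print(ngrams(terms, 2))
```

Lemmatization, POS tagging, chunking, word sense disambiguation and parsing need linguistic resources and are normally delegated to a toolkit rather than written by hand.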
  31. 31. Castellano, M. et al. A bioinformatics knowledge discovery in text application for grid computing. BMC Bioinformatics 2009, 10(Suppl 6):S23.
  32. 32. BIOINFORMATICS ARCHITECTURE The layered architecture consists of the GATE 4.0 toolkit for text mining, a middleware solution written using Java APIs, the grid infrastructure middleware, and a physical layer running a GNU/Linux operating system. GATE, an integrated development environment, was used for the text mining process. It operated on a collection of full-text scientific publications (in PDF format) available on MEDLINE/PubMed.
  33. 33. Technology Platform
  34. 34. What is New in SQL Server 2008? Data Mining Enhancements <ul><li>In addition to other new aspects of SQL Server: </li></ul><ul><ul><li>Enhanced Mining Structures </li></ul></ul><ul><ul><ul><li>Easier to prepare and test your models </li></ul></ul></ul><ul><ul><ul><li>Models allow for cross-validation </li></ul></ul></ul><ul><ul><ul><li>Filtering </li></ul></ul></ul><ul><ul><li>Algorithm Updates </li></ul></ul><ul><ul><ul><li>Improved Time Series algorithm combining best of ARIMA and ARTXP </li></ul></ul></ul><ul><ul><ul><li>“What-If” analysis </li></ul></ul></ul><ul><li>Microsoft Data Mining Framework </li></ul><ul><ul><li>Supplements CRISP-DM </li></ul></ul>
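For readers unfamiliar with cross-validation, the general idea can be sketched independently of any product. The snippet below is not SQL Server's API; it is a generic illustration, with a hypothetical helper `k_fold_indices`, of how a dataset is divided into k folds so that each fold serves once as the test set while the remaining folds are used for training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` split and scored on the matching `test` split, and the scores averaged. In practice the rows are usually shuffled before splitting.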
  35. 35. Microsoft DM Competitors <ul><li>SAS, largest market share of DM, specialised product for traditional experts </li></ul><ul><li>SPSS (Clementine), strength in statistical analysis </li></ul><ul><li>IBM (Intelligent Miner) tied to DB2, interoperates with Microsoft through PMML </li></ul><ul><li>Oracle (10g), supports Java APIs </li></ul><ul><li>Angoss (KnowledgeSTUDIO), result visualisation, works with SQL Server </li></ul><ul><li>KXEN, supports OLAP and Excel </li></ul>
  36. 37. <ul><li>Data mining: Discovering interesting patterns from large amounts of data </li></ul><ul><li>A natural evolution of database technology, in great demand, with wide applications </li></ul><ul><li>Includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation </li></ul><ul><li>Mining can be performed in a variety of information repositories </li></ul><ul><li>Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. </li></ul><ul><li>Data mining systems and architectures </li></ul><ul><li>Major issues in data mining; Let’s mine for valuable gems of knowledge in our databases! </li></ul>