Published on

1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Introduction <ul><li>Motivation: Why data mining? </li></ul><ul><li>What is data mining? </li></ul><ul><li>Data Mining: On what kind of data? </li></ul><ul><li>Data mining functionality </li></ul><ul><li>Are all the patterns interesting? </li></ul><ul><li>Classification of data mining systems </li></ul><ul><li>Major issues in data mining </li></ul>
  2. 2. <ul><li>Moore’s law: processing “capacity” doubles every 18 months : CPU, cache, memory </li></ul><ul><li>It’s more aggressive cousin: </li></ul><ul><ul><li>Disk storage “capacity” doubles every 9 months </li></ul></ul>Usama Fayyad <ul><li>What do the two “laws” combined produce? </li></ul><ul><ul><li>A rapidly growing gap between our ability to generate data, and our ability to make use of it. </li></ul></ul>
  3. 3. <ul><li>Library of Congress, a proxy for recorded knowledge, is at about 10 Terabytes of text </li></ul><ul><li>All LoC, video, audio, etc is under 3 Petabytes </li></ul><ul><li>Etymology: </li></ul><ul><ul><li>Gigabyte (10 9 ) comes from Latin Gigas , for Giant </li></ul></ul><ul><ul><li>Terabyte (10 12 ) comes from Greek Teras for Monster </li></ul></ul><ul><ul><ul><li>Jim Gray used to call Terabytes, “Terror Bytes”… </li></ul></ul></ul><ul><ul><li>Names become more boring: Peta, Exa, then </li></ul></ul><ul><ul><ul><li>Zetta (10 21 ) -- ultimate (last letter) </li></ul></ul></ul><ul><ul><ul><li>Yotta (10 24 ) -- afterthought… </li></ul></ul></ul><ul><li>As of 2000, 11% of all information generated by humankind was generated in 1999 alone </li></ul>Usama Fayyad
  4. 4. Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation, IMS and network DBMS </li></ul></ul><ul><li>1970s: </li></ul><ul><ul><li>Relational data model, relational DBMS implementation </li></ul></ul><ul><li>1980s: </li></ul><ul><ul><li>RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) </li></ul></ul><ul><li>1990s—2000s: </li></ul><ul><ul><li>Data mining and data warehousing, multimedia databases, and Web databases </li></ul></ul>
  5. 5. Motivation: “Necessity is the Mother of Invention” <ul><li>Data explosion problem </li></ul><ul><ul><li>Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories </li></ul></ul><ul><li>We are drowning in data, but starving for knowledge! </li></ul><ul><li>Solution: Data warehousing and data mining </li></ul><ul><ul><li>Data warehousing and on-line analytical processing </li></ul></ul><ul><ul><li>Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases </li></ul></ul>
  6. 6. What is Data Mining? <ul><li>Finding interesting structure in data </li></ul><ul><li>Structure: refers to statistical patterns, predictive models, hidden relationships </li></ul><ul><li>Examples of tasks addressed by Data Mining </li></ul><ul><ul><li>Predictive Modeling (classification, regression) </li></ul></ul><ul><ul><li>Segmentation (Data Clustering ) </li></ul></ul><ul><ul><li>Affinity (Summarization) </li></ul></ul><ul><ul><ul><li>relations between fields, associations, visualization </li></ul></ul></ul>
  7. 7. What Is Data Mining? <ul><li>Data mining (knowledge discovery in databases): </li></ul><ul><ul><li>Extraction of interesting ( non-trivial, implicit , previously unknown and potentially useful) information or patterns from data in large databases </li></ul></ul><ul><li>Alternative names: </li></ul><ul><ul><li>Data mining: a misnomer? </li></ul></ul><ul><ul><li>Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. </li></ul></ul><ul><li>What is not data mining? </li></ul><ul><ul><li>query processing. </li></ul></ul><ul><ul><li>Expert systems or small ML/statistical programs </li></ul></ul>
  8. 8. Why Data Mining? — Potential Applications <ul><li>Database analysis and decision support </li></ul><ul><ul><li>Market analysis and management </li></ul></ul><ul><ul><ul><li>target marketing, customer relation management, market basket analysis, cross selling, market segmentation </li></ul></ul></ul><ul><ul><li>Risk analysis and management </li></ul></ul><ul><ul><ul><li>Forecasting, customer retention, improved underwriting, quality control, competitive analysis </li></ul></ul></ul><ul><ul><li>Fraud detection and management </li></ul></ul><ul><li>Other Applications </li></ul><ul><ul><li>Text mining (news group, email, documents) and Web analysis. </li></ul></ul><ul><ul><li>Intelligent query answering </li></ul></ul>
  9. 9. Specific applications <ul><li>Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction can use demographic, diet and clinical measurements for that patient </li></ul><ul><li>Predict the price of a stock six months from now, on the basis of company performance measures and economic data </li></ul><ul><li>Identify the numbers in a handwritten ZIP code, from a digitized image </li></ul><ul><li>Estimate the amount of glucose in the blood of a diabetic person from the infrared spectrum of the person’s blood </li></ul>
  10. 10. Specific applications <ul><li>Identify the risk factors for prostate cancer, based on clinical and demographic variables </li></ul><ul><li>Identify significant financial news stories in a stream of news stories </li></ul><ul><li>Identify groups of web-customers with similar browsing behavior </li></ul><ul><li>Distinguish pornographic from non-pornographic web pages </li></ul>
  11. 11. Data Mining: A KDD Process <ul><ul><li>Data mining: the core of knowledge discovery process. </li></ul></ul>Data Cleaning Data Integration Databases Data Warehouse Knowledge Task-relevant Data Selection Data Mining Pattern Evaluation
  12. 12. Architecture of a Typical Data Mining System Data Warehouse Data cleaning & data integration Filtering Databases Database or data warehouse server Data mining engine Pattern evaluation Graphical user interface Knowledge-base
  13. 13. Data Mining: On What Kind of Data? <ul><li>Relational databases </li></ul><ul><li>Data warehouses </li></ul><ul><li>Transactional databases </li></ul><ul><li>Advanced DB and information repositories </li></ul><ul><ul><li>Object-oriented and object-relational databases </li></ul></ul><ul><ul><li>Spatial databases </li></ul></ul><ul><ul><li>Time-series data and temporal data </li></ul></ul><ul><ul><li>Text databases and multimedia databases </li></ul></ul><ul><ul><li>Heterogeneous and legacy databases </li></ul></ul><ul><ul><li>WWW </li></ul></ul>
  14. 14. Data Mining Functionalities (1) <ul><li>Exploratory Data Analysis </li></ul><ul><ul><li>Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions </li></ul></ul><ul><li>Association </li></ul><ul><ul><li>age(X, “20..29”) ^ income(X, “20..29K”)  buys(X, “PC”) [support = 2%, confidence = 60%] </li></ul></ul>
  15. 15. Data Mining Functionalities (2) <ul><li>Classification and Prediction </li></ul><ul><ul><li>Finding models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><li>Presentation: decision-tree, classification rule, neural network </li></ul></ul><ul><ul><li>Prediction: Predict some unknown or missing numerical values </li></ul></ul><ul><li>Descriptive Modeling </li></ul><ul><ul><li>Cluster Analysis. Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Density Estimation </li></ul></ul>
  16. 16. Data Mining Functionalities (3) <ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: a data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis </li></ul></ul><ul><li>Trend and evolution analysis </li></ul><ul><ul><li>Trend and deviation: regression analysis </li></ul></ul><ul><ul><li>Sequential pattern mining, periodicity analysis </li></ul></ul><ul><ul><li>Similarity-based analysis </li></ul></ul><ul><li>Other pattern-directed or statistical analyses </li></ul>
  17. 17. Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Other Disciplines Information Science Machine Learning Visualization
  18. 18. Major Issues in Data Mining (1) <ul><li>Mining methodology and user interaction </li></ul><ul><ul><li>Mining different kinds of knowledge in databases </li></ul></ul><ul><ul><li>Interactive mining of knowledge at multiple levels of abstraction </li></ul></ul><ul><ul><li>Incorporation of background knowledge </li></ul></ul><ul><ul><li>Data mining query languages and ad-hoc data mining </li></ul></ul><ul><ul><li>Expression and visualization of data mining results </li></ul></ul><ul><ul><li>Handling noise and incomplete data </li></ul></ul><ul><ul><li>Pattern evaluation: the interestingness problem </li></ul></ul><ul><li>Performance and scalability </li></ul><ul><ul><li>Efficiency and scalability of data mining algorithms </li></ul></ul><ul><ul><li>Parallel, distributed and incremental mining methods </li></ul></ul>
  19. 19. Major Issues in Data Mining (2) <ul><li>Issues relating to the diversity of data types </li></ul><ul><ul><li>Handling relational and complex types of data </li></ul></ul><ul><ul><li>Mining information from heterogeneous databases and global information systems (WWW) </li></ul></ul><ul><li>Issues related to applications and social impacts </li></ul><ul><ul><li>Application of discovered knowledge </li></ul></ul><ul><ul><ul><li>Domain-specific data mining tools </li></ul></ul></ul><ul><ul><ul><li>Intelligent query answering </li></ul></ul></ul><ul><ul><ul><li>Process control and decision making </li></ul></ul></ul><ul><ul><li>Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem </li></ul></ul><ul><ul><li>Protection of data security, integrity, and privacy </li></ul></ul>