Introduction to Data Mining

Published in: Education

  1. 1. INTRODUCTION
  2. 2. A young, fast-growing, and promising field
  3. 3. INTRODUCTION
       • Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process.
       • It extracts hidden information from data.
       • It is an interdisciplinary subfield of computer science.
       • It is the computational process of discovering patterns in large data sets.
       • It involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
  4. 4. INTRODUCTION (CONTD.)
       The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves:
       • database and data management aspects
       • data pre-processing
       • model and inference considerations
       • complexity considerations
       • post-processing of discovered structures
       • visualization
       • online updating
  5. 5. Why Data Mining?
       • The explosive growth of data: from terabytes to petabytes
           • E.g., global backbone telecommunication networks carry tens of petabytes every day (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
       • Data collection and data availability
           • Automated data collection tools, database systems, the Web, computerized society
       • Major sources of abundant data
           • Business: Web, e-commerce, transactions, stocks, …
           • Science: remote sensing, bioinformatics, scientific simulation, …
           • Society and everyone: news, digital cameras, …
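The unit arithmetic on this slide can be checked with a short sketch. The constant and function names below are illustrative assumptions, not part of the slides; each step up is a factor of 1024, as the slide states.

```python
# Binary storage units as stated on the slide: each step is a factor of 1024.
GIGABYTE = 1024 ** 3          # bytes in one gigabyte
TERABYTE = 1024 * GIGABYTE    # 1024 gigabytes = 1 terabyte
PETABYTE = 1024 * TERABYTE    # 1024 terabytes = 1 petabyte


def petabytes_to_gigabytes(petabytes: float) -> float:
    """Convert a size in petabytes to gigabytes (hypothetical helper)."""
    return petabytes * PETABYTE / GIGABYTE


print(petabytes_to_gigabytes(1))  # 1048576.0, i.e. 1024 * 1024 gigabytes
```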
  6. 6. Why Data Mining?
       • “Necessity is the mother of invention”
       • Data mining: automated analysis of massive data sets
  7. 7. What Motivated Data Mining?
       • We are drowning in data, but starving for knowledge!
  8. 8. Evolution of Database Technology
       Data mining can be viewed as a result of the natural evolution of information technology.
       • 1960s: data collection, database creation, and network DBMS
       • 1970s: relational data model, relational DBMS implementation
       • 1980s: RDBMS and advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
  9. 9. Evolution of Database Technology
       • 1990s: data mining, data warehousing, multimedia databases, and Web databases
       • 2000s:
           • Stream data management and mining
           • Data mining and its applications
           • Web technology (XML, data integration) and global information systems
  10. 10. 10
  11. 11. What Is Data Mining?
       • Data mining (knowledge discovery from data)
           • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
       • Alternative names
           • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
       • Watch out: is everything “data mining”?
           • Simple search and query processing
           • (Deductive) expert systems
  12. 12. Data Mining: Confluence of Multiple Disciplines
       Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, and Other Disciplines.
  13. 13. Knowledge Discovery (KDD) Process
       • Data mining: the core of the knowledge discovery process
       Figure (flow): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
  14. 14. Knowledge Process
       1. Data cleaning: to remove noise and inconsistent data
       2. Data integration: to combine multiple sources
       3. Data selection: to retrieve the data relevant to the analysis
       4. Data transformation: to transform data into a form appropriate for data mining
       5. Data mining: the essential process in which intelligent methods are applied to extract data patterns
       6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures
       7. Knowledge presentation: visualization and representation techniques
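The seven steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the purchase records, the function names, and the "count of at least 2" interestingness threshold are all hypothetical, and simple frequency counting stands in for a real mining method.

```python
from collections import Counter

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("ann", "milk"), ("bob", "bread"), ("ann", None), ("cy", "Milk ")]
store_b = [("dee", "milk"), ("bob", "BREAD"), ("eve", "eggs")]


def clean(records):
    """Step 1, data cleaning: drop records with a missing item."""
    return [(cust, item) for cust, item in records if item is not None]


def integrate(*sources):
    """Step 2, data integration: combine multiple sources into one list."""
    return [record for source in sources for record in source]


def select(records):
    """Step 3, data selection: keep only the attribute relevant to the analysis."""
    return [item for _, item in records]


def transform(items):
    """Step 4, data transformation: normalize case and whitespace."""
    return [item.strip().lower() for item in items]


def mine(items):
    """Step 5, data mining: count how often each item occurs (a stand-in method)."""
    return Counter(items)


def evaluate(counts, min_count=2):
    """Step 6, pattern evaluation: keep patterns that meet the assumed threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}


def present(patterns):
    """Step 7, knowledge presentation: render the patterns as sorted text lines."""
    return [f"{item}: {n}" for item, n in sorted(patterns.items())]


records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(records))))
report = present(patterns)
print("\n".join(report))
```

Keeping each step as its own function mirrors the slide's ordering and makes it easy to swap the counting step for a different mining method.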
  15. 15. Example: A Web Mining Framework
       Web mining usually involves:
       • Data cleaning
       • Data integration from multiple sources
       • Warehousing the data
       • Data cube construction
       • Data selection for data mining
       • Data mining
       • Presentation of the mining results
       • Patterns and knowledge to be used or stored in a knowledge base
  16. 16. Data Mining in Business Intelligence
       Figure: layered view, from top to bottom (the potential to support business decisions increases toward the top):
       • Decision Making
       • Data Presentation: Visualization Techniques
       • Data Mining: Information Discovery
       • Data Exploration: Statistical Summary, Querying, and Reporting
       • Data Preprocessing/Integration, Data Warehouses
       • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
       Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA.
  17. 17. KDD Process: A Typical View from ML and Statistics
       Figure (flow): Input Data → Data Pre-Processing → Data Mining → Post-Processing
       • Data pre-processing: data integration, normalization, feature selection, dimension reduction
       • Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
       • Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
       This is a view from typical machine learning and statistics communities.
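As one concrete instance of the pre-processing stage, normalization can be sketched with min-max scaling. The slide only names "Normalization"; choosing min-max scaling, the function name, and the income values are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1].

    Min-max scaling is one common normalization choice; the slide does not
    say which method is meant, so this is an assumed example.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20000, 35000, 50000]   # hypothetical annual incomes
scaled = min_max_normalize(incomes)
print(scaled)  # [0.0, 0.5, 1.0]
```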
  18. 18. Data Mining: On What Kinds of Data?
       • Database-oriented data sets and applications
           • Relational database, data warehouse, transactional database
       • Advanced data sets and advanced applications
           • Data streams and sensor data
           • Time-series data, temporal data, sequence data (incl. bio-sequences)
           • Structured data, graphs, social networks, and multi-linked data
           • Object-relational databases
           • Heterogeneous databases and legacy databases
           • Spatial data
           • Multimedia databases
           • Text databases
           • The World-Wide Web
  19. 19. RDBMS
       • A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model.
       • Data in a single table represents a relation.
       • Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.
       • A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.
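Primary and foreign keys can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `customer` and `purchase` tables, their columns, and the sample rows are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# cust_id is the primary key: it uniquely identifies each customer row.
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT NOT NULL)"
)
# purchase.cust_id is a foreign key pointing at the primary key of customer.
conn.execute(
    """CREATE TABLE purchase (
           purchase_id INTEGER PRIMARY KEY,
           cust_id INTEGER NOT NULL REFERENCES customer(cust_id),
           item TEXT NOT NULL)"""
)
conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
conn.execute("INSERT INTO purchase VALUES (10, 1, 'laptop')")


def insert_fails(sql):
    """Return True if the statement violates a key constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A second row with the same primary key is rejected.
duplicate_key_rejected = insert_fails("INSERT INTO customer VALUES (1, 'Bob')")
# A purchase that references a non-existent customer is rejected.
dangling_reference_rejected = insert_fails(
    "INSERT INTO purchase VALUES (11, 99, 'phone')"
)
print(duplicate_key_rejected, dangling_reference_rejected)
```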
  20. 20. RDBMS
       • Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
       • The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
       • The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
       • Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
       • The relational database has become the predominant choice for storing data.
  21. 21. Relational database terminology. A relation is defined as a set of tuples that have the same attributes.
  22. 22. RDBMS (contd.)
       • Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch
       • Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category)
       • Tables are also used to represent the relationships between or among multiple entities
       • Database queries (SQL): data is accessed using relational operations such as join, selection, and projection
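The three relational operations named here can be sketched as SQL queries run through Python's built-in `sqlite3` module. The `purchase` table, the reduced column set, and all sample rows are assumptions for illustration; they are not the AllElectronics data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER);
    CREATE TABLE purchase (cust_id INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'Ann', 34), (2, 'Bob', 22), (3, 'Cy', 45);
    INSERT INTO purchase VALUES (1, 'laptop'), (3, 'camera'), (3, 'printer');
    """
)

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM customer WHERE age > 30 ORDER BY cust_id"
).fetchall()

# Projection: keep only some of the columns.
projection = conn.execute(
    "SELECT cust_name FROM customer ORDER BY cust_id"
).fetchall()

# Join: combine rows of two tables that share a key value.
join = conn.execute(
    """SELECT c.cust_name, p.item
       FROM customer AS c JOIN purchase AS p ON c.cust_id = p.cust_id
       ORDER BY c.cust_id, p.item"""
).fetchall()

print(selection)
print(projection)
print(join)
```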
  23. 23. Mining Relational Databases
       • Data mining can go further than queries by searching for trends or data patterns
       • Examples
           • Analyze customer data to predict the risk of customers based on their income and age
           • Detect deviations, such as sales that differ from the previous year's
       • Relational databases are one of the most commonly available and richest information repositories for data mining
  24. 24. What is a Data Warehouse?
       • Defined in many different ways, but not rigorously
       • A decision support database that is maintained separately from the organization's operational database
       • Supports information processing by providing a solid platform of consolidated, historical data for analysis
       • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.” (W. H. Inmon)
       • Data warehousing: the process of constructing and using data warehouses
  25. 25. DATA WAREHOUSES
       A data warehouse is a repository of information collected from multiple sources, stored under a unified schema. It is constructed via:
       • Data cleaning
       • Data integration
       • Data transformation
       • Data loading and periodic data refreshing
  26. 26. 26
  27. 27. DATA WAREHOUSES (contd.)
       • A data warehouse is modeled by a multidimensional data structure
       • Data cube: precomputation of and fast access to summarized data
           • Each dimension corresponds to an attribute or a set of attributes in a schema
           • Each cell stores the value of some aggregate measure (count, sum, etc.)
       • Example: in AllElectronics the cube has three dimensions:
           • Address (with values USA, Canada, Mexico)
           • Time (with quarter values Q1, Q2, Q3, Q4)
           • Item (with item type values)
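A data cube can be sketched as a dictionary keyed by one value per dimension, with a wildcard standing for "all values" of that dimension. The sales facts, the `ALL` marker, and the function name are hypothetical; the three dimensions follow the example on the slide, and the aggregate measure is a sum.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales facts: (country, quarter, item_type, amount).
facts = [
    ("USA", "Q1", "TV", 100),
    ("USA", "Q1", "PC", 200),
    ("USA", "Q2", "TV", 150),
    ("Canada", "Q1", "TV", 80),
    ("Mexico", "Q2", "PC", 60),
]

ALL = "*"  # wildcard meaning "summed over every value of this dimension"


def build_cube(rows):
    """Precompute the sum of amounts for every combination of dimension values.

    Each fact contributes to 2 ** 3 = 8 cells: for each dimension, either its
    concrete value or the ALL wildcard.
    """
    cube = defaultdict(int)
    for country, quarter, item, amount in rows:
        for key in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[key] += amount
    return cube


cube = build_cube(facts)
print(cube[("USA", ALL, "TV")])   # total TV sales in the USA over all quarters
print(cube[(ALL, ALL, ALL)])      # grand total
```

Because every aggregate is precomputed, each lookup is a single dictionary access, which is the "fast access of summarized data" the slide refers to; the price is the extra storage for the summary cells.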
  28. 28. Multidimensional Data
       • Sales volume as a function of product, month, and region
       • Dimensions: Product, Location, Time
       • Hierarchical summarization paths:
           • Product: Industry → Category → Product
           • Location: Region → Country → City → Office
           • Time: Year → Quarter → Month or Week → Day
  29. 29. A Sample Data Cube
       Figure: a cube with three axes: Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TVs in the U.S.A.
  30. 30. Data Mining Functionalities
       • Tasks can be classified as:
           • Predictive: make predictions about values of data using known results found from different data
           • Descriptive: characterize the properties of a target data set, exploring the properties of the data examined
       • Data mining functionalities are used to specify the kinds of patterns to be found:
           • Characterization and discrimination
           • The mining of frequent patterns, associations, and correlations
           • Classification and regression
           • Cluster analysis
           • Outlier analysis
  31. 31. Characterization and Discrimination
       • Data characterization is a summarization of the general characteristics or features of a target class of data
       • The output of characterization can be presented in various forms:
           • Pie charts
           • Bar charts
           • Curves
           • Multidimensional data cubes
           • Multidimensional tables
       • Descriptions can also be presented as generalized relations or as characteristic rules
       • Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to see these customers by type of employment.
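The characterization example can be sketched as "filter the target class, then summarize along one dimension". The customer records and the function name are hypothetical; only the $5000 threshold and the occupation dimension come from the slide.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"name": "Ann", "occupation": "engineer", "annual_spend": 7200},
    {"name": "Bob", "occupation": "teacher", "annual_spend": 3100},
    {"name": "Cy", "occupation": "engineer", "annual_spend": 5600},
    {"name": "Dee", "occupation": "doctor", "annual_spend": 9000},
]


def characterize(rows, threshold=5000, dimension="occupation"):
    """Summarize the target class (spend above threshold) along one dimension.

    Returns the share of target-class customers for each value of the dimension.
    """
    target = [row for row in rows if row["annual_spend"] > threshold]
    if not target:
        return {}
    counts = Counter(row[dimension] for row in target)
    return {value: n / len(target) for value, n in counts.items()}


profile = characterize(customers)
for occupation, share in sorted(profile.items()):
    print(f"{occupation}: {share:.0%}")
```

The resulting shares are exactly the kind of figures a pie chart or bar chart of the characterization would display.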
  32. 32. Data Discrimination
       • Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
       • The output is presented in forms similar to those of a characterization description
       • A discrimination description expressed in the form of rules is called a set of discrimination rules
       • The target and contrasting classes are specified by the user
       • Example: a user wants to compare the general features of software products whose sales increased by 10% with those of products whose sales decreased by 30% during the same period
  33. 33. Mining Frequent Patterns, Associations, Correlations
       • Frequent patterns:
           • Frequent itemsets (milk, bread)
           • Frequent subsequences (laptop, digital camera, memory card)
           • Frequent substructures (graphs, trees)
       • Mining frequent patterns leads to the discovery of interesting associations and correlations within data
  34. 34. Association Analysis (Example)
       • Items frequently purchased together: buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
       • X is a variable representing a customer
       • A confidence (certainty) of 50% means that a customer who buys a computer has a 50% chance of also buying software
       • A support of 1% means that 1% of all the transactions under analysis contain computer and software together
       • Single-dimensional association rule (one predicate, buys): "computer => software [1%, 50%]"
       • Multidimensional association rule: age(X, "20..29") ∧ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%]
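Support and confidence for a rule such as computer => software can be computed directly from a list of transactions. The five transactions and the function names below are hypothetical, so the resulting figures differ from the 1% and 50% quoted on the slide.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]


def count_containing(itemset, rows):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for row in rows if itemset <= row)


def support(antecedent, consequent, rows):
    """Fraction of all transactions that contain antecedent and consequent together."""
    return count_containing(antecedent | consequent, rows) / len(rows)


def confidence(antecedent, consequent, rows):
    """Fraction of the transactions containing antecedent that also contain consequent."""
    with_antecedent = count_containing(antecedent, rows)
    if with_antecedent == 0:
        return 0.0
    return count_containing(antecedent | consequent, rows) / with_antecedent


rule_support = support({"computer"}, {"software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={rule_support:.0%}, confidence={rule_confidence:.0%}]")
```

Here 2 of the 5 transactions contain both items, and 2 of the 3 computer purchases also include software.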
  35. 35. Classification and Regression for Predictive Analysis
       • Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
       • The model is derived from the analysis of a set of training data
       • Models are represented as:
           • Classification rules (IF-THEN rules)
           • Decision trees
           • Mathematical formulae or neural networks
       • Regression: a statistical methodology for numeric prediction
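A minimal sketch of deriving a classification model from training data: a decision stump, that is, a one-level decision tree equivalent to a single IF-THEN rule. The stump is chosen here for brevity and is not a method named on the slide; the training pairs (income in thousands, whether the customer bought a laptop) and all names are hypothetical.

```python
def fit_stump(samples):
    """Learn the threshold t for the rule: IF value >= t THEN class is True.

    samples is a list of (value, label) pairs with boolean labels. Every
    distinct value is tried as a threshold, and the one with the fewest
    training errors is kept (the lowest such value on ties).
    """
    best_threshold, best_errors = None, len(samples) + 1
    for candidate in sorted({value for value, _ in samples}):
        errors = sum((value >= candidate) != label for value, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical training data: (income in $K, bought a laptop?).
training = [(25, False), (32, False), (41, True), (48, True), (55, True)]
threshold = fit_stump(training)


def predict(income):
    """Apply the learned IF-THEN rule to a new customer."""
    return income >= threshold


print(f"IF income >= {threshold} THEN buys laptop")
print(predict(60), predict(30))
```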
  36. 36. Cluster Analysis and Outlier Analysis
       • Cluster analysis:
           • Determines the similarity among data objects on predefined attributes
           • The most similar data objects are grouped into clusters
       • Outlier analysis:
           • Outliers: objects in the data set that do not comply with the general behavior or model of the data
           • The analysis of outlier data is referred to as outlier analysis or anomaly mining
           • Outliers can be detected using statistical tests
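Both ideas can be sketched on one-dimensional data. The clustering part uses k-means and the outlier part uses a z-score test; both are assumed choices, since the slide names no specific algorithm or test. The data values, the starting centers, and the threshold of 2 standard deviations are hypothetical.

```python
from statistics import mean, pstdev


def kmeans_1d(values, centers, iterations=10):
    """Group values around the nearest center, then move each center to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[1, 12])
print(centers, clusters)

outliers = zscore_outliers([10, 11, 9, 10, 12, 10, 11, 50])
print(outliers)
```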
  37. 37. Which Technologies Are Used?
       Figure: Data Mining at the center, surrounded by Machine Learning, Applications, Algorithm, Pattern Recognition, Statistics, Visualization, Database Technology, and High-Performance Computing.
  38. 38. Potential Applications of Data Mining
       Where there are data, there are data mining applications.
       • Data analysis and decision support (business intelligence)
           • Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
           • Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
           • Fraud detection and detection of unusual patterns (outliers)
       • Other applications
           • Text mining (newsgroups, email, documents) and Web mining
           • Stream data mining
           • Bioinformatics and bio-data analysis
  39. 39. Major Issues in Data Mining (1)
       • Mining methodology
           • Mining various and new kinds of knowledge
           • Mining knowledge in multi-dimensional space
           • Data mining: an interdisciplinary effort
           • Boosting the power of discovery in a networked environment
           • Handling noise, uncertainty, and incompleteness of data
           • Pattern evaluation and pattern- or constraint-guided mining
       • User interaction
           • Interactive mining
           • Incorporation of background knowledge
           • Presentation and visualization of data mining results
  40. 40. Major Issues in Data Mining (2)
       • Efficiency and scalability
           • Efficiency and scalability of data mining algorithms
           • Parallel, distributed, stream, and incremental mining methods
       • Diversity of data types
           • Handling complex types of data
           • Mining dynamic, networked, and global data repositories
       • Data mining and society
           • Social impacts of data mining
           • Privacy-preserving data mining
           • Invisible data mining
