PowerPoint Presentation


  1. Data Mining - Data Understanding/Data Preparation - Michael Möhring, Universität Koblenz-Landau, Institut für Wirtschafts- und Verwaltungsinformatik, Summer 2008
  2. CRISP Reference Model: Data Mining Techniques! Data Mining Summer 2008 © Michael Möhring
  3. Data Understanding
      Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
      Tasks and outputs:
      • Collect Initial Data → Initial Data Collection Report
      • Describe Data → Data Description Report
      • Explore Data → Data Exploration Report
      • Verify Data Quality → Data Quality Report
      Describe data:
      • Amount, variable types
      • Coding schemes
      Explore data:
      • Simple data analysis (e.g. tables, charts, statistical measures)
      • Address data mining goals
      • Help to formulate hypotheses
      • Shape data transformation tasks (→ data preparation)
      Verify data quality:
      • Detect missing data, measurement errors, bad coding, inconsistencies, …
  4. Data View: Input (Instances, Attributes)
      • Instances/Records/Objects (e.g. individuals)
      • Attributes/Fields: set of features describing instances (e.g. income, sex, education level) on a certain "Level of Measurement"
  5. Attribute Types ("Levels of Measurement")
      Qualitative (non-metric): Nominal, Ordinal. Quantitative (metric): Interval, Ratio.
      • Nominal
        • Description: values serve only as labels or names (no relation is implied among nominal values, e.g. ordering, distance measure)
        • Allowed mathematical operations: =, <>
        • Examples: Sex (male, female); Voting decision (e.g. CDU, SPD, …); "Binary" (True/False)
      • Ordinal
        • Description: impose order on values (addition and subtraction don't make sense)
        • Allowed mathematical operations: =, <>, <, <=, >, >=
        • Examples: Temperature (hot, mild, cool); Education (very good, good, …)
      • Interval
        • Description: values not only ordered but measured in fixed and equal distances
        • Allowed mathematical operations: =, <>, <, <=, >, >=, +, -
        • Examples: Temperature (e.g. in Celsius)
      • Ratio
        • Description: values for which the measurement scheme defines a zero point
        • Allowed mathematical operations: all operations
        • Examples: Age, Weight
  6. Data Description (Attribute Types)
  7. Attribute Types (Clementine)
      • Range: numeric values (e.g. range of 0–100, 0.75–1.25) → ratio/interval data type, ordinal data type
      • Flag: data with two distinct values (e.g. Yes/No, 1/2) → nominal-binary data type
      • (Ordered) Set: data with multiple distinct values, each treated as a member of a set (e.g. small/medium/large, 1/2/3/4/5) → nominal data type; Ordered Set: ordinal data type
      • Discrete: string values when the exact number of distinct values is unknown. Once data has been read, the type will be flag, set, or typeless → nominal-binary/nominal data type
      • Typeless: data that does not conform to any of the above types, or set types with too many members (e.g. an account number)
  8. Data Understanding: Univariate -1-
      Characterising single variables by using frequency distributions (e.g. histograms, distributions) and statistical measures (e.g. mode, median, mean, variance, standard deviation, ...)
      → "Level of measurement"!!!
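The dependence of univariate measures on the level of measurement can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `describe`, the sample values and the choice of `median_low` for ordinal data are assumptions.

```python
import statistics
from collections import Counter

def describe(values, level):
    """Return only the summary measures that are meaningful for `level`.

    `level` is one of "nominal", "ordinal", "interval", "ratio".
    """
    summary = {
        "frequencies": dict(Counter(values)),  # valid for every level
        "mode": statistics.mode(values),       # valid for every level
    }
    if level in ("ordinal", "interval", "ratio"):
        # Assumption: ordinal values are coded so that sorting reflects their order.
        summary["median"] = statistics.median_low(values)
    if level in ("interval", "ratio"):
        summary["mean"] = statistics.mean(values)
        summary["variance"] = statistics.pvariance(values)  # population variance
        summary["std_dev"] = statistics.pstdev(values)
    return summary

# Hypothetical sample data for illustration only.
ages = [23, 48, 33, 66, 33]
parties = ["CDU", "SPD", "CDU", "FDP"]

age_summary = describe(ages, "ratio")
party_summary = describe(parties, "nominal")
```

For the nominal variable only the frequency table and the mode are returned, because ordering and arithmetic are not defined on labels.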
  9. Excursion: Univariate Data Analysis -1-
  10. Excursion: Univariate Data Analysis -2-
  11. Dispersion Example: Entropy
      Entropy: deviation of a set of examples X classified into k categories
      $$H(X) = -\sum_{i=1,\; h_i > 0}^{k} h_i \log_2 h_i \qquad (h_i:\ \text{relative frequency of category } i)$$
      • If all objects concentrate on only one category j, the entropy has its minimal value 0 → maximal purity/inequality
        $$H = -h_j \log_2 h_j = -1 \cdot \log_2 1 = 0$$
      • If the objects are distributed uniformly over all categories, the entropy has its maximal value log2 k → minimal purity/inequality
        $$H = -\sum_{i=1}^{k} \frac{1}{k} \log_2 \frac{1}{k} = -k \cdot \frac{1}{k} \log_2 \frac{1}{k} = \log_2 k$$
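The entropy measure and its two extreme cases can be checked with a few lines of Python. This is a sketch under the definition given on the slide (relative frequencies, logarithm to base 2); the function name `entropy` and the sample labels are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of category labels.

    Only categories that actually occur are counted, which matches the
    condition h_i > 0 in the sum.
    """
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

# Hypothetical label sets for illustration only.
pure = ["a"] * 5             # all objects in one category
uniform = ["a", "b", "c", "d"]  # objects spread evenly over k = 4 categories

h_pure = entropy(pure)
h_uniform = entropy(uniform)
```

The pure set yields the minimal value 0, and the uniform set yields the maximal value log2 k for k = 4 categories.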
  12. Excursion: Univariate Data Analysis -3-
  13. Excursion: Univariate Data Analysis -4- (figure labels: <0, =0, >0 and <3, =3, >3)
  14. Data Understanding: Bivariate -1-
      Characterising the relationship between two variables by using frequency distributions
      → cross tables
      → scatterplots
  15. Data Understanding: Bivariate -2-
      Characterising the relationship between two variables by using association/correlation measures (e.g. Lambda, Gamma, Pearson, regression coefficient)
      → "Level of measurement"!!!
      Applicability of each statistic by level of measurement (qualitative: nominal, ordinal; metric: interval, ratio):
      • Phi coefficient: nominal +, ordinal +, interval +, ratio +
      • Cramér's V²: nominal +, ordinal +, interval +, ratio +
      • Lambda: nominal +, ordinal +, interval +, ratio +
      • Kendall's Tau a, b, c: nominal -, ordinal +, interval +, ratio +
      • Gamma: nominal -, ordinal +, interval +, ratio +
      • Somers' d: nominal -, ordinal +, interval +, ratio +
      • Pearson: nominal -, ordinal -, interval +, ratio +
      • Regression coefficient b1: nominal -, ordinal -, interval +, ratio +
      Additional attributes:
      • Direction
      • Symmetric/Asymmetric
      • Codomain: 0 → 1 or -1 → +1
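As an illustration of an association measure that is already applicable to nominal data, the following sketch computes Cramér's V from a cross table using the standard chi-square based definition. The slides do not show this computation; the function name `cramers_v` and the sample data are assumptions. The value V² listed above is the square of the returned value.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V for two nominal variables given as equally long lists.

    Assumes each variable takes at least two distinct values.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell frequencies of the cross table
    row = Counter(xs)             # row totals
    col = Counter(ys)             # column totals
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical data for illustration only.
gender = ["m", "m", "f", "f"]
perfect = ["yes", "yes", "no", "no"]      # fully determined by gender
independent = ["yes", "no", "yes", "no"]  # unrelated to gender

v_perfect = cramers_v(gender, perfect)
v_independent = cramers_v(gender, independent)
```

The two examples sit at the ends of the codomain 0 → 1: no association and perfect association.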
  16. Example: Bivariate, linear Regression
  17. Finding the Best Linear Approximation
      Question: Which linear function ŷ = b0 + b1x best describes the relationship between x and y?
      Solution: Find ŷ = b0 + b1x such that the difference between y and ŷ is as small as possible.
      Task: Estimation of b0 and b1!
      Estimation error (residual): ei = yi - ŷi, with yi = b0 + b1xi + ei
      → Minimising the residual variance: Ordinary Least Squares (OLS) estimation
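  18. Calculating b0
      Partial derivative with respect to b0:
      $$\frac{\partial}{\partial b_0} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$$
      Solution for b0:
      $$b_0 = \bar{y} - b_1 \bar{x} \quad \text{with} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$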
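  19. Calculating b1
      Partial derivative with respect to b1:
      $$\frac{\partial}{\partial b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$
      Inserting $b_0 = \bar{y} - b_1 \bar{x}$:
      $$\sum_{i=1}^{n} x_i (y_i - \bar{y}) - b_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0$$
      Solution for b1:
      $$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$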
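The OLS estimates can be computed directly from the standard closed-form result, b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1·x̄. The sketch below assumes this textbook formulation; the function name `ols_fit` and the sample points are illustrative and not taken from the slides.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross products and sum of squares around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = ols_fit(xs, ys)

# Residuals e_i = y_i - (b0 + b1 * x_i); with an intercept they sum to zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

Because the sample points lie exactly on a line, every residual is zero here; with noisy data only their sum vanishes.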
  20. (Linear) Regression
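  21. Using Standardised Variables X and Y
      Standardisation means replacing
      $$x_i \rightarrow x_i^* = \frac{x_i - \bar{x}}{s_x}, \qquad y_i \rightarrow y_i^* = \frac{y_i - \bar{y}}{s_y}$$
      This means that
      $$\bar{x}^* = \bar{y}^* = 0, \quad s_{x^*} = s_{y^*} = 1, \quad b_0 = 0, \quad b_1 = r_{xy} \ \text{(Pearson correlation coefficient)}$$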
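The effect of standardisation on a simple linear regression can be verified numerically: after z-standardising both variables, the fitted intercept is 0 and the slope equals the Pearson correlation coefficient. This is a sketch of that standard result; the helper names and the credit amounts are hypothetical, and the population standard deviation is an assumed choice.

```python
import math
import statistics

def standardise(values):
    """Replace each value by (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]

def ols_fit(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
ages = [23, 48, 33, 66]
amounts = [1200.0, 3100.0, 2500.0, 2900.0]

b0_std, b1_std = ols_fit(standardise(ages), standardise(amounts))
r = pearson_r(ages, amounts)
```

The standardised slope is therefore unit-free and can be compared across variables measured on different scales.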
  22. Linear Regression in Clementine
      credit amount = 2982.684 + 8.118 · age
      credit amount* = 0.033 · age*
  23. CRISP Reference Model: Supported by Data Warehouses! Data Mining Techniques!
  24. Data Preparation/Data Cleaning (Source: www.kdnuggets.de)
      "... data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire knowledge discovery process" (TwoCrows 1999, p. 23)
      → Ensuring data quality is the most important and most time-consuming task in data mining processes
  25. Data Preparation
      Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
      Outputs of the phase: Data Set, Data Set Description
      Tasks and outputs:
      • Select Data → Rationale for Inclusion/Exclusion
      • Clean Data → Data Cleaning Report
      • Construct Data → Derived Attributes, Generated Records
      • Integrate Data → Merged Data
      • Format Data → Reformatted Data
      Activities:
      • Selecting objects (records) and variables (fields) → data mining goal
      • Including/excluding incomplete records (e.g. missing values, correcting data errors)
      • Deriving new variables (i.e. new important facts from existing variables)
      • Adding new records (e.g. purchase data to customer data) coming from different data sources
      • Aggregating, sorting, splitting records
  26. Data Preparation/Transformation -2-
      • Unifying different field names (variable names)
          ID   Date     …      ID   Datum    …
          371  9/24/03         825  6/30/03
          433  7/13/03         736  2/22/03
      • Unifying different field contents (variable values)
        • e.g.: Date
          ID   Date     …      ID   Date           …      ID   Date      …
          371  9/24/03         526  Sep 24, 2003          825  24.09.03
          433  7/13/03         212  July 13, 2003         736  13.07.03
        • e.g.: Gender
          ID   Gender   …      ID   Gender   …      ID   Gender   …
          371  m               825  male            526  1
          433  f               736  female          212  0
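Unifying field contents from several sources can be sketched as a pair of small normalisation functions. This is an illustration only: the list of accepted date formats, the target codes "m"/"f", and the reading of 1 as male and 0 as female are assumptions inferred from the example tables, not rules stated on the slide. Month names are parsed with Python's default (English) locale.

```python
from datetime import date, datetime

# Assumed source formats: US style with slashes, written month names, German style with dots.
DATE_FORMATS = ("%m/%d/%y", "%b %d, %Y", "%B %d, %Y", "%d.%m.%y")

# Assumed mapping of source codes to one unified coding scheme.
GENDER_CODES = {"m": "m", "male": "m", "1": "m",
                "f": "f", "female": "f", "0": "f"}

def unify_date(text):
    """Parse a date string in any of the assumed source formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognised date format: {text!r}")

def unify_gender(code):
    """Map a source-specific gender code to the unified code 'm' or 'f'."""
    return GENDER_CODES[str(code).strip().lower()]

# Values taken from the three example sources.
dates = [unify_date(s) for s in ("9/24/03", "Sep 24, 2003", "24.09.03")]
genders = [unify_gender(c) for c in ("m", "male", 1)]
```

The separator decides which format applies, so a slash-separated date is always read as month/day/year and a dot-separated date as day.month.year.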
  27. Data Preparation/Transformation -3-
      • Creating/Adapting of aggregation levels
      • Discretization of variables (e.g. Age)
          ID   Age  …      ID   Age  AgeGroup  …
          371  23          371  23   1
          433  48          433  48   4
          825  33          825  33   2
          736  66          736  66   6
        AgeGroup coding: 1: 18-25, 2: 26-35, 3: 36-45, 4: 46-55, 5: 56-65, 6: > 65
      • Classification of transactions: the brand fields Jever, Becks and Veltins are summarised in a new field Beer
        • ID 371: two brand entries (1, 1) → Beer = 2
        • ID 433: one brand entry (1) → Beer = 1
        • ID 825: one brand entry (1) → Beer = 1
        • ID 736: no brand entry
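The age discretisation and the aggregation of brand fields can be sketched as follows. The group boundaries come from the coding on the slide; the function names, the rejection of ages below 18, and the assignment of purchases to particular brands are assumptions made for illustration.

```python
from bisect import bisect_left

# Upper bounds of groups 1 to 5; everything above 65 falls into group 6.
UPPER_BOUNDS = [25, 35, 45, 55, 65]

def age_group(age):
    """Map an age to its AgeGroup code (1: 18-25, ..., 6: > 65)."""
    if age < 18:
        # Assumption: the coding scheme does not cover ages below 18.
        raise ValueError(f"age {age} is outside the coding scheme")
    return bisect_left(UPPER_BOUNDS, age) + 1

def beer_total(brand_counts):
    """Aggregate the brand-level fields into one product-category field."""
    return sum(brand_counts.values())

groups = {customer: age_group(age)
          for customer, age in {371: 23, 433: 48, 825: 33, 736: 66}.items()}

# Hypothetical brand assignment; the slide only shows that customer 371 has two entries.
beer_371 = beer_total({"Jever": 1, "Becks": 1, "Veltins": 0})
```

Using the sorted upper bounds with `bisect_left` keeps the boundary values (25, 35, ...) in the lower group, as the coding requires.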
  28. Treating Missing Values
  29. Data Preparation Tasks (e.g.) -3-
      • Standardize different "missing value" codes
          ID   Age  Gender  Date        Credit card  …
          371  23   m       23.03.2003  j
          433  999  w       28.05.2003  n
          825  33   w       16.10.2003  .
          736  66   NA      04.11.2003  j
      • Unifying format
      • Treating data sets with "missing values":
        • Omitting the field with missing values
        • Omitting the record with missing values
        • Using the missing value as an independent datum
        • Filling in missing values with default values (e.g. the average of all valid values in the corresponding variable)
      • Preparing/Converting variable (data) types (e.g. string → nominal, nominal → numeric)
        → Data mining methods often require certain variable types
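Two of the listed treatments, omitting incomplete records and filling in the average of the valid values, can be sketched in Python once the different missing-value codes have been mapped to a single marker. The set of codes (999, ".", "NA") is read off the example table; the function names and the use of `None` as the unified marker are assumptions.

```python
# Assumed missing-value codes, taken from the example table.
MISSING_CODES = {"999", ".", "NA", ""}

def normalise(value):
    """Map every known missing-value code to None; keep other values."""
    return None if str(value).strip() in MISSING_CODES else value

def fill_with_mean(values):
    """Replace None by the mean of all valid values."""
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid)
    return [mean if v is None else v for v in values]

records = [
    {"id": 371, "age": 23, "gender": "m", "credit_card": "j"},
    {"id": 433, "age": 999, "gender": "w", "credit_card": "n"},
    {"id": 825, "age": 33, "gender": "w", "credit_card": "."},
    {"id": 736, "age": 66, "gender": "NA", "credit_card": "j"},
]

# Step 1: standardise the codes (a real pipeline would define them per field).
cleaned = [{key: normalise(value) for key, value in r.items()} for r in records]

# Option A: omit every record that contains a missing value.
complete = [r for r in cleaned if None not in r.values()]

# Option B: fill the missing age with the average of the valid ages.
ages_filled = fill_with_mean([r["age"] for r in cleaned])
```

Option A leaves only one of the four example records, which shows how quickly record-wise omission can shrink a data set.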
  30. Data Preparation/Recoding -4-
      • Correction of wrong field contents (e.g. during data collection, integrating different data sources)
      • Recoding of variables
      • Aggregating by summarizing values with small frequencies (e.g. new category: "> 8000")
  31. Constructing New Variables → Derive (e.g.)
      B_GEBDATUM (string) → New variable: AGE (ordinal)
      "1973-01-17" → 32
      → Clementine: 2005 - to_integer(substring(1,4,B_GEBDATUM))
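The same derivation can be written in Python for comparison. This is a sketch that mirrors the expression above (reference year minus the first four characters of the date string); the function name `derive_age` and the `reference_year` parameter are illustrative assumptions.

```python
def derive_age(birth_date, reference_year=2005):
    """Derive an age in years from a date string of the form YYYY-MM-DD.

    Like the expression on the slide, this only uses the year and ignores
    whether the birthday has already passed in the reference year.
    """
    return reference_year - int(birth_date[:4])

age = derive_age("1973-01-17")
```

An exact age would additionally compare month and day with a reference date.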
  32. Constructing New Variables → Binning (e.g.)
  33. Use of Data Mining Software (Source: www.kdnuggets.de)
  34. Statistical/Data Mining Software: SPSS (http://www.spss.com)
      • Standardized data analysis system with a large number of tested and numerically stable algorithms
      • Classical, menu-based user interface
      • Isolated definition of single process steps
      • Poor transparency of the data mining process (… → data understanding → data preparation → modelling → …)
      • Difficult integration of new approaches (e.g. machine learning algorithms)
      • No parallel application of different algorithms on the same data set
  35. SPSS – User Interface
  36. Data Mining Software: Clementine (http://www.spss.com/clementine)
      • Visual approach to data mining: each process in Clementine is represented by an icon, or node, that a user connects to form a stream representing the flow of data through a variety of processes
      • Data understanding and data preparation facilities
        • Visualization and basic statistics methods (e.g. frequencies, plots)
        • Record-based function nodes (e.g. merge)
        • Field (variable) based function nodes (e.g. derive)
      • Advanced data analysis technology, e.g.
        • Clustering/Segmentation
        • Rule Induction/Association
        • Regression
        • Classification
        • ...
  37. Clementine – User Interface
      • Stream Canvas: the main area used to work in Clementine, containing a graphical representation of data mining tasks
      • Data/Modeling Tools: groups of nodes
      • Output/Object Manager: displays all streams, data mining outputs, and generated models of a current session (e.g. save, add to a current project)
      • Projects Tool: to create and manage projects
        → Clementine-Classes view
        → CRISP-DM view
  38. Source Nodes (→ Data Understanding)
  39. Graph Nodes (→ Data Understanding/Evaluation)
  40. Output Nodes (→ Data Understanding/Evaluation)
  41. Output Nodes → e.g. Data Audit Node
  42. Record Nodes (→ Data Preparation)
  43. Field Nodes (→ Data Preparation)
  44. Modeling Nodes (→ Modeling)
  45. Parallel Application of Different Models
  46. CLEM - SuperNodes
      CLEM: Clementine Language for Expression Manipulation, for reasoning about and manipulating the data that flows along Clementine streams. CLEM is used within nodes to
      • compare and evaluate conditions on record fields
      • derive values for new fields
      • derive new values for existing fields
      • reason about the sequence of records
      • insert data from records into reports
      SuperNodes: grouping of multiple nodes into a single node by encapsulating sections of a data stream. This provides numerous benefits to the data miner:
      • Streams are neater and more manageable
      • Nodes can be combined into a business-specific SuperNode
      • SuperNodes can be exported to libraries for reuse in multiple data mining projects
      Types of SuperNodes:
      • Source SuperNodes (i.e. data must flow from the SuperNode)
      • Manipulation SuperNodes (i.e. data flows in and out)
      • Terminal SuperNodes (i.e. data must flow into the SuperNode)
  47. Data Set (Credit Scoring): Credit Data → 1000 bank customers → 22 variables (from: Michie et al. 1994, p. 153)
  48. Data Set (Credit Scoring) - 1 -: Credit Data → 1000 bank customers → 22 variables (from: Michie et al. 1994, p. 153)
  49. Data Set (Credit Scoring) - 2 -
  50. Data Set (Credit Scoring) - 3 -
