WEKA: Data Mining Input Concepts Instances And Attributes


Published in: Technology

  1. 1. Data Mining Input: Concepts, Instances, and Attributes
  2. 2. Input takes the following forms:
      Concept: The thing to be learned is called the concept. A concept should be:
  3. 3. Intelligible, in that it can be understood
  4. 4. Operational, in that it can be applied to actual examples
  5. 5. Instances: The data consists of individual instances of the class. E.g. the table below consists of 2 instances
  6. 6. Attributes: Each instance of the class has various attributes. E.g. the table below consists of two attributes {Name, Age}
      Types of learning in data mining
      Classification learning:
  7. 7. The learning scheme is presented with a set of classified examples, from which it is expected to learn a way of classifying unseen examples
  8. 8. Also called supervised learning
  9. 9. E.g. classification rules for the weather forecasting problem:
      If outlook = sunny and humidity = high then play = no
      If outlook = rainy and windy = true then play = no
      If outlook = overcast then play = yes
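The three weather rules above can be read as an ordered rule list. The following is a minimal Python sketch of that reading, not WEKA's own representation. The function name `play_decision` and the choice to return `None` when no rule fires are assumptions made for illustration.

```python
def play_decision(outlook, humidity, windy):
    """Apply the three weather rules in order; return "yes", "no", or None."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    return None  # assumption: the rules listed do not cover this case


print(play_decision("sunny", "high", False))       # no
print(play_decision("overcast", "normal", True))   # yes
print(play_decision("sunny", "normal", False))     # None
```

The last call shows that these three rules alone leave some unseen examples unclassified, which is why a learned rule set normally needs more rules or a default class.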
  10. 10. Numeric prediction
  11. 11. Same as classification learning, but the outcome to be predicted is not a discrete class but a numeric quantity
  12. 12. Clustering
  13. 13. Groups of examples that belong together are sought and collected into a cluster
  14. 14. E.g. based on the data held by a bank, the following relation between debt and income was seen:
      Association rules
  15. 15. Any association among features is sought, not just ones that predict a particular class value
  16. 16. Association rules can predict any attribute, not just the class
  17. 17. They can predict more than one attribute value at a time
  18. 18. E.g. from the following supermarket data it can be concluded: if milk and bread are bought, customers also buy butter
      A few important terms
      Concept description: The output produced by a learning scheme
  19. 19. Flat file: Each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file
  20. 20. Closed world assumption: The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption
      Steps to prepare data
      1. Data assembly and aggregation
      2. Data integration
      3. Data cleaning
      4. General preparation
  21. 21. Data assembly and aggregation
      Instances in the input should be independent
  22. 22. Independence can be achieved by de-normalization
  23. 23. In database terms, take two relations and join them together to make one, a process of flattening that is technically called de-normalization
  24. 24. This is possible with any finite set of finite relations
      Input is a family tree
  25. 25. We are trying to find the ‘Sister of’ relationship
      Each row of the tree is mapped to an instance:
      We can't make sense of this with respect to our requirement or concept. Therefore…
  26. 26. We de-normalize these tables to get:
      Here we can clearly see the ‘Sister of’ relationship
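The slide's tables are not part of this transcript, so the following Python sketch uses a small hypothetical family tree to illustrate the idea. The names, the field names, and the `denormalize` helper are all assumptions. The one-row-per-person relation is joined with itself so that each flat row describes a pair of people, and the ‘Sister of’ value can be read directly from a single row.

```python
# Hypothetical family-tree relation: one row per person (not the slide's data).
people = [
    {"name": "Steven", "gender": "male",   "parent1": "Peter", "parent2": "Peggy"},
    {"name": "Pam",    "gender": "female", "parent1": "Peter", "parent2": "Peggy"},
    {"name": "Ian",    "gender": "male",   "parent1": "Grace", "parent2": "Ray"},
]


def denormalize(rows):
    """Join the relation with itself: one flat row per ordered pair of distinct people."""
    flat = []
    for first in rows:
        for second in rows:
            if first["name"] == second["name"]:
                continue
            same_parents = (first["parent1"], first["parent2"]) == (
                second["parent1"], second["parent2"])
            flat.append({
                "first": first["name"],
                "second": second["name"],
                "second_gender": second["gender"],
                # The second person is a sister of the first if she is female
                # and both people have the same parents.
                "sister_of": "yes" if same_parents and second["gender"] == "female" else "no",
            })
    return flat


pairs = denormalize(people)
sisters = [(p["first"], p["second"]) for p in pairs if p["sister_of"] == "yes"]
print(len(pairs))  # 6
print(sisters)     # [('Steven', 'Pam')]
```

Three people already produce six flat rows, and the count grows roughly with the square of the number of people.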
  27. 27. Problems with de-normalization:
      If relationships between a large number of items are required, the tables will be huge
      It can produce apparent regularities in the data that are completely spurious
      Relations might not be finite (in that case, use inductive logic programming)
      Overlay data: Sometimes data relevant to the problem at hand needs to be collected from outside the organization. This is called overlay data.
  28. 28. Data Integration
      Integration of system-wide databases is difficult because different departments will use or have:
      Different styles of record keeping
      Different conventions
      Different degrees of data aggregation
      Different types of errors
      Different time periods
      Different primary keys
      These issues are addressed by the idea of a company-wide database, a practice known as data warehousing
  29. 29. Data Cleaning
      Data cleaning is the careful checking of data
      It helps in resolving many architectural issues with different databases
      Data cleaning usually requires good domain knowledge
  30. 30. Attribute-Relation File Format (ARFF)
      Definition: An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes
      Conventions used in ARFF:
      ARFF Header
      Lines beginning with % are comments
      To declare the relation: @relation <name of relation>
      To declare an attribute: @attribute <attribute> <data type>
      ARFF Data Section
      To start the actual data: @data, followed by comma-separated values, one instance per row
  31. 31. Data types for ARFF:
      Numeric attributes can be real or integer numbers
      Nominal values are defined by providing a <nominal-specification> listing the possible values: {nm-value1, nm-value2, …}, e.g. {yes, no}
      Values that contain spaces must be quoted
      String attributes allow us to create attributes containing arbitrary textual values
      The date type is used as: @attribute <name> date [<date-format>]
      The default date format is the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss"
      Missing values are represented by ?
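To make these conventions concrete, here is a minimal Python sketch that holds a small ARFF document in a string and reads it back. The weather values in `ARFF_TEXT` are hypothetical, and `parse_arff` is an illustrative helper written for this example, not part of WEKA. It assumes unquoted values without embedded commas, and it leaves every data value as a string apart from mapping `?` to `None`.

```python
ARFF_TEXT = """\
% Hypothetical weather data, used only to illustrate the format
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, ?, yes
"""


def parse_arff(text):
    """Return (relation name, [(attribute, type)], data rows) for simple ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])  # ? marks missing
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, dtype = line.split(None, 2)
            if dtype.startswith("{"):  # nominal specification: list the allowed values
                dtype = [v.strip() for v in dtype.strip("{}").split(",")]
            attributes.append((name, dtype))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)       # weather
print(attributes[0])  # ('outlook', ['sunny', 'overcast', 'rainy'])
print(rows[1])        # ['overcast', None, 'yes']
```

A real reader would also need to handle quoted values, string and date attributes, and type conversion, all of which this sketch leaves out.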
  32. 32. Sparse ARFF files
      Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented
      Same header as ARFF but a different data section.
      Instead of representing each value in order, like this:
      @data
      0, X, 0, Y, "class A"
      the non-zero attributes are explicitly identified by attribute number (starting from zero) and their value is stated, like this:
      @data
      {1 X, 3 Y, 4 "class A"}
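The dense-to-sparse conversion can be sketched in a few lines of Python. The helper name `to_sparse` is an assumption made for this example. It drops entries equal to 0, prefixes each remaining value with its zero-based attribute index, and quotes values that contain a space.

```python
def to_sparse(row):
    """Format one dense data row as a sparse ARFF data line."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:  # zero values are left implicit in sparse ARFF
            continue
        text = str(value)
        if " " in text:  # values containing spaces must be quoted
            text = '"' + text + '"'
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


print(to_sparse([0, "X", 0, "Y", "class A"]))  # {1 X, 3 Y, 4 "class A"}
```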