WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes Presentation Transcript

  • 1. Data Mining Input: Concepts, Instances, and Attributes
  • 2. Input takes the following forms:
    • Concept: The thing that is to be learned is called the concept. A concept should be:
    • 3. Intelligible, in that it can be understood
    • 4. Operational, in that it can be applied to actual examples
    • 5. Instances: The data consists of various instances of the class. E.g. the table below consists of 2 instances
    • 6. Attributes: Each instance of the class has various attributes. E.g. the table below consists of two attributes {Name, Age}
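The table the slide refers to is not reproduced in this transcript. As a minimal sketch of the idea, the following assumes a hypothetical table with the attributes {Name, Age} and two invented rows; the names and ages are placeholders, not data from the original slides.

```python
# Hypothetical data: the original table is not in the transcript,
# so these two rows are invented purely for illustration.
attributes = ["Name", "Age"]
instances = [
    {"Name": "Alice", "Age": 34},
    {"Name": "Bob", "Age": 27},
]


def describe(instances, attributes):
    """Return (number of instances, number of attributes)."""
    # Each instance must supply a value for exactly the declared attributes.
    for row in instances:
        if set(row) != set(attributes):
            raise ValueError(f"instance {row} does not match {attributes}")
    return len(instances), len(attributes)


print(describe(instances, attributes))
```

Each dictionary is one instance, and each key is one attribute.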
  • Types of learning in data mining
    • Classification learning:
    • 7. The learning scheme is presented with a set of classified examples, from which it is expected to learn a way of classifying unseen examples
    • 8. Also called supervised learning
    • 9. E.g. Classification rules for the weather forecasting problem
    If outlook = sunny and humidity = high then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
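The three rules above can be written as a small function. This is only a sketch: the function name `play` and the choice to return `None` when no rule fires are assumptions, since the slides do not say what happens to uncovered cases.

```python
def play(outlook, humidity, windy):
    """Apply the three weather rules in order; return 'yes', 'no', or None."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    # Assumption: the slides give no default, so uncovered cases stay undecided.
    return None


print(play("sunny", "high", False))
print(play("overcast", "normal", True))
```

An unseen example is classified by passing its attribute values to the function.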
  • 10.
    • Numeric prediction
    • 11. Same as classification learning, but the outcome to be predicted is a numeric quantity rather than a discrete class
    • 12. Clustering
    • 13. Groups of examples that belong together are sought and gathered into clusters
    • 14. E.g. based on the data with a bank the following relation between debt and income was seen:
    • Association rules
    • 15. Any association among features is sought, not just ones that predict a particular class value
    • 16. It predicts any attribute, not just the class
    • 17. It can predict more than one attribute value at a time
    • 18. E.g. from the following supermarket data it can be concluded: If milk and bread are bought, customers also buy butter
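The supermarket data itself is not part of this transcript. The sketch below uses five invented baskets to show how the rule "milk and bread → butter" could be checked. The measures `support` and `confidence` are the usual names for these ratios, but the slides do not mention them, so treat them as an added assumption.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    # Baskets that contain every item of the antecedent.
    matching = [b for b in baskets if antecedent <= b]
    # Of those, the baskets that also contain the consequent.
    both = [b for b in matching if consequent <= b]
    support = len(both) / len(baskets)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


support, confidence = rule_stats(baskets, {"milk", "bread"}, {"butter"})
print(support, confidence)
```

Because the consequent is a set, the same function handles rules that predict more than one attribute value at a time.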
  • Few important terms…
    • Concept description: Output produced by a learning scheme
    • 19. Flat file: Each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file
    • 20. Closed world assumption: The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption
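As a rough illustration of the closed world assumption, the sketch below lists only the positive "sister of" pairs and labels every other ordered pair as negative. The people and pairs are hypothetical.

```python
from itertools import permutations

# Hypothetical people and positive examples (second person is a sister of the first).
people = ["Ann", "Beth", "Carl"]
positive = {("Ann", "Beth"), ("Beth", "Ann"), ("Carl", "Ann"), ("Carl", "Beth")}


def label_all_pairs(people, positive):
    """Label every ordered pair; pairs not listed as positive become 'no'."""
    return {
        pair: ("yes" if pair in positive else "no")
        for pair in permutations(people, 2)
    }


labels = label_all_pairs(people, positive)
for pair, label in sorted(labels.items()):
    print(pair, label)
```

Only the four positive pairs had to be written down; the two negative ones follow from the assumption.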
  • Steps to prepare data
    1. Data assembly and aggregation
    2. Data integration
    3. Data cleaning
    4. General preparation
  • 21. Data assembly and aggregation
    • Instances in the input should be independent
    • 22. Independence can be achieved by de-normalization
    • 23. In database terms, take two relations and join them together to make one, a process of flattening that is technically called de-normalization
    • 24. This is possible with any finite set of finite relations
  • Input is a family tree
  • 25. We are trying to find the ‘Sister of’ relationship
    Each row of the tree is mapped to an instance:
    We can't make sense of this with respect to our requirement or concept. Therefore...
  • 26. We de-normalize these tables to get:
    Here we can clearly see the ‘Sister of’ relationship
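The family-tree tables from the slides are not included in this transcript. The sketch below shows the flattening step under assumed data: a hypothetical `people` relation (name, gender, parents) is joined with a hypothetical `sister_of` relation, so that each row describes both people of a pair and can stand alone as an instance.

```python
# Hypothetical relations; all names and values are invented for illustration.
people = {
    "Ann":  {"gender": "female", "parent1": "Paul", "parent2": "Mary"},
    "Beth": {"gender": "female", "parent1": "Paul", "parent2": "Mary"},
    "Carl": {"gender": "male",   "parent1": "Paul", "parent2": "Mary"},
    "Dora": {"gender": "female", "parent1": "Ian",  "parent2": "Kim"},
}
sister_of = {("Ann", "Beth"), ("Beth", "Ann"), ("Carl", "Ann"), ("Carl", "Beth")}


def denormalize(people, sister_of):
    """Join the two relations into one flat table, one row per ordered pair."""
    rows = []
    for first, p1 in people.items():
        for second, p2 in people.items():
            if first == second:
                continue
            row = {"first": first, "second": second}
            # Copy each person's attributes into the row with a prefix.
            row.update({f"first_{k}": v for k, v in p1.items()})
            row.update({f"second_{k}": v for k, v in p2.items()})
            row["sister_of"] = "yes" if (first, second) in sister_of else "no"
            rows.append(row)
    return rows


def rule(row):
    """A rule that is visible only in the flat table."""
    same_parents = (row["first_parent1"] == row["second_parent1"]
                    and row["first_parent2"] == row["second_parent2"])
    return "yes" if row["second_gender"] == "female" and same_parents else "no"


rows = denormalize(people, sister_of)
print(len(rows), all(rule(r) == r["sister_of"] for r in rows))
```

With four people the flat table already has twelve rows, which hints at how quickly such tables grow.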
  • 27. Problems with de-normalization:
    If relationships among a large number of items are required, the tables become huge
    It produces apparent regularities in the data that are completely spurious
    Relations might not be finite (in that case, use inductive logic programming)
    Overlay data: Sometimes data relevant to the problem at hand needs to be collected from outside the organization. This is called overlay data.
  • 28. Data Integration
    Integration of system-wide databases is difficult because different departments will use or have:
    Different styles of record keeping
    Different conventions
    Different degrees of data aggregation
    Different types of errors
    Different time periods
    Different primary keys
    These issues are addressed by building a company-wide database, a process known as data warehousing
  • 29. Data Cleaning
    Data cleaning is the careful checking of data
    It helps in resolving many architectural issues with different databases
    Data cleaning usually requires good domain knowledge
  • 30. Attribute-Relation File Format (ARFF)
    Definition: An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes
    Conventions used in ARFF :
    ARFF Header
    Lines beginning with % are comments
    To declare relation: @relation <name of relation>
    To declare attribute: @attribute <attribute> <data type>
    ARFF Data Section
    To start the actual data: @data, followed by comma-separated values, one instance per row
  • 31. Data type for ARFF:
    Numeric attributes can be real or integer numbers
    Nominal values are defined by providing a <nominal-specification> listing the possible values: {nm-value1, nm-value2, …}, e.g. {yes, no}
    Values that contain spaces must be quoted
    String attributes allow us to create attributes containing arbitrary textual values
    Date type is used as: @attribute <name> date [<date-format>]
    The default date format is the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss"
    Missing values are represented by ?
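To make the conventions concrete, here is a small hypothetical ARFF file together with a deliberately simplified reader. The relation name, attributes, and rows are invented. The reader is a sketch and not WEKA's own loader: it assumes unquoted values with no embedded commas or spaces, and it keeps every value as text.

```python
# Hypothetical ARFF content, written only to illustrate the conventions.
ARFF_TEXT = """\
% Hypothetical weather data for illustration
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, ?, yes
rainy, 65, yes
"""


def parse_arff(text):
    """Return (relation, [(attribute, data type)], rows) from simple ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # '?' marks a missing value.
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, dtype = line.split(None, 2)
            attributes.append((name, dtype))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, rows)
```

The header lines are read until `@data` is seen; after that, every non-comment line is treated as one instance.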
  • 32. Sparse ARFF files
    Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented
    Same header as ARFF but different data section.
    Instead of representing each value in order, like this:
    0, X, 0, Y, "class A"
    The non-zero attributes are explicitly identified by attribute number (starting from zero) and their value stated, like this:
    {1 X, 3 Y, 4 "class A"}
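The conversion from a dense row to a sparse row can be sketched as follows. The helper name `to_sparse_row` is hypothetical, and the quoting rule is simplified to "quote any value that contains a space".

```python
def to_sparse_row(values):
    """Convert a dense list of values into a sparse ARFF data line."""
    parts = []
    for index, value in enumerate(values):
        if value == 0:
            continue  # zero values are left out of the sparse form
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # values containing spaces must be quoted
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


print(to_sparse_row([0, "X", 0, "Y", "class A"]))
```

Attribute numbers start from zero, so the second value `X` is written with index 1.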
  • 33. Visit more self help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
    The tutorials section is free, self-guiding and will not involve any additional support.
    Visit us at www.dataminingtools.net