Machine Learning and Data Mining: 03 Data Representation


    Machine Learning and Data Mining: 03 Data Representation - Presentation Transcript

    1. The representation of data
        Machine Learning and Data Mining
    2. Lecture outline
        • Components of the input
            • Concepts, instances, and attributes
        • Attribute types
            • Nominal
            • Ordinal
            • Interval
            • Ratio
        • Missing values and inaccurate values
        Prof. Pier Luca Lanzi
    3. Components of the input
        • Concepts: kinds of things that can be learned
        • Instances: the individual, independent examples of a concept
        • Attributes: measuring aspects of an instance
    4. What’s in an example?
        • Instance: specific type of example
            • Thing to be classified, associated, or clustered
            • Individual, independent example of target concept
            • Characterized by a predetermined set of attributes
        • Input to learning scheme: set of instances/dataset
            • Represented as a single relation/flat file
            • Rather restricted form of input
            • No relationships between objects
            • Most common form in practical data mining
    5. What’s in an attribute?
        • Each instance is described by a fixed, predefined set of features, its “attributes”
        • But the number of attributes may vary in practice
            • Possible solution: “irrelevant value” flag
        • Related problem: the existence of an attribute may depend on the value of another one
        • Possible attribute types (“levels of measurement”): nominal, ordinal, interval, and ratio
    6. The contact lenses data

        Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
        Young           Myope                   No           Reduced               None
        Young           Myope                   No           Normal                Soft
        Young           Myope                   Yes          Reduced               None
        Young           Myope                   Yes          Normal                Hard
        Young           Hypermetrope            No           Reduced               None
        Young           Hypermetrope            No           Normal                Soft
        Young           Hypermetrope            Yes          Reduced               None
        Young           Hypermetrope            Yes          Normal                Hard
        Pre-presbyopic  Myope                   No           Reduced               None
        Pre-presbyopic  Myope                   No           Normal                Soft
        Pre-presbyopic  Myope                   Yes          Reduced               None
        Pre-presbyopic  Myope                   Yes          Normal                Hard
        Pre-presbyopic  Hypermetrope            No           Reduced               None
        Pre-presbyopic  Hypermetrope            No           Normal                Soft
        Pre-presbyopic  Hypermetrope            Yes          Reduced               None
        Pre-presbyopic  Hypermetrope            Yes          Normal                None
        Presbyopic      Myope                   No           Reduced               None
        Presbyopic      Myope                   No           Normal                None
        Presbyopic      Myope                   Yes          Reduced               None
        Presbyopic      Myope                   Yes          Normal                Hard
        Presbyopic      Hypermetrope            No           Reduced               None
        Presbyopic      Hypermetrope            No           Normal                Soft
        Presbyopic      Hypermetrope            Yes          Reduced               None
        Presbyopic      Hypermetrope            Yes          Normal                None
    7. The weather problem

        With nominal temperature and humidity:

        Outlook   Temperature  Humidity  Windy  Play
        Sunny     Hot          High      False  No
        Sunny     Hot          High      True   No
        Overcast  Hot          High      False  Yes
        Rainy     Mild         Normal    False  Yes
        …         …            …         …      …

        With numeric temperature and humidity:

        Outlook   Temperature  Humidity  Windy  Play
        Sunny     85           85        False  No
        Sunny     80           90        True   No
        Overcast  83           86        False  Yes
        Rainy     75           80        False  Yes
        …         …            …         …      …
    8. Nominal quantities
        • Values are distinct symbols
            • Values themselves serve only as labels or names
        • Example
            • Attribute “outlook” from weather data
            • Values: “sunny”, “overcast”, and “rainy”
        • No relation is implied among nominal values
            • No ordering
            • No distance measure
        • Only equality tests can be performed
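
As a minimal sketch of the "only equality tests" point, the snippet below models the "outlook" attribute as a closed set of labels. The names `OUTLOOK_VALUES` and `same_outlook` are hypothetical and not part of the lecture.

```python
# Hypothetical representation of the nominal attribute "outlook".
# The values are labels only: no ordering and no distance is defined on them.
OUTLOOK_VALUES = frozenset({"sunny", "overcast", "rainy"})


def same_outlook(a, b):
    """Equality test, the only comparison that is meaningful for nominal values."""
    for value in (a, b):
        if value not in OUTLOOK_VALUES:
            raise ValueError(f"unknown outlook value: {value!r}")
    return a == b


print(same_outlook("sunny", "sunny"))     # True
print(same_outlook("sunny", "overcast"))  # False
```

Python will happily evaluate `"sunny" > "rainy"` by comparing strings alphabetically, but that result says nothing about the weather, which is why a learning scheme has to be told that the attribute is nominal.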
    9. Ordinal quantities
        • Impose order on values
        • No distance between values defined
        • Example: the attribute “temperature” in weather data
            • Values: “hot” > “mild” > “cool”
        • Addition and subtraction don’t make sense
        • Example rule: temperature < hot ⇒ play = yes
        • Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
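
One way to sketch an ordinal attribute is an enumeration that supports comparison but not arithmetic. The class `Temperature` and the function `play_rule` below are illustrative names, and the integer ranks are an assumption used only to fix the order.

```python
from enum import Enum
from functools import total_ordering


@total_ordering
class Temperature(Enum):
    """Ordinal attribute: the ranks define an order, not a distance."""
    COOL = 0
    MILD = 1
    HOT = 2

    def __lt__(self, other):
        if not isinstance(other, Temperature):
            return NotImplemented
        return self.value < other.value


def play_rule(temperature):
    """Example rule from the slide: temperature < hot => play = yes.

    The rule says nothing about the remaining case, so None is returned there.
    """
    return "yes" if temperature < Temperature.HOT else None


print(Temperature.COOL < Temperature.MILD < Temperature.HOT)  # True
print(play_rule(Temperature.MILD))                            # yes
```

Because `Temperature` is a plain `Enum` rather than an `IntEnum`, an expression such as `Temperature.HOT + Temperature.MILD` raises `TypeError`, which matches the point that addition and subtraction are not defined for ordinal values.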
    10. Interval quantities
        • Interval quantities are not only ordered but measured in fixed and equal units
        • Examples
            • Attribute “temperature” expressed in degrees
            • Attribute “year”
        • Difference of two values makes sense
        • Sum or product doesn’t make sense
            • Zero point is not defined
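
The following sketch illustrates why differences are meaningful on an interval scale while ratios are not. It assumes temperatures given in degrees Fahrenheit and converts them to Celsius; the specific numbers are made up for illustration.

```python
import math


def fahrenheit_to_celsius(f):
    """Standard unit conversion between two interval scales."""
    return (f - 32) * 5 / 9


# Equal differences stay equal after a change of scale.
diff_a_f = 85 - 80
diff_b_f = 75 - 70
diff_a_c = fahrenheit_to_celsius(85) - fahrenheit_to_celsius(80)
diff_b_c = fahrenheit_to_celsius(75) - fahrenheit_to_celsius(70)
print(diff_a_f == diff_b_f, math.isclose(diff_a_c, diff_b_c))

# Ratios depend on where the arbitrary zero point sits.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print(ratio_f, ratio_c)  # the two ratios disagree
```

"Twice as hot" in Fahrenheit becomes roughly "six times as hot" in Celsius for the same pair of readings, so the ratio carries no information, whereas the comparison of differences survives the conversion.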
    11. Ratio quantities
        • Ratio quantities are ones for which the measurement scheme defines a zero point
        • Example
            • Attribute “distance”
            • Distance between an object and itself is zero
        • Ratio quantities are treated as real numbers
            • All mathematical operations are allowed
        • Is there an “inherently” defined zero point?
            • It depends on scientific knowledge
            • E.g. Fahrenheit knew no lower limit to temperature
    12. Attribute types used in practice
        • Most schemes accommodate just two levels of measurement: nominal and ordinal
        • Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
            • But: “enumerated” and “discrete” imply order
        • Special case: dichotomy (“boolean” attribute)
        • Ordinal attributes are called “numeric” or “continuous”
            • But: “continuous” implies mathematical continuity
    13. Why specify attribute types?
        • Why do machine learning algorithms need to know the attribute type?
            • To be able to make the right comparisons and learn correct concepts
        • Example
            • Outlook > “sunny” does not make sense, while Temperature > “cool” or Humidity > 70 does
        • Additional uses of attribute type
            • Check for valid values
            • Deal with missing values, etc.
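
A small sketch of how declared attribute types can drive both validity checks and the choice of allowed comparisons. The `SCHEMA` layout and the helper names are assumptions made for this example, not something the lecture prescribes.

```python
# Hypothetical schema for three weather attributes.
SCHEMA = {
    "outlook": {"kind": "nominal", "values": ["sunny", "overcast", "rainy"]},
    # Ordinal values are listed from lowest to highest.
    "temperature": {"kind": "ordinal", "values": ["cool", "mild", "hot"]},
    "humidity": {"kind": "numeric"},
}


def is_valid(attribute, value):
    """Check a value against the declared type of its attribute."""
    spec = SCHEMA[attribute]
    if spec["kind"] == "numeric":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return value in spec["values"]


def greater_than(attribute, a, b):
    """Evaluate a > b only when the attribute type makes the test meaningful."""
    spec = SCHEMA[attribute]
    if not (is_valid(attribute, a) and is_valid(attribute, b)):
        raise ValueError(f"invalid value for {attribute!r}")
    if spec["kind"] == "nominal":
        raise TypeError(f"{attribute!r} is nominal: only equality tests apply")
    if spec["kind"] == "ordinal":
        order = spec["values"]
        return order.index(a) > order.index(b)
    return a > b


print(greater_than("temperature", "mild", "cool"))  # True
print(greater_than("humidity", 85, 70))             # True
print(is_valid("outlook", "snowy"))                 # False
```

Calling `greater_than("outlook", "rainy", "sunny")` raises `TypeError`, mirroring the slide's observation that Outlook > “sunny” does not make sense.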
    14. Nominal vs. ordinal
        • Attribute “age” nominal
            If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
            If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
        • Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)
            If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
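
The two formulations can be compared directly in code. In this sketch the function names and the lowercase string encoding of the values are assumptions; the rules themselves are the ones stated on the slide.

```python
from itertools import product

# Order assumed for the ordinal reading of "age".
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]


def soft_nominal(age, astigmatic, tear_rate):
    """Two rules are needed when age can only be tested for equality."""
    rule_1 = age == "young" and astigmatic == "no" and tear_rate == "normal"
    rule_2 = age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal"
    return rule_1 or rule_2


def soft_ordinal(age, astigmatic, tear_rate):
    """One rule suffices once age <= pre-presbyopic is a legal test."""
    age_ok = AGE_ORDER.index(age) <= AGE_ORDER.index("pre-presbyopic")
    return age_ok and astigmatic == "no" and tear_rate == "normal"


# The two formulations agree on every combination of the three attributes.
combinations = product(AGE_ORDER, ["no", "yes"], ["reduced", "normal"])
print(all(soft_nominal(*c) == soft_ordinal(*c) for c in combinations))  # True
```

Treating "age" as ordinal does not change what is predicted here; it only lets the learner express the same concept with a single, more compact rule.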
    15. Missing values
        • Frequently indicated by out-of-range entries
            • Types: unknown, unrecorded, irrelevant
            • Reasons:
                • Malfunctioning equipment
                • Changes in experimental design
                • Collation of different datasets
                • Measurement not possible
        • Missing value may have significance in itself
            • E.g. missing test in a medical examination
            • Most schemes assume that is not the case
            • “missing” may need to be coded as additional value
        • Does absence of value have some significance?
            • If it does, “missing” is a separate value
            • If it does not, “missing” must be treated in a special way
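
The two options at the end of the slide can be sketched as follows. The sentinel `-1`, the helper names, and the sample readings are all hypothetical; skipping missing entries when averaging is just one possible "special treatment".

```python
from statistics import mean

SENTINEL = -1  # hypothetical out-of-range entry used to mark a missing reading


def decode(raw):
    """Turn the out-of-range marker into an explicit missing value (None)."""
    return None if raw == SENTINEL else raw


def as_separate_value(value):
    """Option 1: absence is informative, so 'missing' becomes its own value."""
    return "missing" if value is None else value


def mean_of_known(values):
    """Option 2: absence is not informative, so missing entries are skipped."""
    known = [v for v in values if v is not None]
    return mean(known) if known else None


humidity = [decode(r) for r in [85, -1, 86, 80]]
print(humidity)                 # [85, None, 86, 80]
print(mean_of_known(humidity))  # average of the three known readings

tear_rate = ["normal", None]
print([as_separate_value(v) for v in tear_rate])  # ['normal', 'missing']
```

Leaving the sentinel in place would silently pull the average down, which is why the out-of-range entry is decoded before any computation.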
    16. Inaccurate values
        • What is the reason? The data was not collected with mining in mind
        • What is the result? Errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
        • Typographical errors in nominal attributes, thus values need to be checked for consistency
        • Typographical and measurement errors in numeric attributes, thus outliers need to be identified
        • Errors may be deliberate (e.g. wrong zip codes)
        • Other problems: duplicates, stale data
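
The slide does not say how to check consistency or identify outliers, so the sketch below picks two common, simple techniques as assumptions: a lookup against the allowed nominal values (with a spelling suggestion from `difflib`) and the interquartile-range rule for numeric values. All data shown is invented for illustration.

```python
import difflib
from statistics import quantiles

ALLOWED_OUTLOOK = ["overcast", "rainy", "sunny"]


def inconsistent_nominal(values, allowed):
    """Map each value outside the allowed set to its closest allowed spelling."""
    problems = {}
    for value in values:
        if value not in allowed:
            match = difflib.get_close_matches(value.lower(), allowed, n=1)
            problems[value] = match[0] if match else None
    return problems


def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


print(inconsistent_nominal(["sunny", "suny", "Rainy"], ALLOWED_OUTLOOK))
print(iqr_outliers([65, 70, 70, 75, 80, 85, 86, 90, 96, 850]))
```

Both checks only flag candidates. Whether "suny" really meant "sunny", or whether 850 is a typing error for 85, is a decision for someone who knows the data.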
    17. Getting to know the data
        • Simple visualization tools are very useful
            • Nominal attributes: histograms
            • Numeric attributes: graphs
        • 2-D and 3-D plots show dependencies
        • Need to consult domain experts
        • When there is too much data to inspect, take a sample!
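
As a rough illustration of a histogram for a nominal attribute and of sampling, the snippet below counts the "Recommended lenses" column of the contact lenses data and draws a small random sample. The helper names and the text-based rendering are assumptions; a plotting library would serve the same purpose.

```python
import random
from collections import Counter

# "Recommended lenses" column of the contact lenses data, in table order.
recommended = [
    "None", "Soft", "None", "Hard", "None", "Soft", "None", "Hard",
    "None", "Soft", "None", "Hard", "None", "Soft", "None", "None",
    "None", "None", "None", "Hard", "None", "Soft", "None", "None",
]


def text_histogram(values):
    """Return one line per distinct value, most frequent first."""
    counts = Counter(values)
    width = max(len(str(v)) for v in counts)
    return [f"{value:<{width}}  {'#' * n} ({n})" for value, n in counts.most_common()]


def take_sample(items, k, seed=None):
    """Draw up to k items without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


for line in text_histogram(recommended):
    print(line)

print(take_sample(recommended, 5, seed=0))
```

Even this crude histogram shows that "None" dominates the class attribute, which is the kind of observation worth having before any learning scheme is run.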

    Pier Luca Lanzi, 3 years ago


    Course "Machine Learning and Data Mining" for the d…


    © All Rights Reserved
