Its all about data mining


Its all about data mining, a brief introduction to the world of data mining.



    Its all about data mining: Presentation Transcript

    • Data Mining: A Brief Presentation on Data Mining, by Jason Rodrigues
      • Introduction
      • What is Data Mining?
      • Challenges and Trends
      • The origin of DM
      • Data Mining Tasks
      • Types of Data
      • Data Quality
      • Takeaways
      Agenda
    • What is Data Mining?
      • Definition:
        • Data mining is the process of extracting patterns from data. (Wikipedia)
      • Process of semi-automatically analyzing large databases to find patterns that are:
        • valid: hold on new data with some certainty
        • novel: non-obvious to the system
        • useful: should be possible to act on the item
        • understandable: humans should be able to interpret the pattern
      • Other Definitions:
        • Non-trivial extraction of implicit, previously unknown and potentially useful information from data
        • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
      • Extracting or “mining” knowledge from large amounts of data
      • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
      • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
      What is Data Mining? (Other Definitions)
    • What is Data Mining?
      • What is Data Mining?
        • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
        • Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
      • What is not Data Mining?
        • Look up phone number in phone directory
        • Query a Web search engine for information about “Amazon”
    • What is Data Mining?
      • Moore's Law:
        • The law is named after Intel co-founder Gordon E. Moore, who described the trend in his 1965 paper. The paper noted that the number of components in integrated circuits had doubled every year from the invention of the integrated circuit in 1958 until 1965, and predicted that the trend would continue "for at least ten years".
        • The number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. The trend has continued for more than half a century and is not expected to stop until 2015 or later.
      What is Data Mining?
      • Data Analysis?
      • Data Warehouse?
      • Data Mining and Statistics
        • Standard Deviation
      • Data Mining and AI and Machine Learning
        • Identify Possible Heart Attack cases
      What is Data Mining? History (timeline: 1980s, 1990s, 2000s, 2010s)
      • Commercial Viewpoint
      Data Mining Challenges and Trends
      • Scientific Viewpoint
      Data Mining Challenges and Trends
    • Data Mining Challenges and Trends
      • Computationally expensive to investigate all possibilities
      • Dealing with noise/missing information and errors in data
      • Choosing appropriate attributes/input representation
      • Finding the minimal attribute space
      • Finding adequate evaluation function(s)
      • Extracting meaningful information
      • Not overfitting
    • Data Mining Challenges and Trends [figure: role of software (passive, interactive, proactive) plotted against business insight (presentation, exploration, discovery), with stages labeled canned reporting, ad-hoc reporting, online analytical processing, data mining, and predictive analysis]
      • Prediction Methods
        • Use some variables to predict unknown or future values of other variables.
      • Description Methods
        • Find human-interpretable patterns that describe the data.
      Data Mining Tasks
      • Classification [Predictive]
      • Clustering [Descriptive]
      • Association Rule Discovery [Descriptive]
      • Sequential Pattern Discovery [Descriptive]
      • Regression [Predictive]
      • Deviation Detection [Predictive]
      Data Mining Tasks
    • Data Mining Tasks
      • Concept/Class description : Characterization and discrimination
        • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
      • Association (correlation and causality)
        • Multi-dimensional or single-dimensional association
        • Example: age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”)
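
A rule like the one above can be checked against data by counting how often its two sides hold. The following is a minimal sketch, assuming hypothetical customer records and the usual support and confidence measures (the slides do not name them): support is the fraction of all records where both sides hold, and confidence is the fraction of records matching the left side that also match the right side.

```python
# Hypothetical toy records; field names and values are illustrative only.
customers = [
    {"age": "20-29", "income": "60-90K", "buys": {"TV"}},
    {"age": "20-29", "income": "60-90K", "buys": {"TV", "phone"}},
    {"age": "20-29", "income": "60-90K", "buys": {"phone"}},
    {"age": "30-39", "income": "60-90K", "buys": {"TV"}},
    {"age": "20-29", "income": "30-60K", "buys": set()},
]


def antecedent(c):
    """Left side of the rule: age(X, "20-29") ^ income(X, "60-90K")."""
    return c["age"] == "20-29" and c["income"] == "60-90K"


def consequent(c):
    """Right side of the rule: buys(X, "TV")."""
    return "TV" in c["buys"]


def rule_support_confidence(records):
    """Return (support, confidence) for antecedent -> consequent."""
    n_antecedent = sum(1 for c in records if antecedent(c))
    n_both = sum(1 for c in records if antecedent(c) and consequent(c))
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_support_confidence(customers)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy data, three customers match the left side and two of them bought a TV, so the rule holds for two of five records overall.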
    • Data Mining Tasks
      • Classification and Prediction
        • Finding models (functions) that describe and distinguish classes or concepts for future prediction
        • Example: classify countries based on climate, or classify cars based on gas mileage
        • Presentation:
          • IF-THEN rules, decision tree, classification rule, neural network
        • Prediction: Predict some unknown or missing numerical values
      • Cluster analysis
        • Class label is unknown: group data to form new classes
          • Example: cluster houses to find distribution patterns
        • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
      Data Mining Tasks
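
To make the clustering principle concrete, here is a minimal sketch using k-means on one numeric attribute. K-means is one possible algorithm chosen for illustration; the slides do not name one. The house prices and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then
    move each center to the mean of the values assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers)
print(clusters)
```

The two groups that come out are the "new classes": prices inside a group are close to each other (high intra-class similarity) and far from the other group (low inter-class similarity).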
    • Data Mining Tasks
      • Outlier analysis
        • Outlier: a data object that does not comply with the general behavior of the data
        • Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis
      • Trend and evolution analysis
        • Trend and deviation: regression analysis
        • Sequential pattern mining, periodicity analysis
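
Trend and deviation analysis by regression can be sketched as fitting a straight line to past values and measuring how far a new value falls from it. This is a minimal sketch using ordinary least squares; the monthly sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)       # the trend
predicted_month_6 = slope * 6 + intercept        # what the trend expects
observed_month_6 = 30.0                          # hypothetical new observation
deviation = observed_month_6 - predicted_month_6
print(slope, intercept, predicted_month_6, deviation)
```

A large deviation from the fitted trend is a candidate for the outlier analysis described above.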
    • Takeaways: What is Data Mining?; Challenges and Trends; The origin of DM; Data Mining Tasks
      • Collection of data objects and their attributes
      • An attribute is a property or characteristic of an object
        • Examples: eye color of a person, temperature, etc.
        • Attribute is also known as variable, field, characteristic, or feature
      • A collection of attributes describe an object
        • Object is also known as record, point, case, sample, entity, or instance
      What is Data? (figure labels: Attributes, Objects)
      • Attribute values are numbers or symbols assigned to an attribute
      • Distinction between attributes and attribute values
        • Same attribute can be mapped to different attribute values
          • Example: height can be measured in feet or meters
        • Different attributes can be mapped to the same set of values
          • Example: Attribute values for ID and age are integers
          • But properties of attribute values can be different
            • ID has no limit but age has a maximum and minimum value
      Attribute Values
      • There are different types of attributes
        • Nominal
          • Examples: ID numbers, eye color, zip codes
        • Ordinal
          • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
        • Interval
          • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
        • Ratio
          • Examples: temperature in Kelvin, length, time, counts
      Types of Attribute
      • The type of an attribute depends on which of the following properties it possesses:
        • Distinctness: = ≠
        • Order: < >
        • Addition: + -
        • Multiplication: * /
        • Nominal attribute: distinctness
        • Ordinal attribute: distinctness & order
        • Interval attribute: distinctness, order & addition
        • Ratio attribute: all 4 properties
      Properties of Attribute Values
    • Attribute types, with description, examples, and operations:
      • Nominal
        • Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
        • Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
        • Operations: mode, entropy, contingency correlation, χ² test
      • Ordinal
        • Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
        • Examples: hardness of minerals, {good, better, best}, grades, street numbers
        • Operations: median, percentiles, rank correlation, run tests, sign tests
      • Interval
        • Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
        • Examples: calendar dates, temperature in Celsius or Fahrenheit
        • Operations: mean, standard deviation, Pearson's correlation, t and F tests
      • Ratio
        • Description: For ratio variables, both differences and ratios are meaningful. (*, /)
        • Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
        • Operations: geometric mean, harmonic mean, percent variation
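
The operations listed for each attribute type can be tried out with Python's standard `statistics` module. This is a minimal sketch with hypothetical sample values; it picks one operation per type and is not a complete mapping.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grade_rank = [1, 2, 2, 3, 5]                      # ordinal (rank codes)
temp_celsius = [10.0, 20.0, 30.0]                 # interval
length_m = [1.0, 2.0, 4.0]                        # ratio

most_common_color = statistics.mode(eye_color)            # nominal: mode
median_rank = statistics.median(grade_rank)               # ordinal: median
mean_temp = statistics.mean(temp_celsius)                 # interval: mean
stdev_temp = statistics.stdev(temp_celsius)               # interval: standard deviation
geo_mean_length = statistics.geometric_mean(length_m)     # ratio: geometric mean

# Ratios are not meaningful on an interval scale: 20 °C is not "twice as
# hot" as 10 °C. On the Kelvin (ratio) scale the same two temperatures
# differ by only a few percent.
kelvin_ratio = (20.0 + 273.15) / (10.0 + 273.15)

print(most_common_color, median_rank, mean_temp, stdev_temp)
print(geo_mean_length, kelvin_ratio)
```

Each type also supports the operations of the types before it, so a ratio attribute such as length has a valid mode, median, and mean as well.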
      • Discrete Attribute
        • Has only a finite or countably infinite set of values
        • Examples: zip codes, counts, or the set of words in a collection of documents
        • Often represented as integer variables.
        • Note: binary attributes are a special case of discrete attributes
      • Continuous Attribute
        • Has real numbers as attribute values
        • Examples: temperature, height, or weight.
        • Practically, real values can only be measured and represented using a finite number of digits.
        • Continuous attributes are typically represented as floating-point variables.
      Discrete and Continuous Attributes
      • Record
        • Data Matrix
        • Document Data
        • Transaction Data
      • Graph
        • World Wide Web
        • Molecular Structures
      • Ordered
        • Spatial Data
        • Temporal Data
        • Sequential Data
        • Genetic Sequence Data
      Types of Data Sets
      • Data that consists of a collection of records, each of which consists of a fixed set of attributes
      Record Sets
      • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
      • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
      Data Matrix
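
As a minimal sketch of the m by n representation, the following builds a data matrix from records that share the same numeric attributes and then treats two rows as points in space. The attribute names and values are hypothetical.

```python
import math

# Hypothetical objects that share the same fixed set of numeric attributes.
attributes = ["height_cm", "weight_kg", "age_years"]
objects = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age_years": 30.0},
    {"height_cm": 180.0, "weight_kg": 80.0, "age_years": 45.0},
]

# One row per object, one column per attribute.
matrix = [[obj[a] for a in attributes] for obj in objects]
m, n = len(matrix), len(matrix[0])

# Because each row is a point in n-dimensional space, geometric notions
# such as Euclidean distance apply directly.
distance = math.dist(matrix[0], matrix[1])
print(m, n, distance)
```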
      • Each document becomes a “term” vector,
        • each term is a component (attribute) of the vector,
        • the value of each component is the number of times the corresponding term occurs in the document.
      Document Data
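
The term-vector idea can be sketched with `collections.Counter`. The two documents are hypothetical, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter

# Hypothetical documents.
documents = [
    "data mining finds patterns in data",
    "mining large data sets",
]

# The vocabulary fixes the order of the vector components (one per term).
vocabulary = sorted({word for doc in documents for word in doc.split()})


def term_vector(text, vocabulary):
    """Count how many times each vocabulary term occurs in the text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


vectors = [term_vector(doc, vocabulary) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Stacking the vectors gives a document-term matrix, which is a data matrix in the sense described earlier.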
      • A special type of record data, where
        • each record (transaction) involves a set of items.
        • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
      Transaction Data
      • A Generic Graph
      Graph Data
      • Benzene Molecule: C₆H₆
      Chemical Data
      • Sequences of transactions
      Ordered Data (figure labels: an element of the sequence; items/events)
      • Genomic Sequences
      Ordered Data
      • Spatio-Temporal Data
      Ordered Data (figure: average monthly temperature of land and ocean)
      • What kinds of data quality problems?
      • How can we detect problems with the data?
      • What can we do about these problems?
      • Examples of data quality problems:
        • Noise and outliers
        • Missing values
        • Duplicate data
      Data Quality
      • Noise refers to modification of original values
        • Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
      Noise (figure panels: Two Sine Waves; Two Sine Waves + Noise)
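
The two figure panels can be approximated in code by summing two sine waves and then adding random noise to every sample. This is a minimal sketch; the frequencies, noise level, and sample count are assumptions, since the slide does not give them.

```python
import math
import random


def two_sine_waves(t):
    """Sum of two sine waves; the 1 Hz and 3 Hz frequencies are assumed."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)


random.seed(0)  # fixed seed so the run is repeatable
times = [i / 100 for i in range(100)]

clean = [two_sine_waves(t) for t in times]
# Noise modifies the original values: here, Gaussian noise with an assumed
# standard deviation of 0.5 is added to each sample.
noisy = [y + random.gauss(0.0, 0.5) for y in clean]

# Root-mean-square difference between the original and the noisy signal.
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, noisy)) / len(clean))
print(rms_error)
```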
      • Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
      Outliers
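
One simple way to flag such objects is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. This is a minimal sketch of that technique, not a method the slides prescribe; the readings and the threshold of 2.0 are assumptions.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = find_outliers(readings)
print(outliers)
```

Note that an extreme value inflates the mean and standard deviation it is compared against, so on small samples this test can miss outliers that a median-based measure would catch.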
      • Reasons for missing values
        • Information is not collected (e.g., people decline to give their age and weight)
        • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
      • Handling missing values
        • Eliminate Data Objects
        • Estimate Missing Values
        • Ignore the Missing Value During Analysis
        • Replace with all possible values (weighted by their probabilities)
      Missing Values
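
Three of the handling strategies above can be sketched on a small record set: eliminating incomplete objects, estimating a missing value, and ignoring missing values during analysis. The records are hypothetical, and filling with the mean of the observed values is only one possible estimate.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 30, "weight": 70.0},
    {"name": "B", "age": None, "weight": 80.0},
    {"name": "C", "age": 40, "weight": None},
]

# 1. Eliminate data objects that have any missing value.
complete = [r for r in records if all(v is not None for v in r.values())]


# 2. Estimate missing values, here with the mean of the observed values.
def impute_mean(records, field):
    observed = [r[field] for r in records if r[field] is not None]
    fill = sum(observed) / len(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


imputed = impute_mean(records, "age")

# 3. Ignore the missing value during analysis: use only what was observed.
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)

print(complete)
print(imputed)
print(mean_age)
```

Elimination is simple but discards two of the three records here, which shows why estimation or ignoring is often preferred when many objects have gaps.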
      • Data set may include data objects that are duplicates, or almost duplicates of one another
        • Major issue when merging data from heterogeneous sources
      • Examples:
        • Same person with multiple email addresses
      • Data cleaning
        • Process of dealing with duplicate data issues
      Duplicate Data
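
The "same person with multiple email addresses" case can be sketched as merging records that share a matching key. Using a normalized name plus phone number as the key is an assumption made for illustration; real data cleaning usually needs fuzzier matching. The contacts are hypothetical.

```python
def normalize(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


# Hypothetical contacts merged from two sources.
contacts = [
    {"name": "Jane Doe", "phone": "555-0100", "email": "jane@example.com"},
    {"name": "jane  doe", "phone": "555-0100", "email": "jdoe@example.org"},
    {"name": "John Roe", "phone": "555-0199", "email": "john@example.com"},
]


def merge_duplicates(contacts):
    """Merge contacts with the same (normalized name, phone) key and
    collect all of their email addresses on one record."""
    merged = {}
    for c in contacts:
        key = (normalize(c["name"]), c["phone"])
        entry = merged.setdefault(
            key, {"name": c["name"], "phone": c["phone"], "emails": []}
        )
        if c["email"] not in entry["emails"]:
            entry["emails"].append(c["email"])
    return list(merged.values())


people = merge_duplicates(contacts)
for person in people:
    print(person)
```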
      • 1. Data that is not used cannot be correct for very long
      • 2. Data Quality in an information system is a function of its use, not its collection
      • 3. Data quality will ultimately be no better than its most stringent use
      • 4. Data quality problems tend to become worse with the age of the system
      • 5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change
      • 6. Information overload affects data quality
      Six Rules of Data Quality
    • Takeaways: What is Data Mining?; Challenges and Trends; The origin of DM; Data Mining Tasks; Data Types; Data Quality