Chapter 1 : Introduction to KDD

Uploaded as Microsoft PowerPoint. Usage rights: © All Rights Reserved.


Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Chapter 1 : Introduction to KDD Presentation Transcript

  • 1. Chapter 1 : Introduction to KDD
  • 2. What is Knowledge Acquisition?
    • Also known as: data mining, knowledge discovery, knowledge extraction, information discovery, information harvesting, etc.
    • The process of discovering useful information, hidden patterns, or rules in large quantities of data (non-trivial, previously unknown)
    • By automatic or semiautomatic means
    • Finding such patterns by manual methods is effectively impossible at this scale.
  • 3. Why Knowledge Acquisition?
  • 4. Why Knowledge Acquisition?
    • Why?
      • Data explosion (tremendous amount of data available)
      • Data is being warehoused
      • Computing power
      • Competitive pressure
    Hard disks nowadays have capacities of more than 100 GB
  • 5. Is Data Mining Appropriate for My Problem?
    • Four general questions to consider
      • Can we clearly define the problem?
      • Does potentially meaningful data exist?
      • Does the data contain hidden knowledge, or is the data factual and useful for reporting purposes only?
      • Will the cost of processing the data be less than the likely increase in profit gained by applying any knowledge obtained from the data mining project?
  • 6. Traditional Approaches
    • Traditional database queries: access a database using a well-defined query language such as SQL
    • The query output consists of data from the database
    • The output is usually a subset of the database
    [Diagram: SQL, DBMS, DB]
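As a minimal sketch of a traditional query, the snippet below uses Python's built-in `sqlite3` module. The `customers` table, its columns, and its rows are hypothetical and exist only for illustration. It shows that the query output is simply a subset of data already stored in the database.

```python
import sqlite3

# Hypothetical in-memory database; the table and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, annual_income INTEGER, owns_home INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ali", 32000, 1), ("Siti", 27000, 1), ("Chen", 52000, 0)],
)

# A well-defined query: the result is a subset of the rows already stored.
rows = conn.execute(
    "SELECT name, annual_income FROM customers "
    "WHERE annual_income >= ? ORDER BY name",
    (30000,),
).fetchall()
print(rows)
conn.close()
```

The query returns only facts that were stored explicitly; it does not uncover any pattern that was not already written into the table.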
  • 7. Data Mining or Data Query
    • Four general types of knowledge can be defined to help us determine when data mining is appropriate:
      • Shallow Knowledge
      • Multidimensional Knowledge
      • Hidden Knowledge
      • Deep Knowledge
  • 8. Shallow Knowledge
    • Factual in nature
    • Can be easily stored and manipulated in a database
    • Database query languages such as SQL are excellent tools for extracting shallow knowledge from data
  • 9. Multidimensional Knowledge
    • Also factual
    • Data are stored in a multidimensional format
    • On-line Analytical Processing (OLAP) tools are used on multidimensional data
  • 10. Hidden Knowledge
    • Patterns or regularities in data that cannot be easily found using a database query language such as SQL
    • Data mining algorithms can find such patterns with ease.
  • 11. Deep Knowledge
    • Knowledge stored in a database that can only be found if we are given some direction about what we are looking for.
    • Current data mining tools are not able to locate deep knowledge.
  • 12. What can computers learn?
    • Four levels of learning can be differentiated (Merrill & Tennyson, 1977):
      • Facts: simple statements of truth
      • Concepts: sets of objects, symbols, or events grouped together because they share certain characteristics
      • Procedures : step by step course of action to achieve a goal.
      • Principles : highest level of learning. General truth or laws that are basic to other truths.
  • 13. What can computers learn?
    • Computers are good at learning 'concepts'.
    • Concepts are the output of a data mining session.
    • There are three (3) common concept views: a. Classical view b. Probabilistic view c. Exemplar view
  • 14. Three Concept Views
    • a. Classical View:
    • Definite defining properties
    • These properties determine if an individual item is an example of a particular concept.
    • Crisp and leaves no room for misinterpretation.
    • Example: Good Credit Rating
    IF Annual Income >= 30,000 & Years at Current Position >= 5 & Owns Home = True THEN Good Credit Risk = True
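The rule above can be written as a short function. This is only a sketch of the classical view; the function and parameter names are hypothetical, and the thresholds are the ones stated in the rule.

```python
def is_good_credit_risk(annual_income: float, years_at_position: int, owns_home: bool) -> bool:
    """Classical view: every defining property must hold; there is no partial membership."""
    return bool(annual_income >= 30_000 and years_at_position >= 5 and owns_home)


# An applicant who satisfies all three properties, and one who misses two of them.
meets_all = is_good_credit_risk(32_000, 6, True)
misses_two = is_good_credit_risk(27_000, 4, True)
print(meets_all, misses_two)
```

Because the definition is crisp, an applicant who narrowly misses a single threshold is rejected outright, which is the limitation the probabilistic and exemplar views address.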
  • 15. Three Concept Views
    • b. Probabilistic View:
    • Concepts are represented by properties that are probable of concept members.
    • The assumption is that people store and recall concepts as generalizations created from observations of individual instances.
    • Cannot be applied directly to reach an answer, but can be used to support the decision-making process.
    • Associates a probability of membership with a specific classification.
  • 16. Three Concept Views
    • b. Probabilistic View:
    • Example: Good Credit Rating
      • The mean annual income for individuals who consistently make loan payments on time is $30,000.
      • Most individuals who are good credit risks have been working for the same company for at least five years.
      • The majority of good credit risks own their own home.
    A homeowner with an annual income of $27,000, employed at the same position for 4 years, might be classified as a good credit risk with a probability of 0.85.
  • 17. Three Concept Views
    • c. Exemplar View:
    • A given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.
    • The assumption is that people store and recall likely concept exemplars that are then used to classify new instances.
    • Can associate a probability of concept membership with each classification.
  • 18. Three Concept Views
    • c. Exemplar View:
    • Example:
    Exemplar #1: Annual Income = 32,000; Number of years at current position = 6; Homeowner
    Exemplar #2: Annual Income = 52,000; Number of years at current position = 16; Renter
    Exemplar #3: Annual Income = 28,000; Number of years at current position = 12; Homeowner
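A minimal sketch of exemplar-based classification using the three exemplars above. The slides do not say how similarity is measured, so the distance function, its scaling constants, and the threshold are all assumptions made for illustration.

```python
# The three exemplars from the slide:
# (annual income, years at current position, owns home).
EXEMPLARS = [
    (32_000, 6, True),
    (52_000, 16, False),
    (28_000, 12, True),
]


def distance(a, b):
    """Hypothetical dissimilarity: scaled gaps plus a penalty for a home-ownership mismatch."""
    income_gap = abs(a[0] - b[0]) / 10_000  # assumed scale: 10,000 income units
    years_gap = abs(a[1] - b[1]) / 5        # assumed scale: 5 years
    home_gap = 0.0 if a[2] == b[2] else 1.0
    return income_gap + years_gap + home_gap


def is_similar_enough(instance, threshold=1.0):
    """The instance is an example of the concept if it is close to at least one exemplar."""
    return min(distance(instance, e) for e in EXEMPLARS) <= threshold


near = is_similar_enough((27_000, 4, True))    # close to the first exemplar
far = is_similar_enough((60_000, 1, False))    # not close to any exemplar
print(near, far)
```

A different distance or threshold would change which instances are accepted, so these choices would need to be tuned for a real data set.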
  • 19. What can be mined?
  • 20. Concepts that can be mined?
    • a. Classes :
    • Stored data is used to locate data in predetermined groups.
    • Eg: A restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
  • 21. Concepts that can be mined?
    • b. Clusters :
    • Data items are grouped by logical relationships.
    • Eg: Data can be mined to identify market segments or customer affinities.
  • 22. Concepts that can be mined?
    • c. Associations :
    • Data can be mined to identify associations.
    • Eg: The beer-diaper example is typical of associative mining.
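To make the beer-diaper example concrete, the sketch below computes support and confidence, the standard measures for association rules (the slides do not define them). The transactions are invented for illustration.

```python
# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


pair_support = support({"beer", "diapers"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(pair_support, rule_confidence)
```

In this toy data, beer and diapers appear together in two of five transactions, and two of the three diaper purchases also include beer.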
  • 23. Concepts that can be mined?
    • d. Sequential patterns:
    • Data is mined to anticipate behavior patterns and trends.
    • Eg: An outdoor equipment retailer could predict the likelihood of a backpack purchase based on sales of sleeping bags or hiking shoes.
  • 24. Multidisciplinary
    [Diagram: KDD and Data Mining draw on Databases, Statistics, Pattern Recognition, Machine Learning, AI, and Neurocomputing]
  • 25. Disciplines of Data Mining
    [Diagram: Data Mining surrounded by Information Retrieval, Algorithm, Machine Learning, Visualization, Statistics, and Database System]
  • 26. Data Mining Model & Task
    • Predictive
      • Classification
      • Regression
      • Time Series Analysis
      • Prediction
    • Descriptive
      • Clustering
      • Summarization
      • Association Rules
      • Sequence Discovery
  • 27. Predictive Model
    • Makes predictions about values of data using known results found from different data
    • Or based on the use of other historical data
    • Examples: credit card fraud, breast cancer early warning, terrorist acts, tsunamis, etc.
  • 28. Predictive Model
    • Perform inference on the current data to make predictions.
    • We know what to predict (based on historical data)
    • Never 100% accurate
    • Concentrates on the input-output relationship (x, f(x))
    • Typical questions
      • Which customers are likely to buy this product in the next four months?
      • What kinds of transactions are likely to be fraudulent?
      • Who is likely to drop this paper?
  • 29. Predictive Model
    [Figure: profit (RM) plotted against months; current data points (x) and a future data point (o) whose value is unknown (?)]
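One simple way to estimate the unknown future point in the figure is to fit a straight line to the current data and extrapolate. The sketch below uses ordinary least squares; the monthly profit figures are invented, and a linear trend is an assumption, not something the slides state.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical current data: profit (RM thousands) for months 1 to 5.
months = [1, 2, 3, 4, 5]
profit = [10.0, 12.5, 13.5, 16.0, 18.0]

slope, intercept = fit_line(months, profit)
forecast_month_6 = slope * 6 + intercept  # the unknown future point
print(slope, intercept, forecast_month_6)
```

The forecast is only as good as the assumption that the historical relationship continues to hold, which is why a predictive model is never 100% accurate.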
  • 30. Descriptive Model
    • Identifies patterns or relationships in data.
    • Serves as a way to explore the properties of the data examined, not to predict new properties
    • Always requires a domain expert
    • Examples:
      • Segmenting marketing area
      • Profiling student performances
  • 31. Descriptive Model
    • Discovering new patterns inside the data
    • We may not have any idea what the data looks like
    • Explores the properties of the data examined
    • Patterns at various granularities (e.g., Student: university -> faculty -> program -> major)
    • Typical questions
      • What is the data?
      • What does it look like?
      • What does the data suggest for advertising to groups of customers?
  • 32. Descriptive Model
    [Figure: results plotted against major; the points (x, o, y) fall into three clusters labelled Group 1, Group 2, and Group 3]
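The three groups in the figure are the kind of structure a clustering algorithm can uncover without being told the groups in advance. Below is a minimal one-dimensional k-means sketch; the data points and the starting centroids are assumptions chosen for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Basic k-means on 1-D data: assign each point to its nearest centroid, then update."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical results that visibly fall into three groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
labels, centroids = kmeans_1d(points, centroids=[0.0, 4.0, 10.0])
print(labels, centroids)
```

The algorithm only reports which points belong together; deciding what each group means is left to a domain expert.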
  • 33. View Of DM
    • Data To Be Mined
      • Data warehouse, WWW, time series, textual, spatial, multimedia, transactional
    • Knowledge To Be Mined
      • Classification, prediction, summarization, trend
    • Techniques Utilized
      • Database, machine learning, visualization, statistics
    • Applications Adapted
      • Marketing, demographic segmentation, stock analysis
  • 34. DM In Action
    • Medical applications (clinical diagnosis, drug analysis)
    • Business (marketing segmentation & strategies, insolvency prediction, loan risk assessment)
    • Education (online learning)
    • Internet (search engines)
    • Etc.
  • 35. Data Mining Methodology
    • Hypothesis Testing vs Knowledge Discovery
      • Hypothesis Testing
        • Top down approach
        • Attempts to substantiate or disprove a preconceived idea
      • Knowledge Discovery
        • Bottom-up approach
        • Starts with the data and tries to get it to tell us something we didn't already know
  • 36. Data Mining Methodology
    • Hypothesis Testing
      • Generate good ideas
      • Determine what data would allow these hypotheses to be tested
      • Locate the data
      • Prepare the data for analysis
      • Build computer models based on the data
      • Evaluate the computer models to confirm or reject the hypotheses
  • 37. Data Mining Methodology
    • Knowledge Discovery
      • Directed
        • Identify sources of pre-classified data
        • Prepare the data for analysis
        • Select appropriate KD techniques based on data characteristics and the data mining goal
        • Divide the data into training, testing, and evaluation sets
        • Use the training dataset to build the model
        • Tune the model by applying it to the test dataset
        • Take action based on data mining results
        • Measure the effect of the action taken
        • Restart the DM process taking advantage of new data generated by the action taken
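The step that divides the data into training, testing, and evaluation sets can be sketched as follows. The 60/20/20 proportions and the fixed random seed are assumptions; the slides do not prescribe them.

```python
import random


def split_dataset(records, train=0.6, test=0.2, seed=42):
    """Shuffle the records, then cut them into training, testing, and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # the remainder is held back for evaluation
    )


records = list(range(100))  # stand-in for 100 pre-classified records
training, testing, evaluation = split_dataset(records)
print(len(training), len(testing), len(evaluation))
```

Keeping the three sets disjoint matters: the model is built on the first, tuned on the second, and judged on data it has never seen.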
  • 38. Data Mining Methodology
    • Knowledge Discovery
      • Undirected
        • Identify available data sources
        • Prepare the data for analysis
        • Select appropriate undirected KD techniques based on data characteristics and the data mining goal
        • Use the selected technique to uncover hidden structure in the data
        • Identify potential targets for directed KD
        • Generate new hypotheses to test
  • 39. Question for Group Discussion
  • 40. Revision: Two Approaches in Data Mining
    • Predictive (predict the future value)
      • Classification
      • Regression
      • Time Series Analysis
      • Prediction
    • Descriptive (define relationships among data)
      • Clustering
      • Summarization
      • Association Rules
      • Sequence Discovery
  • 41. Knowledge Discovery Process
  • 42. Knowledge Discovery Process
    • 1.0 Selection
      • The data needed for the data mining process may be obtained from many different and heterogeneous data sources
      • Examples
        • Business Transactions
        • Scientific Data
        • Video and pictures
  • 43. Knowledge Discovery Process
    • 2.0 Preprocessing
    • Main idea: to ensure that the data is clean (high-quality data).
      • The data to be used by the process may have incorrect or missing values.
      • There may be anomalous data from multiple sources involving different data types and metrics
      • Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools)
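A minimal sketch of two common cleaning steps: dropping replicated records and supplying missing values. The records, the field names, and the choice of the mean as the fill value are assumptions for illustration.

```python
def preprocess(records):
    """Remove records with a repeated id, then fill missing grades with the mean grade."""
    seen = set()
    unique = []
    for r in records:
        if r["matric"] not in seen:  # keep only the first record for each student id
            seen.add(r["matric"])
            unique.append(dict(r))   # copy so the input is left untouched
    known = [r["grade"] for r in unique if r["grade"] is not None]
    mean_grade = sum(known) / len(known)  # assumes at least one grade is present
    for r in unique:
        if r["grade"] is None:
            r["grade"] = mean_grade
    return unique


# Hypothetical student records with one duplicate and one missing grade.
raw = [
    {"matric": "A1", "grade": 3.0},
    {"matric": "A2", "grade": None},
    {"matric": "A1", "grade": 3.0},
    {"matric": "A3", "grade": 3.5},
]
clean = preprocess(raw)
print(clean)
```

Filling with the mean is the simplest option; predicting the missing value from the other attributes is the alternative the slide alludes to.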
  • 44. Knowledge Discovery Process
    • 3.0 Transformation
      • Data from different sources must be converted into a common format for processing
      • Some data may be encoded or transformed into more usable formats
      • Example::
        • Data Cleaning, Data Integration, Data Transformation, Data Reduction, and Data Discretization
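Two of the transformations listed above can be sketched briefly: rescaling values to a common range and discretizing them into bands. The income figures, cut points, and band labels are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Map a number to the label of the first band whose upper cut point it falls below."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]  # at or above the last cut point


incomes = [27_000, 32_000, 52_000]
scaled = min_max_scale(incomes)
bands = [discretize(v, [30_000, 50_000], ["low", "medium", "high"]) for v in incomes]
print(scaled, bands)
```

Scaling puts attributes measured in different units on a comparable footing, which matters for algorithms such as neural networks that work on numerical input.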
  • 45. Knowledge Discovery Process
    • 4.0 Data Mining
    • Main idea: to use intelligent methods to extract patterns and knowledge from the database
    • This step applies algorithms to the transformed data to generate the desired results.
    • The heart of the KD process (where unknown patterns are revealed).
    • Examples of algorithms: Regression (classification, prediction), Neural Networks (prediction, classification, clustering), Apriori Algorithm (association rules), K-Means (clustering), K-Nearest Neighbor (classification), Decision Tree (classification), Instance-Based Learning (classification).
  • 46. Knowledge Discovery Process
    • 5.0 Interpretation/Evaluation
      • How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent on it
      • Examples:
        • Graphical
        • Geometric
        • Icon Based
        • Pixel Based
        • Hierarchical Based
        • Hybrid
  • 47. Case Study: Predicting FSK Final-Year Student Performance
    • Source: student database {contains 30,000 records} covering academics and activities
    • Selection: selected academic records {matric, PMK, grades}; only 2,000 records (contains incomplete records, etc.)
    • Pre-processing: clean records {replace missing values, remove replicated records}
    • Transformation: a neural network is used, so values are transformed into numerical form
    • Data mining: generated model Y = w1x1 + w2x2 + b1, a pattern for performance prediction
    • Interpretation & evaluation: testing result 90% correct -> accept model; knowledge (apply model)
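The evaluation step of the case study can be sketched as below. Only the model form Y = w1x1 + w2x2 + b1 and the 90% figure come from the slide; the weights, the decision threshold, the test cases, and the acceptance cutoff are invented for illustration and are not the values from the actual study.

```python
# Hypothetical trained parameters for the model Y = w1*x1 + w2*x2 + b1.
W1, W2, B1 = 0.6, 0.4, -0.5


def predict(x1, x2):
    """Predict 1 (good performance) when the weighted sum is non-negative, otherwise 0."""
    y = W1 * x1 + W2 * x2 + B1
    return 1 if y >= 0 else 0


# Hypothetical test set: (x1, x2, actual outcome), with inputs already scaled to [0, 1].
test_set = [
    (0.9, 0.8, 1), (0.2, 0.3, 0), (0.7, 0.6, 1), (0.4, 0.2, 0), (0.8, 0.9, 1),
    (0.3, 0.4, 0), (0.6, 0.7, 1), (0.1, 0.2, 0), (0.5, 0.3, 1), (0.9, 0.9, 1),
]

correct = sum(predict(x1, x2) == actual for x1, x2, actual in test_set)
accuracy = correct / len(test_set)
accepted = accuracy >= 0.9  # assumed acceptance cutoff
print(accuracy, accepted)
```

Once the model is accepted, it can be applied to new student records to produce predictions, which is the "apply model" step in the diagram.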