Dm week01 intro.handout

Published in: Education
Transcript

  • 1. Christof Monz, Informatics Institute, University of Amsterdam
    Data Mining, Week 1: Introduction
    Today's Class (Christof Monz, Data Mining - Week 1: Introduction)
      • Overview of Data Mining
      • Overview of Machine Learning
      • Course administrivia
  • 2. What's Data Mining?
      • Data: records, web pages, documents, etc.
      • Mining: the process or business of extracting ore or minerals from the ground (The American Heritage Dictionary)
      • Data Mining: the nontrivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data
    Why Data Mining?
      • There is an abundance of data resources: commercial databases, intranets, the Internet, ...
      • These resources contain a large amount of valuable data
      • The best way to structure the data depends on how one wants to exploit it
      • Manual data organization is very laborious and expensive
      • There is a need to automate this process
  • 3. Some Application Areas
      • Customer analysis (what impacts customer behavior?)
      • Medical research (what is the impact of lifestyle/drug effects?)
      • Insurance (risk assessment)
      • Stock investment (which factors impact stock performance?)
      • Fraud detection (when is a transaction likely to be fraudulent?)
    The Need for Automated Analysis
      • Much of the available data is never analyzed!
  • 4. What Is and Isn't Data Mining
      • Looking up John Doe's phone number and address in an electronically available phone book isn't data mining; it is database management
      • Inferring John Doe's phone number from an analysis of a number of web pages, although the information is not expressed explicitly, is data mining
    Situating Data Mining
      • Data mining lies at the intersection of a number of research areas
  • 5. Data Mining Tasks
      • Prediction: use some variables to predict unknown or future values of other variables
      • Description: find human-interpretable patterns that describe the data
    Some Data Mining Tasks
      • Classification (predictive)
      • Clustering (descriptive)
      • Association Rule Discovery (descriptive)
      • Sequential Pattern Discovery (descriptive)
      • Regression (predictive)
      • Deviation Detection (predictive)
  • 6. Classification
      • Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class
      • Find a model for the class attribute as a function of the values of the other attributes
      • Goal: previously unseen records should be assigned a class as accurately as possible
      • A test set is used to determine the accuracy of the model
    Example: Direct Marketing
      • Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
      • Approach:
          • Use the data for a similar product introduced before
          • We know which customers decided to buy and which decided otherwise. This buy/don't buy decision forms the class attribute
          • Collect various demographic, lifestyle, and company-interaction related information about all such customers (where they stay, how much they earn, ...)
          • Use this information as input attributes to learn a classifier model
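The train/test idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the records, the single `income_band` attribute, and the "majority class per attribute value" model are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (income_band, bought). `bought` is the class attribute.
training_set = [
    ("high", True), ("high", True), ("high", False),
    ("low", False), ("low", False), ("low", True),
    ("medium", True), ("medium", False), ("medium", False),
]
# Hypothetical previously unseen records with known outcomes, kept for testing.
test_set = [("high", True), ("low", False), ("medium", True), ("low", False)]


def learn_model(records):
    """Map each attribute value to the majority class seen in the training set."""
    counts = defaultdict(Counter)
    for value, label in records:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(model, records, default=False):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(model.get(value, default) == label for value, label in records)
    return correct / len(records)


model = learn_model(training_set)
test_accuracy = accuracy(model, test_set)
print(model, test_accuracy)
```

The model is learned from the training set only; the test set is touched once, to estimate how well the model does on records it has not seen.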
  • 7. Classify This!
    Some Observations
      • Training data (examples for which the class is known)
      • Feature extraction (what are the 'things' that are relevant to predict a class?)
      • Feature weight (how important is a feature?)
      • Feature combination (sometimes features act together)
      • Over-fitting (some features don't generalize well)
      • Evaluation (how accurate is the prediction?)
  • 8. Machine Learning
      • The research area of machine learning investigates and formalizes the challenge of prediction and description by computer
      • Machine learning plays a central role in data mining
      • It is used for:
          • Building new models
          • Adapting existing models to new situations
          • Comparing the performance of competing models
    Machine Learning is ...
      • ... the principles, methods, and algorithms for learning and prediction on the basis of past experience
      • ... already everywhere: speech recognition, hand-written character recognition, computer vision, information retrieval, operating systems, compilers, fraud detection, security, defense applications, ...
  • 9. Learning
      • Steps:
          • entertain a (biased) set of possibilities
          • adjust predictions based on feedback
          • rethink the set of possibilities
      • Principles of learning are 'universal':
          • society (e.g., scientific community)
          • animal (e.g., human)
          • machine
    Learning and Prediction
      • We make predictions all the time but rarely investigate the processes underlying our predictions
      • In carrying out scientific research we are also governed by how theories are evaluated
      • To automate the process of making predictions we need to understand, in addition, how we search and refine 'theories'
  • 10. Learning: Key Steps
      • Data and assumptions
          • What data is available for the learning task?
          • What can we assume about the problem?
      • Representation
          • How should we represent the examples to be classified?
      • Evaluation and estimation
          • How well are we doing?
          • How do we adjust our predictions based on the feedback?
          • Can we rethink the approach to do even better?
    Example
      • A classification problem: predict the grades for students taking this course
      • Key steps:
          1. data
          2. assumptions
          3. representation
          4. estimation
          5. evaluation
          6. model selection
  • 11. Example
      • Key steps:
          1. data: what 'past experience' can we rely on?
          2. assumptions: what can we assume about the students or the course?
          3. representation: how do we 'summarize' a student?
          4. estimation: how do we construct a map from students to grades?
          5. evaluation: how well are we predicting?
          6. model selection: perhaps we can do even better?
    Example: Data
      • The data we have available (in principle):
          • Names and grades of students in past years' ML courses
          • Academic records of past and current students
      • Training data:
          Student   ML   course 1   course 2   ...
          Peter     A    B          A          ...
          David     B    A          A          ...
      • Test data:
          Student   ML   course 1   course 2   ...
          Jack      ?    C          A          ...
          Kate      ?    A          A          ...
  • 12. Assumptions
      • There are many assumptions we can make to facilitate predictions:
          • The course has remained roughly the same over the years
          • Each student performs independently from others
    Example: Representation
      • Academic records are rather diverse, so we might limit the summaries to a select few courses
      • For example, we can summarize the ith student (say David) with a vector: xi = [B A A]
      • The available data in this representation:
          Training                 Testing
          Student   ML grade       Student   ML grade
          x1        A              x1        ?
          x2        B              x2        ?
          ...       ...            ...       ...
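One way to make such a summary vector usable by a distance-based method is to encode letter grades as numbers. The sketch below is an assumption for illustration only: the slides do not fix an encoding, and the `GRADE_POINTS` mapping, the course names, and the record are hypothetical.

```python
# Assumed numeric encoding of letter grades (not specified in the slides).
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def represent(record, selected_courses):
    """Summarize a student by the encoded grades of a select few courses."""
    return [GRADE_POINTS[record[course]] for course in selected_courses]


# Hypothetical academic record; only the selected courses enter the summary.
david_record = {"course 1": "B", "course 2": "A", "course 3": "A", "course 4": "C"}
x_david = represent(david_record, ["course 1", "course 2", "course 3"])
print(x_david)
```

Which courses to select is itself a modelling decision, which is why representation appears as a separate key step.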
  • 13. Example: Estimation
      • Given the training data
          Student   ML grade
          x1        A
          x2        B
          ...       ...
        find a mapping from input vectors x to 'labels' y encoding the grades for the ML course
      • Possible solution (nearest neighbor classifier):
          1. For any student x in the test set, find the 'closest' student xi in the training set
          2. Predict yi, the grade of the closest student
    Example: Evaluation
      • How can we tell how good our predictions are?
          • We can wait till the end of this course
          • We can try to assess the accuracy based on the data we already have (part of the training data)
      • Possible solution:
          • Divide the training set further into training and test sets
          • Evaluate the classifier constructed on the basis of only the smaller training set on the new test set
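The nearest neighbor classifier and the hold-out evaluation described on this slide can be sketched as follows. The data, the numeric grade encoding (A=4, B=3, C=2), and the use of Euclidean distance as the notion of 'closest' are assumptions for illustration; the slides leave the distance unspecified.

```python
import math

# Hypothetical labelled data: (encoded grades in three earlier courses, ML grade).
# The smaller training set used to build the classifier:
train = [
    ([4, 4, 4], "A"),
    ([3, 3, 3], "B"),
    ([2, 2, 2], "C"),
]
# The part of the original training data held out as a new test set:
held_out = [
    ([4, 4, 3], "A"),
    ([3, 2, 3], "B"),
    ([3, 2, 2], "C"),
    ([3, 4, 4], "B"),
]


def nearest_neighbor(x, training):
    """Return the label y_i of the training vector x_i closest to x."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], x))
    return label


def accuracy(training, examples):
    """Fraction of held-out examples whose predicted grade is correct."""
    hits = sum(nearest_neighbor(x, training) == y for x, y in examples)
    return hits / len(examples)


held_out_accuracy = accuracy(train, held_out)
print(held_out_accuracy)
```

In this toy data the last held-out student is closer to the 'A' student than to the 'B' student, so the classifier gets that one wrong; the held-out accuracy exposes the error without waiting for the course to end.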
  • 14. Example: Model Selection
      • We can refine:
          • the estimation algorithm (e.g., using a classifier other than the nearest neighbor classifier)
          • the representation (e.g., base the summaries on a different set of courses)
          • the assumptions (e.g., perhaps students work in groups)
          • etc.
      • We have to rely on the method of evaluating the accuracy of our predictions to select among the possible refinements
    Types of Learning Approaches
      • Supervised learning: where we get a set of training inputs and outputs
          • E.g., classification, regression
      • Unsupervised learning: where we are interested in capturing inherent organization in the data
          • E.g., clustering, density estimation
      • Reinforcement learning: where we only get feedback in the form of how well we are doing (not what we should be doing)
          • E.g., planning
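Model selection by evaluated accuracy might look like the sketch below, which compares candidate representations (different subsets of courses) for a nearest neighbor classifier and keeps the one that scores best on held-out data. The data, the grade encoding (A=4, B=3, C=2), the Euclidean distance, and the candidate set are all assumptions made up for the example.

```python
import math

# Hypothetical data: ([grade in course 0, grade in course 1], ML grade).
train = [([4, 2], "A"), ([3, 4], "B"), ([2, 3], "C")]
validation = [([4, 4], "A"), ([3, 2], "B"), ([2, 2], "C")]

# Candidate representations: which course indices enter the summary vector.
candidates = [(0,), (1,), (0, 1)]


def project(x, courses):
    """Keep only the grades of the selected courses."""
    return [x[i] for i in courses]


def predict(x, training, courses):
    """Nearest neighbor prediction under a given representation."""
    _, label = min(
        training,
        key=lambda pair: math.dist(project(pair[0], courses), project(x, courses)),
    )
    return label


def accuracy(training, examples, courses):
    """Fraction of examples predicted correctly under a representation."""
    hits = sum(predict(x, training, courses) == y for x, y in examples)
    return hits / len(examples)


scores = {c: accuracy(train, validation, c) for c in candidates}
best = max(scores, key=scores.get)
print(scores, best)
```

Here course 0 alone happens to predict the ML grade best, while adding the uninformative course 1 hurts. The same loop could compare different classifiers instead of different representations.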
  • 15. Challenges of Data Mining
      • Scalability
      • Dimensionality/complexity
      • Data quality
      • Data ownership
      • Privacy considerations
      • Continually updated data
    Recap
      • Difference between data mining and other research areas
      • Applications of data mining
      • Need for automation and the use of machine learning
      • Key steps in machine learning
  • 16. About This Course
      • This course does not:
          • give a comprehensive introduction to data mining
          • cover how to adapt data mining to specific applications
          • cover feature extraction
          • cover evaluation issues in detail
      • This course does:
          • focus on the predominant approach in data mining: machine learning
          • sketch some of the example applications
          • introduce a representative selection of machine learning techniques used in data mining
          • focus on the algorithmic fundamentals of machine learning
    Approaches Covered
      • Linear Regression (regression)
      • Decision Trees (classification)
      • Neural Networks (classification)
      • k-Nearest-Neighbors (classification)
      • Naive Bayes (classification)
      • K-Means (clustering)
      • Hierarchical Clustering (clustering)
  • 17. What to Get out of This Course
      • At the end of this course you will have learned:
          • what type of problems can be addressed by data mining techniques
          • what the most common machine learning approaches in data mining are
          • which machine learning approaches are appropriate for a given type of data mining application
          • the algorithmic fundamentals of a number of relevant machine learning approaches
    Course Administrivia
      • The exam counts for 40%, homework for 20%, and practical assignments for 40%
      • Lectures are on Tuesday 9-11am (D1.116)
      • Tutorials (werk colleges) are on Thursday 9-11am (G0.05) and Friday 9-11am (G5.29)
      • Labs are on Thursday 11am-1pm (G0.18) or Friday 11am-1pm (G0.18)
  • 18. Course Administrivia
      • Teaching assistants:
          • Yijin He (email: jiyinhe@gmail.com) (English only!)
          • Spyros Martzoukos (email: S.Martzoukos@uva.nl) (English only!)
      • Course web page: on Blackboard
      • Check the course web page regularly for announcements, slides, ...
