Data Mining

CS57300 / STAT 59800-024
Purdue University

January 13, 2009




Introduction

• What is data mining?
• Why now?
• Data mining process
• Example

What is data mining?

“... the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”
Frawley, Piatetsky-Shapiro, and Matheus (1992)

“... a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them.”
Feldman and Dagan (1995)

What is data mining?

[Diagram relating Data Mining to Statistics, Databases, Visualization, and Artificial Intelligence]

Also known as: knowledge discovery, exploratory data analysis, applied statistics, machine learning

Why now?

[Chart: http://images.forbes.com/images/forbes/2004/1213/113_chart.gif]

How much information?

Lyman and Varian, UC Berkeley (2003)

• ~5 exabytes of new information stored in 2002
• 1 exabyte = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes
• The amount of new information stored has about doubled in the last three years
• Almost 18 exabytes of information flowed through electronic channels in 2002
• 98% of this total is the information sent and received in telephone calls

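The unit relationships on this slide can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) prefixes, as the slide's "1 exabyte = 1,000 petabytes" implies; the function and constant names are illustrative, and 18 exabytes is used as a round upper bound for "almost 18".

```python
# Decimal (SI) storage units, matching the slide's 1 EB = 1,000 PB convention.
GB_PER_TB = 1000
TB_PER_PB = 1000
PB_PER_EB = 1000


def exabytes_to_gigabytes(exabytes):
    """Convert exabytes to gigabytes using decimal prefixes."""
    return exabytes * PB_PER_EB * TB_PER_PB * GB_PER_TB


# ~5 EB of new information stored in 2002, expressed in gigabytes.
new_stored_2002_gb = exabytes_to_gigabytes(5)

# Assumption: treat "almost 18 EB" as 18 EB, so this is an upper bound
# on the telephone share (98% of the electronic flow).
telephone_share_eb = 18 * 98 / 100

print(new_stored_2002_gb, telephone_share_eb)
```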
Data mining process

Knowledge Discovery in Databases: Process

[Diagram: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Data mining process

1. Application setup:
   • Acquire relevant domain knowledge
   • Assess user goals
2. Data selection
   • Choose data sources
   • Identify relevant attributes
   • Sample data
3. Data preprocessing
   • Remove noise or outliers
   • Handle missing values
   • Account for time or other changes
4. Data transformation
   • Find useful features
   • Reduce dimensionality

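Steps 2 through 4 above can be sketched in a few small functions. This is only an illustration under stated assumptions: the records, attribute names, median imputation, the median-absolute-deviation outlier rule, and min-max scaling are all choices made here for the example, not methods prescribed by the slides. Accounting for time or other changes and dimensionality reduction are not covered.

```python
import random
import statistics


def select_attributes(records, attributes):
    """Data selection: keep only the attributes judged relevant."""
    return [{a: r.get(a) for a in attributes} for r in records]


def fill_missing(records, attribute):
    """Preprocessing: replace missing (None) values with the observed median."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    median = statistics.median(observed)
    return [
        {**r, attribute: median if r[attribute] is None else r[attribute]}
        for r in records
    ]


def drop_outliers(records, attribute, max_dev=3.0):
    """Preprocessing: drop records more than max_dev median absolute
    deviations away from the median (an illustrative robust rule)."""
    values = [r[attribute] for r in records]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(records)
    return [r for r in records if abs(r[attribute] - med) <= max_dev * mad]


def min_max_scale(records, attribute):
    """Transformation: rescale a numeric attribute to the range [0, 1]."""
    values = [r[attribute] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant attributes
    return [{**r, attribute: (r[attribute] - lo) / span} for r in records]


# Hypothetical records; the fields and values are made up for illustration.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 51000},
    {"id": 4, "age": 41, "income": 50000},
    {"id": 5, "age": 38, "income": 990000},
]

selected = select_attributes(records, ["age", "income"])  # step 2
filled = fill_missing(selected, "age")                    # step 3
cleaned = drop_outliers(filled, "income")                 # step 3
scaled = min_max_scale(cleaned, "income")                 # step 4
sample = random.Random(0).sample(scaled, 2)               # step 2: sample data
```

In practice the order and choice of these operations depend on the domain knowledge and user goals gathered in step 1.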
Data mining process

5. Data mining:
   • Choose task (e.g., classification, regression, clustering)
   • Choose algorithms for learning and inference
   • Set parameters
   • Apply algorithms to search for patterns of interest
6. Interpretation/evaluation
   • Assess accuracy of model/results
   • Interpret model for end-users
   • Consolidate knowledge
7. Repeat...

Example

[Figure: Example problem]

Example rule (1)

[Figure: Example problem]

Example rule (2)

[Figure: Example problem]

How did you devise rules?

• Look for characteristics of one set but not the other?
• Reject potential rules that didn’t cover enough examples?
• Examine several potential rules?
• Consider simple rules first?

This is data mining...

• Data representation: Describe the data
• Task specification: Outline the goal(s)
• Knowledge representation: Describe the rules
• Learning technique:
  • Search: Identify a rule
  • Evaluation function: Estimate confidence
• Inference technique: Apply the rule
• Data mining system: Do above in combination

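The elements listed above can be made concrete with a tiny single-condition rule learner. This is a sketch under assumptions, not the example from the lecture: the toy examples, the attribute names, the `min_coverage` threshold, and the choice to score rules by (confidence, coverage) are all hypothetical.

```python
from collections import Counter


def candidate_rules(examples, label_key):
    """Search space: every simple rule of the form 'attribute == value'."""
    seen = set()
    for example in examples:
        for attr, value in example.items():
            if attr != label_key and (attr, value) not in seen:
                seen.add((attr, value))
                yield attr, value


def evaluate(rule, examples, label_key):
    """Evaluation function: majority label among covered examples,
    its confidence (fraction of covered examples), and the coverage."""
    attr, value = rule
    covered = [ex[label_key] for ex in examples if ex.get(attr) == value]
    if not covered:
        return None, 0.0, 0
    label, hits = Counter(covered).most_common(1)[0]
    return label, hits / len(covered), len(covered)


def learn_rule(examples, label_key, min_coverage=2):
    """Search: examine all simple rules, reject those covering too few
    examples, and keep the one with the best (confidence, coverage)."""
    best = None
    for rule in candidate_rules(examples, label_key):
        label, confidence, coverage = evaluate(rule, examples, label_key)
        if coverage < min_coverage:
            continue
        score = (confidence, coverage)
        if best is None or score > best[0]:
            best = (score, rule, label)
    return best


def apply_rule(rule, label, example, default=None):
    """Inference: predict the rule's label when its condition holds."""
    attr, value = rule
    return label if example.get(attr) == value else default


# Data representation: hypothetical attribute-value examples with a class label.
examples = [
    {"shape": "circle", "color": "red", "size": "small", "label": "A"},
    {"shape": "circle", "color": "blue", "size": "large", "label": "A"},
    {"shape": "circle", "color": "red", "size": "large", "label": "A"},
    {"shape": "square", "color": "red", "size": "small", "label": "B"},
    {"shape": "square", "color": "blue", "size": "small", "label": "B"},
    {"shape": "triangle", "color": "blue", "size": "large", "label": "B"},
]

score, rule, label = learn_rule(examples, "label")
prediction = apply_rule(rule, label, {"shape": "circle", "color": "blue"})
```

Each function maps to one bullet: the list of dictionaries is the data representation, the (attribute, value, label) triple is the knowledge representation, `learn_rule` is the search, `evaluate` is the evaluation function, and `apply_rule` is the inference technique.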
Complexities

• Data size: vastly larger or changing rapidly
• Data representation: can affect ability to learn and interpret models
• Knowledge representation: needs to capture more subtle forms of probabilistic dependence
• Search space: vastly larger
• Evaluation functions: difficult to assess confidence in model utility

Course overview

Goals

• Identify key elements of data mining systems and the knowledge discovery process
• Understand how algorithmic elements interact
• Recognize various types of data mining tasks
• Become familiar with standard models/algorithms
• Implement and apply basic algorithms
• Understand how to evaluate performance

Topics

• Elements of data mining algorithms
• Data preparation and exploration
• Statistical foundations
• Predictive modeling
• Descriptive modeling
• Pattern mining and anomaly detection
• Data mining systems
• Current research topics (as time permits)

Logistics

• Time and location: TTh 12:00-1:15, Lawson 1106
• Instructor: Jennifer Neville
  neville@cs.purdue.edu, LWSN 2142D, office hours: By appt
• Teaching assistant: Rongjing Xiang
  rxiang@cs.purdue.edu, LWSN B116, office hours: TBA
• Webpage: http://www.cs.purdue.edu/~neville/courses/CS573.html
• Email list: spring2009-cs-57300-001@lists.purdue.edu
• Prerequisites: introductory statistics course (e.g., STAT 516), basic programming skills (e.g., CS381, STAT598G)

Registration

• The course is currently at capacity
• Waiting list procedure:
  • Go to the CS department website: Courses, Registration
  • Fill out form to request registration (due to lack of space)
  • Use CS course number 57300; if you want to register for ST59800-024, include “(STAT)” afterward
  • Students will be admitted on a first-come, first-served basis as space becomes available

Workload

• Readings
  • Principles of Data Mining, Hand, Mannila, and Smyth, MIT Press, 2001.
  • Additional readings: TBA
• Homeworks
  • Five homeworks including written exercises, programming assignments, analysis in R
  • Late policy: 10% off per day late, maximum of 5 days; four extension days can be applied anytime during semester

Workload (cont.)

• Review report
  • Select three recent conference papers on a subtopic of data mining/machine learning
  • Write a 3-page report outlining the main ideas of the papers and relating them to the context of the course
• Project (details to follow)
• Exams
  • Midterm exam: Mar 10, 8-10pm, Lawson B151
  • Final exam: No final exam; the CS qualifying exam will cover the full semester

Project

• Goal: Experience the process of data mining
  • Data preparation
  • Task definition
  • Feature identification/construction
  • Algorithm selection, parameter tuning
  • Experimental application and evaluation
  • Iterative refinement

Project (cont.)

• Focus of project
  • Choose data, apply ≥2 existing methods and compare performance, explore reasons for underlying performance
  • Choose data, formulate novel task, design new method or extend existing methods, evaluate on data
• Hypothesis testing: You must formulate and test at least two specific hypotheses in your project.
• Project proposal: due Feb 17
• Final report: due Apr 30

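One possible way to test a hypothesis such as "method A outperforms method B on this data" is a paired sign-flip permutation test on per-fold scores. The slides do not prescribe a particular test; this is a sketch of one option, and the accuracy values below are made up purely for illustration.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired score differences.

    Null hypothesis: the two methods perform equally, so the sign of each
    paired difference is arbitrary. Returns the fraction of sign
    assignments whose absolute summed difference is at least as large
    as the observed one (the p-value).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2^n sign assignments; feasible for a small number of folds.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point noise
            extreme += 1
    return extreme / total


# Hypothetical per-fold accuracies for two methods on the same 8 folds.
accuracy_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.85, 0.78]
accuracy_b = [0.76, 0.77, 0.80, 0.79, 0.78, 0.80, 0.81, 0.75]

p_value = paired_permutation_test(accuracy_a, accuracy_b)
print(f"p-value: {p_value:.4f}")
```

Pairing the scores by fold matters: both methods must be evaluated on the same splits for the differences to be meaningful.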
Grading

[Pie chart of grade weights: 5%, 25%, 10%, 35%, 25%; categories: Participation, Presentation, Homework, Midterm, Project]

Acknowledgements

• Some of the lecture material is drawn from other data mining courses, which use PDM:
  • Professor David Jensen at UMass Amherst (CS591Y)
  • Professor Lise Getoor at UMaryland College Park (CS828G)

Next class

• Reading: Chapter 1 PDM
• Topic: Elements of data mining algorithms
