Your SlideShare is downloading. ×
  • Like
Data Mining
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply
Published

 

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
263
On SlideShare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
6
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Data Mining CS57300 / STAT 59800-024 Purdue University January 13, 2009 1 Introduction • What is data mining? • Why now? • Data mining process • Example 2
  • 2. What is data mining? “... the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.” Frawley, Piatetsky-Shapiro, and Matheus (1992) “... a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them.” Feldman and Dagan (1995) 3 What is data mining? Statistics Data Mining Databases Visualization Artificial Intelligence Also known as: knowledge discovery, exploratory data analysis, applied statistics, machine learning 4
  • 3. Why now? http://images.forbes.com/images/forbes/2004/1213/113_chart.gif 5 How much information? Lyman and Varian, UCBerkeley (2003) • ~5 exabytes of new information stored in 2002 • 1 Exabyte = 1000 petabytes = 1 mil terabytes = 1 bil gigabytes • The amount of new information stored has about doubled in the last three years • Almost 18 exabytes of information flowed through electronic channels in 2002 • 98% percent of this total is the information sent and received in telephone calls 6
  • 4. Knowledge Data mining process Discovery in Databases: Process Interpretation/ Evaluation Data Mining Knowledge Knowledge Preprocessing Patterns Selection Preprocessed Data Data Target Data adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press CS590D 13 7 6 Data mining process 1.Application setup: 3.Data preprocessing • Acquire relevant domain • Remove noise or outliers knowledge • Handle missing values • Assess user goals • Account for time or other 2.Data selection changes • Choose data sources 4.Data transformation • Identify relevant attributes • Find useful features • Sample data • Reduce dimensionality 8
  • 5. Data mining process 5.Data mining: 6.Interpretation/evaluation • Choose task (e.g., • Assess accuracy of model/ classification, regression, results clustering) • Interpret model for end-users • Choose algorithms for learning and inference • Consolidate knowledge • Set parameters Example 7.Repeat... • Apply algorithms to search for patterns of interest 9 Example Example problem 10 6
  • 6. Example rule (1) Example problem Example 11 6 Example rule (2) Example problem 12 6
  • 7. How did you devise rules? • Look for characteristics of one set but not the other? • Reject potential rules that didn’t cover enough examples? • Examine several potential rules? • Consider simple rules first? 13 This is data mining... • Data representation: Describe the data • Task specification: Outline the goal(s) • Knowledge representation: Describe the rules • Learning technique: • Search: Identify a rule • Evaluation function: Estimate confidence • Inference technique: Apply the rule • Data mining system: Do above in combination 14
  • 8. Complexities • Data size: vastly larger or changing rapidly • Data representation: can affect ability to learn and interpret models • Knowledge representation: needs to capture more subtle forms of probabilistic dependence • Search space: vastly larger • Evaluation functions: difficult to assess confidence in model utility 15 Course overview 16
  • 9. Goals • Identify key elements of data mining systems and the knowledge discovery process • Understand how algorithmic elements interact • Recognize various types of data mining tasks • Familiarity with standard models/algorithms • Implement and apply basic algorithms • Understand how to evaluate performance 17 Topics • Elements of data mining algorithms • Data preparation and exploration • Statistical foundations • Predictive modeling • Descriptive modeling • Pattern mining and anomaly detection • Data mining systems • Current research topics (as time permits) 18
  • 10. Logistics • Time and location: TTh 12:00-1:15, Lawson 1106 • Instructor: Jennifer Neville neville@cs.purdue.edu, LWSN 2142D, office hours: By appt • Teaching assistant: Rongjing Xiang rxiang@cs.purdue.edu, LWSN B116, office hours: TBA • Webpage: http://www.cs.purdue.edu/~neville/courses/CS573.html • Email list: spring2009-cs-57300-001@lists.purdue.edu • Prerequisites: introductory statistics course (e.g., STAT 516), basic programming skills (e.g., CS381, STAT598G) 19 Registration • The course is currently at capacity • Waiting list procedure: • Go to CS department website Courses Registration • Fill out form to request registration (due to lack of space) • Use CS course number 57300, if you want to register for ST59800-024 include “(STAT)” afterward • Students will be admitted on a first-come, first-serve basis as space becomes available 20
  • 11. Workload • Readings • Principles of Data Mining, Hand, Mannila, and Smyth, MIT Press, 2001. • Additional readings: TBA • Homeworks • Five homeworks including written exercises, programming assignments, analysis in R • Late policy: 10% off per day late, maximum of 5 days; four extension days can be applied anytime during semester 21 Workload (cont.) • Review report • Select three recent conference papers on a subtopic of data mining/ machine learning • Write a 3 page report outlining the main ideas of papers and relate them to the context of the course • Project (details to follow) • Exams • Midterm exam: Mar 10, 8-10pm Lawson B151 • Final exam: No final exam, CS qualifying exam will be cover full semester 22
  • 12. Project • Goal: Experience the process of data mining • Data preparation • Task definition • Feature identification/construction • Algorithm selection, parameter tuning • Experimental application and evaluation • Iterative refinement 23 Project (cont.) • Focus of project • Choose data, apply !2 existing methods and compare performance, explore reasons for underlying performance • Choose data, formulate novel task, design new method or extend existing methods, evaluate on data • Hypothesis testing: You must formulate and test at least two specific hypotheses in your project. • Project proposal: due Feb 17 • Final report: due Apr 30 24
  • 13. Grading 5% 25% 10% 35% 25% Participation Presentation Homework Midterm Project 25 Acknowledgements • Some of the lecture material is drawn from other data mining courses, which use PDM: • Professor David Jensen at UMass Amherst (CS591Y) • Professor Lise Getoor at UMaryland College Park (CS828G) 26
  • 14. Next class • Reading: Chapter 1 PDM • Topic: Elements of data mining algorithms 27