STAT 4310


  1. 1. Σtatistics αt KΣU PROPOSAL DOCUMENT NEW CLASS: STATISTICAL DATA MINING IN MINOR IN STATISTICS AND DATA ANALYSIS DEPARTMENT OF MATHEMATICS AND STATISTICS KENNESAW STATE UNIVERSITY COORDINATED BY XUELEI (SHERRY) NI ASSISTANT PROFESSOR OF STATISTICS xni2@kennesaw.edu JENNIFER LEWIS PRIESTLEY ASSISTANT PROFESSOR OF STATISTICS jpriestl@kennesaw.edu 1
  2. 2. Table of Contents Page New Course Proposal Form 2 Syllabus 8 Weekly Schedule 12 Signatures 13 2
  3. 3. KENNESAW STATE UNIVERSITY UNDERGRADUATE PROPOSAL New Course (NOT General Education) I. Proposed Information Course Prefix and Number: STAT4310 Course Title: Statistical Data Mining Credit Hours (format should be # - # - #): 3 - 0 - 3 Prerequisites: STAT 3130 (Prerequisites are courses or requirements that are non-negotiable and must be successfully completed by any student before enrolling in the course or program under consideration. Corequisites are courses that can be taken before or in the same semester as the course under consideration. Courses at the upper-division level will require lower-division competencies or prerequisites.) Course Description for the Catalog: Data Mining is an information extraction activity whose goal is to discover hidden facts contained in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. The process includes data selection, cleaning, coding, using different statistical, pattern recognition and machine learning techniques, and reporting and visualization of the generated structures. The course will cover all these issues and will illustrate the whole process by examples of practical applications. The students will be encouraged to use recent Data Mining software. II. Justification for Course A. Explain assessment findings which led to course development. As with the other courses in the Minor in Applied Statistics and Data Analysis, Statistical Data Mining was developed after interviews were conducted with faculty members from across KSU. Statistical Data Mining will cover the major principles and techniques used in data mining, which can be applied in many areas.
According to a Google search on the keywords "Data Mining", numerous applications can be found in 1) science (astronomy, bioinformatics, drug discovery, …); 2) business (advertising, customer modeling and customer relationship management, e-commerce, fraud detection, health care, investments, manufacturing, sports/entertainment, telecom, target marketing, …); 3) web (search engines, bots, …); 4) government (anti-terrorism efforts, profiling tax cheaters, …). The principles and techniques covered in this course were identified as those modeling concepts most needed by students, whether they pursue professions in their respective disciplines immediately after graduation or choose to pursue a graduate degree. B. Explain for Prerequisites:
  4. 4. 1. What is the substance of content in each prerequisite that commands its inclusion as a prerequisite to the proposed course? The pre-requisites for STAT4310 include STAT3120 and STAT3130, Statistical Methods I and II, respectively. Statistical Data Mining will draw heavily on the base statistical concepts taught in the Methods courses. STAT3010 is also a pre-requisite for STAT4310. STAT3010 is the basic statistical software class. Students in Data Mining need these software techniques to implement the algorithms learned in Data Mining. 2. What is the desired sequence of prerequisites? Students will have taken STAT3010 (Statistical Computing) prior to STAT3120 and STAT3130. STAT3120 and STAT3130 are Statistical Methods I and II, respectively, so STAT3120 should be taken prior to STAT3130. 3. What is the rationale for requiring the above sequence of prerequisites? Since statistical packages are heavily used to do statistical analysis, students need the skills developed in STAT3010 to implement the statistical methods learned in the later courses, including the new Statistical Data Mining course. 4. How often are the required prerequisites offered? STAT3010 is offered every semester. STAT3120 is offered every fall semester and STAT3130 is offered every spring semester. C. Give any other justification for the course. III. Additional Information A. Where does this course fit sequentially and philosophically within the program of study. Statistics is an inherently interdisciplinary topic. For this reason, the Department of Mathematics and Statistics sought input regarding the content and execution of the Minor in Applied Statistics and Data Analysis from a cross-section of departments, including: Psychology, Marketing, Nursing, Political Science, Economics, Information Systems, Biology and Chemistry. This course could be taken as a free/inter-disciplinary elective to complement any program of study on campus. B.
What efforts have been made to ensure that this course does not duplicate the content of other college courses with similar titles, purposes, or content? Interviews were conducted with faculty members from other departments. During these inter-disciplinary interviews, it was found that two courses related to data mining are offered or planned in the CSIS (Computer Science and Information Systems) major: CSIS4490 (Data Warehousing and Business Intelligence, to be offered in 2007) and CSIS4491 (Modern Information Retrieval, currently offered). After meeting with the CSIS faculty, it was also found that CSIS4490/4491 and STAT4310 emphasize different aspects of data mining. CSIS4490 and CSIS4491 focus more on the real application of data mining in areas such as data warehousing, business intelligence, and web searching, while STAT4310 will focus more on the underlying statistical principles. It was determined that STAT4310 complements CSIS4490/4491. Students can learn the basic statistical principles and techniques in STAT4310 and then see the real application of these principles and techniques, along with the limitations and advantages that arise in specific situations, in CSIS4490/4491.
  5. 5. C. Where will the course be located in the program (elective, required in Area F, required or elective for the major)? Indicate and justify its placement in the curriculum. For most majors, this course will represent an upper level free or inter-disciplinary elective. D. How often will this course be offered? STAT4310 will initially be offered one semester per year. Additional offerings will be evaluated based upon demand. E. All sections of the course will be taught with the understanding that the following apply: 1. Purpose of the Course Instruct students in the basic concepts and techniques of Data Mining, and provide them hands-on experience in applying the concepts to real-world applications. 2. Objectives of the Course Upon completion of the course, students will be able to 1) Fully appreciate the concept of data as a strategic resource; 2) Understand how and when data mining can be used as a problem-solving technique; 3) Describe different methods of data mining; 4) Select an appropriate data mining technique for a specific problem; 5) Use existing data mining software to mine a prepared data set; 6) Describe the preprocessing, the analysis, and the results clearly in writing and orally; 7) Assess data analyses performed by others. 3. Course Content Topics include data preprocessing, linear models, linear discriminant analysis, logistic regression, classification tree, k-nearest neighbor, Bayesian classifiers, overfitting and model evaluation, center-based cluster analysis, hierarchical cluster analysis, cluster validation, and association analysis. F. What instructional methodologies will be incorporated into the course to stimulate group process, writing skills, multiculturalism, and educational outcomes?
The instruction in this course will combine a traditional approach to introducing the theoretical aspects of the topics with in-class exercises done individually and in groups, where students pursuing similar majors will be encouraged to work together. The students will use the appropriate software to analyze the data they have gathered, interpret the results, translate those results into a written document, and give an oral presentation in class. G. Outline the plan for continuous course assessment. What are the department, school, college, or professional standards which will be used for the assessment? How will it be determined that the course is current, meeting the educational needs of students and responsive to
  6. 6. educational standards? How often will the course assessment be done by the department? The success of the course will be evaluated within the context of the Minor in Applied Statistics and Data Analysis. Specifically, the Minor has been developed with a focus on two objectives: 1) provide graduates of KSU with a minor course of study that will enhance their degree, resulting in better job opportunities. 2) provide individuals who choose to pursue a master's or Ph.D. in their respective discipline, with the statistical background necessary to succeed in their respective programs. Evaluation of entire minor will take place after year 3, when it is expected that the first round of graduates will have entered into their profession or their next program of study. Interviews and questionnaires will be used to determine the extent to which the Minor in Applied Statistics and Data Analysis supported their pursuits. H. REQUIRED SYLLABUS CONTENTS (See Faculty Handbook on page 3.10 for details about KSU syllabi.) 1) Course Prefix Number and Title STAT4310 2) Instructor: Dr. Xuelei (Sherry) Ni Office: Science 462 Telephone: 678-797-2251 3) Learning Objectives Please see the attached syllabus. 4) Text(s) Please see the attached syllabus. 5) Course Requirements/Assignments Please see the attached syllabus. 6) Evaluation and Grading Please see the attached syllabus. 7) Weekly Schedule of Topics Please see the attached schedule. 8) Academic Honesty Statement Please see the attached syllabus. 9) Attendance Policy Please see the attached syllabus. IV. Resources and Funding Required A. What resources will be redirected to accommodate this course? Not applicable B. Explain what items will cause additional cost to the department/school/college. There are no incremental costs anticipated. 6
  7. 7. Personnel Computer Technology Library resources Equipment Space 7
  8. 8. V. COURSE MASTER FORM This form will be completed by the requesting department and will be sent to the Office of the Registrar once the course has been approved by the Office of the President. The form is required for all new courses. DISCIPLINE:  Statistics  COURSE NUMBER: STAT 4310 COURSE TITLE FOR LABEL:  Statistical Data Mining     (Note: Limit 30 spaces) CLASS-LAB-CREDIT HOURS:  3    -  0     -  3     Approval, Effective Semester: Fall  2007  Grades Allowed (Regular or S/U): Regular If course used to satisfy CPC, what areas?       Learning Support Programs courses which are required as prerequisites:       APPROVED: ______________________________________________________________________________ Vice President for Academic Affairs or Designee 8
  9. 9. Σtatistics αt KΣU Course: STAT 4310 Data Mining Instructor: Dr. Xuelei (Sherry) Ni Office: Science 462 Office Hours: By appointment Email: xni2@kennesaw.edu Course Pre-requisite: Basic knowledge of algebra, discrete math, college level calculus, and statistics. According to the semester system, you should have taken STAT3120 and STAT3130. If you are uncertain about your prerequisite knowledge for this class, please review the appendixes in the course textbook. Course Text (Suggested, not Required): • Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005). Introduction to Data Mining. Addison Wesley. ISBN: 0-321-32136-7. Website: http://www-users.cs.umn.edu/~kumar/dmbook/index.php • Jiawei Han and Micheline Kamber (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. ISBN: 1-55860-489-8 • Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag. ISBN: 0-387-95284-5 • David Hand, Heikki Mannila, and Padhraic Smyth (2001). Principles of Data Mining. The MIT Press. ISBN: 0-262-08290-X Course Software: This course will use SAS frequently. Some Matlab applications will also be shown. Students are also encouraged to use any data-mining software package. This class will focus on the method, not the software. Students are expected to become familiar with the software on their own. Course Description: Data Mining is an information extraction activity whose goal is to discover hidden facts contained in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. The process includes data selection, cleaning, coding, using different statistical, pattern recognition and machine learning techniques, and reporting and visualization of the generated structures. The course will cover all these issues and will illustrate the whole process by examples of practical applications.
The students will be encouraged to use recent Data Mining software. Course Objectives: 1. To introduce students to the basic concepts and techniques of Data Mining; 2. To provide hands-on experience in applying the concepts to real-world applications; 3. To gain experience of doing independent study and research. 9
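As a rough illustration of the process named in the course description (data selection, cleaning, applying a statistical technique, and reporting), the following minimal sketch may help. It is not part of the course materials: the course itself uses SAS and Matlab, and the Python code, the toy records, and every function name here (select, clean, fit_line, run_pipeline) are assumptions made only for this example.

```python
from statistics import mean

# Hypothetical raw records, invented for illustration; None marks a missing value.
RAW = [
    {"id": 1, "hours": 1.0, "score": 52.0},
    {"id": 2, "hours": 2.0, "score": 54.0},
    {"id": 3, "hours": None, "score": 70.0},
    {"id": 4, "hours": 3.0, "score": 56.0},
    {"id": 5, "hours": 4.0, "score": 58.0},
]


def select(records, fields):
    """Data selection: keep only the fields of interest."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that contain any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def fit_line(xs, ys):
    """Statistical technique: least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def run_pipeline(records):
    """Run selection, cleaning, and model fitting, and return a small summary."""
    rows = clean(select(records, ["hours", "score"]))
    xs = [r["hours"] for r in rows]
    ys = [r["score"] for r in rows]
    intercept, slope = fit_line(xs, ys)
    return {"n_used": len(rows), "intercept": intercept, "slope": slope}


if __name__ == "__main__":
    # Reporting: print a short summary of the fitted structure.
    result = run_pipeline(RAW)
    print(f"rows used: {result['n_used']}")
    print(f"score ~ {result['intercept']:.2f} + {result['slope']:.2f} * hours")
```

The simple linear fit stands in for the linear models listed in the course content; any of the other listed techniques could take its place in the same select, clean, fit, and report sequence.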
  10. 10. Content: • Introduction o What is data mining? And why data mining? • Data and Preprocessing o Data sampling, data cleaning, descriptive statistics, curse of dimensionality, and feature selection • Predictive Modeling for Regression o Linear Models and Least Squares Fitting o Generalized Linear Models • Classification Methods o Linear methods (Linear Discriminant Analysis, Logistic Regression) o Tree-based methods o Naïve Bayesian methods o Model evaluation • Cluster Analysis o Partitional and hierarchical clustering methods o Graph-based and density-based methods [optional] o Cluster validation • Association Analysis o Apriori algorithm and its extensions o Pattern evaluation (subjective and objective interestingness measures) o Sequential patterns and frequent subgraph mining [optional] Learning Objectives: Upon completion of the course, students will be able to 1. Fully appreciate the concept of data as a strategic resource; 2. Understand how and when data mining can be used as a problem-solving technique; 3. Describe different methods of data mining; 4. Select an appropriate data mining technique for a specific problem; 5. Use existing data mining software to mine a prepared data set; 6. Describe the preprocessing, the analysis, and the results clearly in writing and orally; 7. Assess data analyses performed by others. Grading: • Distribution of points: 1. Class Attendance + Discussion 10% 2. Homework 20 %
  11. 11. 3. Take-home Midterm 30 % 4. Final Project 40 % (Presentation 15% + Report 25%) • Final Projects (choose one of the following): o Review and present papers Each student who chooses this project will be expected to read, review, and present a paper from the research literature. Candidate papers will be provided on the class website. For the presentation, treat it as if you were presenting the paper at a conference: be prepared to answer detailed technical questions. If you feel the work has problems, feel free to critique it. Guidelines for the review report are available on the class website. o Apply learned methods to a real-world application Students who choose this project can work in teams of up to 3 persons. You will choose your own area of interest and data set, select an appropriate method learned in this class to apply to the data, and show the results (benefits, improvements, or drawbacks). Each group should submit a project report. The project report should roughly include: 1. Motivation of the project. 2. Existing approaches. 3. The method you chose and the reason, or the model you created for the specific problem and its framework. 4. Experimental studies and conclusions. The presentation requirement is the same as above, and each group member should describe his/her contribution to the project. • There will be one take-home mid-term exam. One data set will be given, along with several questions related to the methods discussed in class. You need to solve the problems by yourself and submit a complete report including the methods you used, the results of applying them, and the interpretation of the results. POLICIES: Attendance & Assignment Policies: You are expected to attend all classes and to turn in homework sets, the take-home exam, and the final report by the due dates.
Late submissions will NOT BE ACCEPTED. While discussion/study groups are encouraged, you are expected to do your own work on homework problems that are turned in. Withdrawal Policy…The last day to withdraw from the course and possibly receive a "W" is __________________. Students who find that they cannot continue in college for the entire semester after being enrolled, because of illness or any other reason, need to complete an online form. To completely or partially withdraw from classes at KSU, a student must withdraw online at www.kennesaw.edu, under Owl Express, Student Services.
  12. 12. The date the withdrawal is submitted online will be considered the official KSU withdrawal date which will be used in the calculation of any tuition refund or refund to Federal student aid and/or HOPE scholarship programs. It is advisable to print the final page of the withdrawal for your records. Withdrawals submitted online prior to midnight on the last day to withdraw without academic penalty will receive a “W” grade. Withdrawals after midnight will receive a “WF”. Failure to complete the online withdrawal process will produce no withdrawal from classes. Call the Registrar’s Office at 770-423-6200 during business hours if assistance is needed. Students may, by means of the same online withdrawal and with the approval of the university Dean, withdraw from individual courses while retaining other courses on their schedules. This option may be exercised up until _______________ . This is the date to withdraw without academic penalty for Fall Term, 2006 classes. Failure to withdraw by the date above will mean that the student has elected to receive the final grade(s) earned in the course(s). The only exception to those withdrawal regulations will be for those instances that involve unusual and fully documented circumstances. Academic Integrity: Each student is responsible for upholding the provisions of the Student Code of Conduct, as published in the Undergraduate and Graduate Catalogs. For any questions involving these or any other Academic Honor Code issues, please consult http://www.kennesaw.edu/judiciary/code.conduct.shtml
  13. 13. WEEKLY SCHEDULE (Subject to change)
      Week    | Lect # | Topic | Notes
      week 1  | 1  | Topic 1. Introduction | Syllabus
              | 2  | Topic 2. Data (Sampling & Cleaning) |
      week 2  | 3  | Topic 2. Data (Descriptive Statistics) |
              | 4  | Topic 2. Data (Curse of Dimensionality) | Final Project Initiation Due
      week 3  | 5  | Topic 3. Regression (Multiple Linear Regression) |
              | 6  | Topic 3. Regression (Generalized) |
      week 4  | 7  | Discussion |
              | 8  | Topic 4. Classification (Linear Method: LDA) |
      week 5  | 9  | Topic 4. Classification (Linear Method: Logistic Regression) |
              | 10 | Topic 4. Classification (Nearest Neighbor) |
      week 6  | 11 | Topic 4. Classification (Bayesian classifiers) |
              | 12 | Topic 4. Classification (Overfitting & Model Evaluation) |
      week 7  | 13 | Topic 4. Classification (Ensemble Method and Class Imbalance Problem) | Midterm assigned
              | 14 | Discussion |
      week 8  | 15 | Topic 5. Cluster Analysis (Center-based) |
              | 16 | Topic 5. Cluster Analysis (Hierarchical) |
      week 9  | 17 | Topic 5. Cluster Analysis (Density-based) | Midterm report due
              | 18 | Topic 5. Cluster Analysis (Cluster Validation) |
      week 10 | 19 | Discussion |
              | 20 | Topic 6. Association Analysis (Apriori) |
      week 11 | 21 | Topic 6. Association Analysis (Maximal, Closed & FP-growth) |
              | 22 | Topic 6. Association Analysis (Pattern Evaluation) |
      week 12 | 23 | Topic 6. Association Analysis (Continuous, Categorical, Concept Hierarchies) |
              | 24 | Discussion |
      week 13 | 25 | Discussion |
              | 26 | Project Presentation | Final project report due
  14. 14. week 14 Project Presentation Project Presentation week 15 Project Presentation Project Presentation KENNESAW STATE UNIVERSITY UNDERGRADUATE PROPOSAL New Course (NOT General Education) Course Prefix and Number: STAT 4310     Responsible Department: Department of Mathematics and Statistics     Proposed Effective Date:       Signature Page Submitted by:       Name Xuelei (Sherry) Ni Date _____________________________________________________________ Name Jennifer Lewis Priestley Date       Approved       Not Approved Department Curriculum Committee Date       Approved       Not Approved General Education Council* Date       Approved       Not Approved Professional Teacher Education Unit Program Area* Date       Approved       Not Approved Department Chair Date       Approved       Not Approved College/School Curriculum Committee AND/OR Teacher Education Council* Date       Approved       Not Approved College/School Dean Date       Approved       Not Approved Undergraduate Policies and Curriculum Committee Date       Approved       Not Approved Dean of University College Date *For curriculum proposals involving General Education courses, there should be collaboration by the Department Curriculum Committee and the General Education Council. For Teacher Preparation proposals, there should be collaboration by the Department Curriculum Committee, the Professional Teacher Education Unit (PTEU) Program Area Committee, the Teacher Education Council, and the College/School Curriculum Committee. 14
