Mining Frequent Patterns Without Candidate Generation
  • The first technology I like to … The above picture is, in my opinion, a good description of the task of knowledge discovery: it illustrates a huge search space that contains very few interesting things. Applied in practice, KDD is frequently like finding a needle in a haystack, except that you are not sure what you are looking for...

Transcript

  • 1. Introduction --- Part 2
    • Another Introduction to Data Mining
    • Course Information
  • 2. Knowledge Discovery in Data [and Data Mining] (KDD)
    • Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
    • Frequently, the term data mining is used to refer to KDD.
    • Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html )
    • Field is more dominated by industry than by research institutions
    Let us find something interesting!
  • 3. Motivation: “Necessity is the Mother of Invention”
    • Data explosion problem
      • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
    • We are drowning in data, but starving for knowledge!
    • Solution: Data warehousing and data mining
      • Data warehousing and on-line analytical processing (“analyzing and mining the raw data rarely works”)
      • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 4. [Figure slide; recoverable text: “What’s New?”, “What’s Interesting?”, “Predict for me”; “YAHOO!’s View of Data Mining”, http://www.sigkdd.org/kdd2008/; “ACME CORP ULTIMATE DATA MINING BROWSER”]
  • 5. Data Mining: A KDD Process
      • Data mining: the core of the knowledge discovery process.
    [Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
  • 6. Steps of a KDD Process
    • Learning the application domain:
      • relevant prior knowledge and goals of application
    • Creating a target data set: data selection
    • Data cleaning and preprocessing
    • Data reduction and transformation (the first 4 steps may take 75% of effort!):
      • Find useful features, dimensionality/variable reduction, invariant representation.
    • Choosing functions of data mining
      • summarization, classification, regression, association, clustering.
    • Choosing the mining algorithm(s)
    • Data mining: search for patterns of interest
    • Pattern evaluation and knowledge presentation
      • visualization, transformation, removing redundant patterns, etc.
    • Use of discovered knowledge
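The steps above can be sketched as a small pipeline. This is only an illustrative toy, not the course's method: the records, the function names (`clean`, `transform`, `mine`, `evaluate`), and the 0.75 support threshold are all assumptions made up for the example, and the "mining" step is simply counting frequent single items.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, list of purchased items).
raw_records = [
    (1, ["bread", "milk"]),
    (2, ["bread", "diapers", "beer", None]),   # None simulates a missing value
    (3, ["milk", "diapers", "beer"]),
    (3, ["milk", "diapers", "beer"]),          # exact duplicate record
    (4, ["Bread", "milk", "diapers"]),         # inconsistent capitalization
]

def clean(records):
    """Data cleaning: drop missing values and exact duplicate records."""
    seen, cleaned = set(), []
    for cid, items in records:
        items = tuple(i for i in items if i is not None)
        if (cid, items) not in seen:
            seen.add((cid, items))
            cleaned.append((cid, items))
    return cleaned

def transform(records):
    """Data transformation: normalize item names, keep only the item sets."""
    return [frozenset(i.lower() for i in items) for _, items in records]

def mine(transactions, min_support):
    """Data mining: find single items whose support meets the threshold."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

def evaluate(patterns):
    """Pattern evaluation: rank patterns by support, then by name."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))

transactions = transform(clean(raw_records))
ranked = evaluate(mine(transactions, min_support=0.75))
print(ranked)  # [('bread', 0.75), ('diapers', 0.75), ('milk', 0.75)]
```

Even in this toy, most of the code is cleaning and transformation rather than mining, which is in the spirit of the remark that the early steps may take most of the effort.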
  • 7. Data Mining and Business Intelligence
    [Figure: layers listed from top to bottom; the potential to support business decisions increases toward the top]
    • Making Decisions
    • Data Presentation: Visualization Techniques
    • Data Mining: Information Discovery
    • Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
    • Data Warehouses / Data Marts
    • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
    • Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA
  • 8. Are All the “Discovered” Patterns Interesting?
    • A data mining system/query may generate thousands of patterns, but not all of them are interesting.
      • Suggested approach: Human-centered, query-based, focused mining
    • Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, or novel, or if it validates some hypothesis that a user seeks to confirm
    • Objective vs. subjective interestingness measures:
      • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
      • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
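As a concrete illustration of the objective measures named above, the following sketch computes support and confidence for an association rule over a handful of transactions. The transactions and the helper names (`count`, `support`, `confidence`) are assumptions invented for this example; the definitions used are the standard ones (support is the fraction of transactions containing an itemset, confidence of A ⇒ B is the fraction of transactions containing A that also contain B).

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    n_ante = count(transactions, antecedent)
    if n_ante == 0:
        return 0.0
    n_both = count(transactions, frozenset(antecedent) | frozenset(consequent))
    return n_both / n_ante

# Hypothetical market-basket data, made up for illustration.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
]

# Rule under test: {diapers} => {beer}
print(support(transactions, {"diapers", "beer"}))       # 3 of 5 transactions
print(confidence(transactions, {"diapers"}, {"beer"}))  # 3 of the 4 with diapers
```

Both numbers depend only on the data, which is what makes them objective; whether the rule is unexpected or actionable for a given user is the subjective part and cannot be computed this way.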
  • 9. Data Mining: Confluence of Multiple Disciplines
    [Figure: Data Mining at the center, drawing on the following disciplines]
    • Database Technology
    • Statistics
    • Information Science
    • Machine Learning
    • Visualization
    • Other Disciplines
  • 10. Data Mining Competitions
    • Netflix Prize: http://www.netflixprize.com//index
    • KDD Cup 2009: http://www.kddcup-orange.com/
  • 11. Summary
    • Data mining: discovering interesting patterns from large amounts of data
    • A natural evolution of database technology, in great demand, with wide applications
    • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
    • Mining can be performed in a variety of information repositories
    • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
    • Classification of data mining systems
    • Major issues in data mining
  • 12. COSC 6335 in a Nutshell
    [Figure: course overview diagram]
    • Stages: Preprocessing, Data Mining, Post Processing
    • Topics: Association Analysis, Pattern Evaluation, Clustering, Visualization, Summarization, Classification & Prediction
  • 13. Prerequisites
    • The course is basically self-contained; however, the following skills are important for success in this course:
    • Basic knowledge of programming
    • Java and data mining tools will be used in the programming projects
    • Basic knowledge of statistics
    • Basic knowledge of data structures
  • 14. Course Objectives
    • will know what the goals and objectives of data mining are
    • will have a basic understanding of how to conduct a data mining project
    • will obtain practical experience in data analysis and making sense out of data
    • will have sound knowledge of popular classification techniques, such as decision trees, support vector machines and nearest-neighbor approaches.
    • will know the most important association analysis techniques
    • will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, grid-based, hierarchical and supervised clustering.
    • will know about software environments and design for data mining
    • will obtain practical experience in designing data mining algorithms and in applying data mining techniques to real world data sets.
    • will have some exposure to more advanced topics, such as sequence mining, data streams, and spatial data mining.
  • 15. Data Mining Course Organization
    • I Introduction to Data Mining and Data Mining Basics (Chapter 1 and 2.1)
    • II Exploratory Data Analysis (Chapter 3)
    • III Introduction to Classification --- Basic Concepts and Decision Trees (Chapter 4)
    • IV Introduction to Similarity Assessment and Clustering (Other material 2.3 and Chapter 8 in part)
    • V Introduction to Data Cubes (Section 3.4)
    • VI Association Analysis (Chapter 6 and Chapter 7 in part)
    • VII Data Preprocessing (Chapter 2 and other material)
    • VIII More on Clustering (Chapter 8 and Chapter 9 in part)
    • IX Spatial Data Mining
    • X Software Design and Software Engineering for Knowledge Discovery Projects
    • XI More on Classification: Instance-based Learning and Support Vector Machines (Chapter 5)
    • XII PageRank and other Top 10 Data Mining Algorithms
    • XIII Final Words
  • 16. Order of Coverage
    • Introduction → Data → Exploratory Data Analysis → Classification → Similarity Assessment → Clustering → OLAP and Data Warehousing → Association Analysis → Preprocessing → More on Clustering → Sequence and Graph Mining → Spatial Data Mining → More on Classification → Mining Data Streams → Summary
    • Also: Software Design for Data Mining (likely second half of September)
  • 17. Where to Find References?
    • Data mining and KDD
      • Conference proceedings: ICDM, KDD, PKDD, PAKDD, SDM, ADMA, etc.
      • Journal: Data Mining and Knowledge Discovery
    • Database field (SIGMOD member CD ROM):
      • Conference proceedings: VLDB, ICDE, ACM-SIGMOD, CIKM
      • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
    • AI and Machine Learning:
      • Conference proceedings: ICML, AAAI, IJCAI, ECML, etc.
      • Journals: Machine Learning, Artificial Intelligence, etc.
    • Statistics:
      • Conference proceedings: Joint Stat. Meeting, etc.
      • Journals: Annals of Statistics, etc.
    • Visualization:
      • Conference proceedings: CHI, etc.
      • Journals: IEEE Trans. Visualization and Computer Graphics, etc.
  • 18. Textbooks
      • Required Text: P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley (link to book home page)
      • Mildly Recommended Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, second edition (link to data mining book home page)
  • 19. Tentative Schedule
    • Exams: October 15, December 3
    • Reviews: September 24, October 13,…
    • Plan First Half of the Fall 2009 Semester:
    • Aug. 25+27+Sept 1: Introduction to DM
    • September 1+3+8: Exploratory Data Analysis
    • September 10+17: Lab (Java, Cougar^2, SD)
    • September 8+15+22: Classification I
    • September 22+24+29+October 1: Clustering I
    • October 3: OLAP and Data Cubes
    • October 8+10+13: Association Analysis
  • 20. 2009 Assignments
    • Assignment 1: Getting Familiar with Cougar^2 (please attend the lab classes on September 10 and 17)
    • Assignment 2: Exploratory Data Analysis
    • Assignment 3: Making Sense of Data using Traditional and Clustering with Plug-in Fitness Functions
    • Assignment 4: Review for Midterm Exam (contains paper and pencil questions covering classification, clustering, and association analysis)
    • Assignment 5: TBDL (will require programming)
    • Assignment 6: Preparation for the Final Exam (contains paper and pencil questions covering preprocessing, outlier detection, advanced clustering and classification, and sequence mining)
  • 21. TA/Students of my Research Group:
    • Duties:
      • Grading of programming projects, homework, and exams (in part)
      • Run 2/3 labs
      • Help students with homework, programming projects and problems with the course material
      • Teach a class (rare)
    • Office:
    • Office Hours:
    • E-mail:
    • Meet our TA: next week
  • 22. Web
    • Course Webpage ( http://www2.cs.uh.edu/~ceick/DM/DM09.html )
    • UH-DMML Webpage ( http://www.tlc2.uh.edu/dmmlg )
  • 23. Teaching Philosophy and Advice
    • The first 8 weeks will give a basic introduction to data mining and follow the textbook somewhat closely.
    • Read the sections of the textbook before you come to the lecture; if you work continuously for the class you will do better and lectures will be more enjoyable. Starting to review the material that is covered in this class 1 week before the next exam is not a good idea.
    • Do not be afraid to ask questions! I really like interactions with students in the lectures… If you do not understand something at all, send me an e-mail before the next lecture!
    • If you have a serious problem, talk to me before the problem gets out of hand.
  • 24. Course Planning for Research in Data Mining
    • This course “Data Mining”
    • I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368), and Machine Learning (COSC 6342).
    • Moreover, having basic knowledge in data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 and COSC 6340 is a good choice.
    • Moreover, taking a course that teaches high performance computing is also a good choice, because data mining algorithms are often very time-consuming.
    • Because a lot of data mining projects have to deal with images, I suggest taking at least one of the many biomedical image processing courses that are offered in our curriculum.
    • Finally, knowledge of evolutionary computing, software engineering, data visualization, statistics, optimization, and GIS (geographical information systems) is a plus!