
Elementary Concepts of Data Mining



Mathematical analysis of Graph and Huffman coding

Published in: Technology, Education


  1. Elementary Concepts: Data Mining Technology. Anjan.K, II Sem M.Tech CSE, M.S.R.I.T
  2. Agenda
       • Need for Dimensionality Reduction
       • PCA revisited
       • Data Mining elementary concepts
       • Hands-On Problem: Q3
       • Potter’s Wheel: Data Cleaning Tool
  3. Need for Dimensionality Reduction
       • Data is easy to collect, and it accumulates at an unprecedented speed.
       • Data is not collected only for data mining.
       • Data preprocessing is an important part of effective machine learning and data mining.
       • Dimensionality reduction is an effective approach to downsizing data.
  4. Dimensionality Reduction?
       • Learning and data mining techniques may not be effective on high-dimensional data because of its dimensionality (the "curse of dimensionality").
       • Query accuracy and efficiency degrade rapidly as the dimension increases.
       • Visualization: projection of high-dimensional data onto 2D or 3D.
       • Data compression: efficient storage and retrieval.
       • Noise removal: positive effect on query accuracy.
  5. Principal Component Analysis
       • PCA is a statistical technique used in face recognition and image compression; it is an unsupervised, linear algorithm.
       • It is a common technique for finding patterns in high-dimensional data, for example mining for the principal components of an image.
       • It reduces the dimensionality of a data set by finding a new set of variables that is smaller than the original set.
       • The new set retains most of the sample's information. Example: a high-resolution image transformed into a low-resolution image.
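A minimal sketch of the idea on the PCA slide, using only the Python standard library. The slides do not name a numerical method; power iteration on the sample covariance matrix is an assumption made here for brevity, and the names `first_principal_component` and `project` are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, variance, means) for the first principal component.

    `rows` is a list of equal-length numeric lists (one list per sample).
    Power iteration is an illustrative choice, not the slides' method.
    """
    n = len(rows)
    d = len(rows[0])
    # Center every variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector has no component along any PC")
        v = [x / norm for x in w]
    # Variance captured along v is the quadratic form v^T cov v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance, means


def project(rows, direction, means):
    """Reduce each sample to one coordinate along `direction`."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(means)))
            for r in rows]


# Four 2-D points that lie exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, variance, means = first_principal_component(data)
reduced = project(data, direction, means)  # one number per sample
```

Because the sample points lie on a line, the single retained coordinate loses no information here; on real data the discarded components carry the remaining variance.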
  6. Geometric Picture of Principal Components (PCs)
  7. Algebraic Derivation of PCs
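The body of this slide did not survive in the transcript. As a hedged reconstruction, the standard textbook derivation of the first PC is sketched below. The symbols are assumptions, not the author's notation: $x$ is a mean-centered data vector, $\Sigma$ its covariance matrix, $w$ a unit direction, and $\lambda$ a Lagrange multiplier.

$$
\begin{aligned}
&\max_{w}\ \operatorname{Var}(w^{\top}x) = w^{\top}\Sigma\,w
\quad\text{subject to}\quad w^{\top}w = 1,\\
&\mathcal{L}(w,\lambda) = w^{\top}\Sigma\,w - \lambda\,(w^{\top}w - 1),\\
&\frac{\partial\mathcal{L}}{\partial w} = 2\,\Sigma\,w - 2\,\lambda\,w = 0
\;\Longrightarrow\; \Sigma\,w = \lambda\,w,\\
&\operatorname{Var}(w^{\top}x) = w^{\top}\Sigma\,w = \lambda\,w^{\top}w = \lambda .
\end{aligned}
$$

Under these assumptions the first PC is the eigenvector of $\Sigma$ with the largest eigenvalue, and that eigenvalue is the variance it captures. Each later PC solves the same problem with the added constraint of being orthogonal to the earlier ones, which yields the eigenvectors in order of decreasing eigenvalue.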
  8. Knowledge Discovery (KDD) Process
       • Data mining is the core of the knowledge discovery process.
       • Figure (flow): Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation
  9. Data Mining: Confluence of Multiple Disciplines
       • Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, and Other Disciplines
  10. Question 3 (1.3 of Chap. 1 of Han & Kamber)
       • Suppose your task as a software engineer at Big University is to design a data mining system to examine the university course database, which contains the following information: the name, address, status, courses taken, and cumulative grade point average (GPA) of each student. Describe the architecture you would choose. What is the purpose of each component of this architecture?
  11. Proposed Data Mining Technology
  12. Data Mining System
       • Figure (architecture diagram), components: College DB, Exam DB, University DB, Back Office Systems, University System Warehouse, OLAP Tools, Data Mining, Pattern Evaluation, Response Attribution, Graphical Interface
  13. Potter’s Wheel
       • Problems of conventional approaches:
           • Time consuming (many iterations), long waiting periods
           • Users have to write complex transformation scripts
           • Separate tools for auditing and transformation
       • Potter’s Wheel approach:
           • Interactive system, instant feedback
           • Integration of both data auditing and transformation
           • Intuitive user interface: a spreadsheet-like application
  14. Potter’s Wheel
  15. Potter’s Wheel: Features
       • Instead of writing complex transform specifications with regular expressions or custom programs, the user specifies transforms by example (e.g. splitting).
       • Data auditing is extensible with user-defined domains.
           • Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]{3} to [A-Z]{3} on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*".
           • This allows easier detection of, for example, logical errors such as false airport codes.
           • Problem: tradeoff between overfitting and underfitting the structure.
           • Potter’s Wheel uses the Minimum Description Length method to balance this tradeoff and choose an appropriate structure.
       • Data auditing runs in the background on the fly (data streaming is also possible).
       • The reorderer allows sorting on the fly.
       • The user works only on a view; the real data isn’t changed until the user exports the set of transforms, e.g. as a C program, and runs it on the real data.
       • Undo without problems: just delete the unwanted transform from the sequence and redo everything else.
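The difference between a purely syntactic pattern and a structure built from domains can be sketched as follows. This is not Potter’s Wheel code or its API: the `AIRPORTS`, `CLASSES`, and `MONTHS` tables, the `RECORD` pattern, and the `audit` function are all hypothetical stand-ins for user-defined domains, and the airport list is a tiny illustrative subset.

```python
import re
from datetime import date

# Hypothetical user-defined domains (illustrative subsets only).
AIRPORTS = {"JFK", "ORD", "SFO", "LAX"}
CLASSES = {"Coach", "Business", "First"}
MONTHS = {name: i + 1 for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

# Syntactic structure: this alone accepts any three capital letters
# as an airport and any word as a month.
RECORD = re.compile(
    r"(?P<name>[A-Za-z, ]*?),\s+"
    r"(?P<origin>[A-Z]{3}) to (?P<dest>[A-Z]{3}) on "
    r"(?P<month>[A-Za-z]+) (?P<day>[0-9]+), (?P<year>[0-9]+) "
    r"(?P<cls>[A-Za-z]+)"
)


def audit(record):
    """Return a list of discrepancies found in one record."""
    m = RECORD.fullmatch(record)
    if m is None:
        return ["record does not fit the structure"]
    problems = []
    # Domain checks catch logical errors that the regex lets through.
    for field in ("origin", "dest"):
        if m[field] not in AIRPORTS:
            problems.append(f"unknown airport code: {m[field]}")
    date_text = f"{m['month']} {m['day']}, {m['year']}"
    try:
        date(int(m["year"]), MONTHS[m["month"]], int(m["day"]))
    except (KeyError, ValueError):
        problems.append(f"invalid date: {date_text}")
    if m["cls"] not in CLASSES:
        problems.append(f"unknown class: {m['cls']}")
    return problems


ok = audit("Tayler, Jane, JFK to ORD on April 23, 2000 Coach")
bad_airport = audit("Tayler, Jane, JFK to XYZ on April 23, 2000 Coach")
bad_date = audit("Tayler, Jane, JFK to ORD on April 31, 2000 Coach")
```

In this sketch "XYZ" and "April 31" both satisfy the regular expression, so only the domain checks flag them, which is the kind of logical error the slide refers to.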
  16. Potter’s Wheel: Conclusion
       • Problems:
           • Usability of the user interface
           • How does duplicate elimination work?
           • Kind of a black-box system
       • General open problems of data cleaning:
           • (Automatic) correction of wrong values
           • Mask wrong values but keep them
           • Keep several possible values at the same time (2*age, 2*birthday); this leads to problems if other values depend on a certain alternative and that alternative turns out to be wrong
           • Maintenance of cleaned data, especially if sources can’t be cleaned
           • A data cleaning framework is desirable