- 1. Data Warehouse and Data Mining BY Dr. ANUPAM GHOSH Date: 17.01.23 Email: anupam.ghosh@rediffmail.com https://vidwan.inflibnet.ac.in/profile/319457 Academic Profile: https://www.nsec.ac.in/fps/faculty.php?id=138 Research Profile: https://www.researchgate.net/profile/Anupam-Ghosh-5 Professional Profile: https://www.linkedin.com/in/anupam-ghosh-1504273b/?originalSubdomain=in
- 2. Data Mining: A KDD Process. Data mining is the core of the knowledge discovery process. Figure labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Data Preprocessing, Task-relevant Data, Data Mining, Pattern Evaluation
- 3. Data Mining and Business Intelligence. Increasing potential to support business decisions. Roles: End User, Business Analyst, Data Analyst, DBA. Layers (top to bottom): Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts (OLAP, MDA); Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)
- 4. Data Mining: Confluence of Multiple Disciplines. Data Mining draws on: Database Technology, Statistics, Machine Learning, Visualization, Information Science, Other Disciplines
- 5. Clustering • Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster • Can be formalized using distance metrics in several ways – Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized • Centroid: point defined by taking average of coordinates in each dimension. – Another metric: minimize average distance between every pair of points in a cluster • Has been studied extensively in statistics, but on small data sets – Data mining systems aim at clustering techniques that can handle very large data sets – E.g., the Birch clustering algorithm (more shortly)
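
The first objective on the clustering slide (average distance of points from the centroid of their assigned group) can be illustrated with a minimal sketch. The function names and the four sample points are hypothetical, chosen only for illustration; this evaluates a given grouping and is not a clustering algorithm such as Birch.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Centroid: average of the coordinates in each dimension."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def avg_distance_to_centroids(clusters):
    """Average distance of every point from the centroid of its own cluster."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(dist(p, c) for p in pts)
        count += len(pts)
    return total / count

# Two ways of grouping the same four (made-up) points into k = 2 sets.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]   # nearby points together
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]    # distant points together

print(avg_distance_to_centroids(good))  # smaller value: tighter clusters
print(avg_distance_to_centroids(bad))
```

A clustering method that uses this metric searches for the grouping with the smallest value; here the first grouping scores lower than the second.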
- 6. Classification
- 7. Supervised Classification Training samples are labeled
- 8. Classification • Data mining is the process of semi-automatically analyzing large databases to find useful patterns • Prediction based on past history • Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history • Predict if a pattern of phone calling card usage is likely to be fraudulent • Some examples of prediction mechanisms: • Classification • Given a new item whose class is unknown, predict to which class it belongs • Regression formulae • Given a set of mappings for an unknown function, predict the function result for a new parameter value
- 9. Linear Regression ❑ Linear regression and modelling problems are presented along with their solutions. ❑ If the plot of n pairs of data (x, y) for an experiment appears to indicate a "linear relationship" between y and x, then the method of least squares may be used to write a linear relationship between x and y. ❑ Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x).
- 10. ▶ The least squares regression line for the set of n data points is given by the equation of a line in slope-intercept form: ▶ y = a x + b
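
The slide gives the form of the line but not how a and b are obtained. As a sketch of the standard least squares result (not taken from the slides), the coefficients minimize the sum of squared vertical errors over the n data points (x_i, y_i):

```latex
S(a, b) = \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr)^2

% Setting \partial S / \partial a = 0 and \partial S / \partial b = 0 gives
a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \bigl(\sum x_i\bigr)^2},
\qquad
b = \frac{\sum y_i - a \sum x_i}{n}
```

All sums run over i = 1 to n, and the formula for a requires that the x values are not all equal.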
- 11. Troubleshooting: Problem 1 Consider the following set of points: {(-2, -1), (1, 1), (3, 2)} a) Find the least squares regression line for the given data points. b) Plot the given points and the regression line in the same rectangular system of axes.
- 12. Problem 2 a) Find the least squares regression line for the following set of data: {(-1, 0), (0, 2), (1, 4), (2, 5)} b) Plot the given points and the regression line in the same rectangular system of axes.
- 13. Problem 3 ▶ The values of x and their corresponding values of y are shown in the table below.

  | x | y |
  |---|---|
  | 0 | 2 |
  | 1 | 3 |
  | 2 | 5 |
  | 3 | 4 |
  | 4 | 6 |

  a) Find the least squares regression line y = a x + b. b) Estimate the value of y when x = 10.
- 14. Problem 4 ▶ The sales of a company (in million dollars) for each year are shown in the table below.

  | x (year) | y (sales) |
  |---|---|
  | 2005 | 12 |
  | 2006 | 19 |
  | 2007 | 29 |
  | 2008 | 37 |
  | 2009 | 45 |

  ▶ a) Find the least squares regression line y = a x + b. ▶ b) Use the least squares regression line as a model to estimate the sales of the company in 2012.
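
Problems 1 to 4 can be checked numerically. The sketch below assumes the standard closed-form least squares solution (a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²), b = (Σy − aΣx) / n), which the slides do not state explicitly. The helper name `fit_line` is hypothetical, and the results are computed here rather than quoted from the deck. Exact fractions are used so the large year values in Problem 4 cause no rounding issues.

```python
from fractions import Fraction

def fit_line(points):
    """Return (a, b) of the least squares line y = a x + b as exact fractions."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    a = Fraction(n * sxy - sx * sy, n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                               # intercept
    return a, b

# Data sets copied from the four problem slides.
problems = {
    "Problem 1": [(-2, -1), (1, 1), (3, 2)],
    "Problem 2": [(-1, 0), (0, 2), (1, 4), (2, 5)],
    "Problem 3": [(0, 2), (1, 3), (2, 5), (3, 4), (4, 6)],
    "Problem 4": [(2005, 12), (2006, 19), (2007, 29), (2008, 37), (2009, 45)],
}

for name, pts in problems.items():
    a, b = fit_line(pts)
    print(f"{name}: y = {float(a):.4f} x + {float(b):.4f}")

# Part b) of Problems 3 and 4: evaluate the fitted line at a new x.
a3, b3 = fit_line(problems["Problem 3"])
a4, b4 = fit_line(problems["Problem 4"])
print("Problem 3, x = 10:", float(a3 * 10 + b3))
print("Problem 4, x = 2012:", float(a4 * 2012 + b4))
```

Under these assumptions, Problem 3 yields y = 0.9 x + 2.2 with an estimate of 11.2 at x = 10, and Problem 4 yields a slope of 8.4 with an estimate of 70.4 million dollars for 2012.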
- 15. Decision Theory Supervised Learning
- 16. Which Attribute is "best"? We would like to select the attribute that is most useful for classifying examples. • Information gain measures how well a given attribute separates the training examples according to their target classification. • ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree. • In order to define information gain precisely, we use a measure commonly used in information theory, called entropy. • Entropy characterizes the (im)purity of an arbitrary collection of examples.
- 17. Information Theory: ID3 (Iterative Dichotomiser 3) ❖ The ID3 algorithm was invented by Ross Quinlan and uses information gain as its attribute selection measure ❖ This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages ❖ Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N ❖ This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions ❖ Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes $C_i$ (here i = 1 to m). The expected information needed to classify a tuple in D is given by $\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$, where $p_i = s_i/s$, s = no. of samples in D, and $s_i$ = no. of samples in class $C_i$. Info(D) is also known as the entropy of D
- 18. ID3--Continued Suppose we were to partition the tuples in D on some attribute A having v distinct values, $[a_1, a_2, \ldots, a_v]$, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, $[D_1, D_2, \ldots, D_v]$, where $D_j$ contains those tuples in D that have outcome $a_j$ of A. The expected information required to classify a tuple from D based on the partitioning by A is $\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$. Here, $|D_j|/|D|$ acts as the weight of the jth partition, and $\mathrm{Info}(D_j) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$ with $p_{ij} = s_{ij}/|D_j|$, where $s_{ij}$ = no. of samples that belong to class $C_i$ and have the attribute value $a_j$
- 19. ID3--Continued Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A): $\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$. In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
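
The attribute selection step described on the ID3 slides can be sketched as follows. The four-row data set and the attribute names (`student`, `credit`, `buys`) are hypothetical toy values, not the training table from the deck, and the function names are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): -sum(p_i * log2(p_i)), written as sum(p_i * log2(1 / p_i))."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def info_after_split(rows, attr, target):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy([r[target] for r in rows]) - info_after_split(rows, attr, target)

# Hypothetical toy data: 'student' determines the class, 'credit' does not.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]

for attr in ("student", "credit"):
    print(attr, gain(rows, attr, "buys"))

# The splitting attribute is the one with the highest gain.
best = max(("student", "credit"), key=lambda a: gain(rows, a, "buys"))
print("split on:", best)
```

In this toy data, splitting on `student` removes all uncertainty about the class while `credit` removes none, so `student` would be chosen at the node.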
- 20. Problem statement: Find out Test Attribute
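- 21. Solution: $\mathrm{Entropy}(S_{\text{youth}}) = -\sum_{i=1}^{2} p_{i1} \log_2 p_{i1} = -p_{11}\log_2 p_{11} - p_{21}\log_2 p_{21} = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$. Here, $p_{11} = s_{11}/|D_1| = 2/5$ and $p_{21} = s_{21}/|D_1| = 3/5$. For hand calculation, $\log_2 X = \log_{10} X / \log_{10} 2$. $\mathrm{Entropy}(S_{\text{middle}}) = -\sum_{i=1}^{2} p_{i2} \log_2 p_{i2} = -p_{12}\log_2 p_{12} - p_{22}\log_2 p_{22} = -\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4} = 0$, taking $0 \log_2 0 = 0$ by convention. Here, $p_{12} = s_{12}/|D_2| = 4/4$ and $p_{22} = s_{22}/|D_2| = 0/4$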
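
The two entropy values in the solution can be reproduced from the class counts alone (2 and 3 for the youth partition, 4 and 0 for the middle partition). This is a minimal sketch; the function name is hypothetical, and empty classes are skipped, which corresponds to treating 0 · log2(0) as 0.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy of a partition given the number of samples in each class."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (0 * log2(0) taken as 0).
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

youth = entropy_from_counts([2, 3])    # p11 = 2/5, p21 = 3/5
middle = entropy_from_counts([4, 0])   # p12 = 4/4, p22 = 0/4

print(round(youth, 3))   # 0.971
print(round(middle, 3))  # 0.0
```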
- 22. Decision Tree X = (age = youth, income = medium, student = yes, credit = fair) Class label=?