Q.1. Why is Data Preprocessing required?
Ans.
The real-world data that are to be analyzed by data mining techniques are often:
1. Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
2. Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields such as dates. Techniques are therefore needed to smooth or replace the noisy data.
3. Inconsistent: containing discrepancies between different data items. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. Naming inconsistencies may also occur for attribute values. These inconsistencies need to be removed.
Preprocessing is also required for the following reasons:
4. Aggregate information: It would be useful to obtain aggregate information such as the sales per customer region, something that is
not part of any pre-computed data cube in the data warehouse.
5. Enhancing the mining process: A large volume of data may make the data mining process slow. Hence, reducing the amount of data to be mined is important for the performance of the mining process.
6. Improving data quality: Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Q.2. What are the different forms of Data Preprocessing?
Ans.
Data Cleaning: Data cleaning routines work to “clean” the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to it. Also, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although mining routines may have some procedures for dealing with incomplete or noisy data, these are not always robust. Therefore, running the data through some data-cleaning routines is a useful preprocessing step.
Data Integration: Data integration involves integrating data from multiple databases, data cubes, or files. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. Also, some attributes may be inferred from others (e.g., annual revenue).
Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
Data Transformation: Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.
Normalization: Normalization scales the data to be analyzed to a specific range, such as [0.0, 1.0], so that attributes with large value ranges do not outweigh attributes with small ones.
Aggregation: It would also be useful for data analysis to obtain aggregate information such as the sales per customer region. As this is not part of any pre-computed data cube, it would need to be computed. This process is called aggregation.
Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction:
1. Data aggregation (e.g., building a data cube).
2. Attribute subset selection (e.g., removing irrelevant attributes through correlation analysis).
3. Dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets).
4. Numerosity reduction (e.g., “replacing” the data by alternative, smaller representations such as clusters or parametric models).
5. Generalization with the use of concept hierarchies, by organizing the concepts into varying levels of abstraction. Data discretization is very useful for the automatic generation of concept hierarchies from numerical data.
Q.3. Explain the various techniques of Descriptive Data Summarization.
Ans.
Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers.
1. Measuring the Central Tendency:
Measures of central tendency include the mean, median, mode, and midrange.
Mean: Let x1, x2, ..., xN be a set of N values or observations, such as for some attribute, like salary. The mean of this set of values is
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
This corresponds to the built-in aggregate function average (avg() in SQL) provided in relational database systems.
There are three kinds of measures:
A distributive measure can be computed by partitioning the data, computing the measure on each part, and merging the results. Examples are sum(), count(), max(), and min().
An algebraic measure can be computed by applying an algebraic function to one or more distributive measures. Examples are the average (or mean()), which is sum()/count(), and the weighted average.
A holistic measure, like the median, must be computed on the entire data set as a whole.
Weighted arithmetic mean (or weighted average): Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute
\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} \]
Median: Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, then the median is the middle value of the ordered
set; otherwise (i.e., if N is even), the median is the average of the middle two values.
Mode: The mode for a set of data is the value that occurs most frequently in the set. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Skew: In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value. However, data in most real applications are not symmetric. They may instead be either positively skewed (the mode is smaller than the median) or negatively skewed (the mode is greater than the median).
Midrange: The midrange can also be used to assess the central tendency of a data set. It is the average of the largest and smallest values in the set. This algebraic measure is easy to compute using the SQL aggregate functions max() and min().
Measuring the Dispersion of Data:
The most common measures of data dispersion are the range, the five-number summary, the interquartile range,
and the standard deviation. Boxplots can also be plotted to visualize these measures.
Range: The range is the difference between the maximum and minimum values.
Quartiles: The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The quartiles, including the median, give some indication of the center, spread, and shape of a distribution.
IQR: The interquartile range (IQR) is defined as IQR = Q3 - Q1.
Five-number summary: The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum.
Boxplots: Boxplots are a popular way of visualizing a distribution. Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
Variance and Standard Deviation: The variance of N observations, x1, x2, ..., xN, is
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \]
The standard deviation, \(\sigma\), of the observations is the square root of the variance, \(\sigma^2\).
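The measures of central tendency and dispersion described above can be sketched with Python's standard library. This is only an illustration: the salary values and the helper names (weighted_mean, modes, midrange, five_number_summary) are assumptions made for the example, and the quartiles use one of several common conventions (the "inclusive" method), so other tools may report slightly different Q1 and Q3.

```python
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


def five_number_summary(values):
    """Minimum, Q1, Median, Q3, Maximum (quartiles by the 'inclusive' method)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Illustrative salary data (in thousands); not taken from a real data set.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.fmean(salaries)
median = statistics.median(salaries)
low, q1, med, q3, high = five_number_summary(salaries)
iqr = q3 - q1
variance = statistics.pvariance(salaries)  # divides by N, as in the formula above
std_dev = statistics.pstdev(salaries)      # square root of the variance

print(mean, median, modes(salaries), midrange(salaries))
print((low, q1, med, q3, high), iqr, variance, std_dev)
```

For this list the mean is 58, the median is 54, the data are bimodal (52 and 70), and the midrange is 70. The single large value 110 pulls the mean above the median, which is the kind of skew the notes describe.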
Graphic Displays of Basic Descriptive Data Summaries:
Apart from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentation software packages, histograms, quantile plots, q-q plots, scatter plots, and loess curves are also used.
Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute.
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information.
A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view
whether there is a shift in going from one distribution to another.
A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numerical attributes.
Figure: Scatter Plot
Q.4. Explain Data Cleaning.
Ans.
1. Real-world data tend to be incomplete, noisy, and inconsistent.
2. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
3. If users believe the data are dirty, they are unlikely to trust the results of any data mining
that has been applied to it. Also, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although mining routines may have some procedures for dealing with incomplete or noisy data, these are not always robust.
4. Ways to handle missing data:
Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: This approach is time-consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown”. The mining program may then mistakenly treat the filled-in values as an interesting concept, since they all share the value “Unknown”. Hence, although this method is simple, it is not foolproof.
Use the attribute mean to fill in the missing value: For example, suppose that the average income of customers is $56,000. This value is used to replace the missing value for income.
Use the attribute mean for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. This is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values. By considering the values of the other attributes in its estimation of the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
The methods that fill in a value automatically (a global constant, the attribute mean, the class mean, or the most probable value) bias the data, because the filled-in value may not be correct.
Good design of databases and of data entry procedures
should help minimize the number of missing values or errors in the first place.
Ways to handle noisy data: Noise is a random error or variance in a measured variable.
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can also be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Bins may be equal-frequency, where each bin holds the same number of values, or equal-width, where the interval range of values in each bin is constant.
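The three smoothing variants can be sketched as follows. This is a minimal illustration assuming equal-frequency bins of a fixed size. The function names and the price list are made up for the example, and ties in smoothing by bin boundaries are resolved toward the lower boundary, which is an assumption rather than something the notes specify.

```python
import statistics


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[statistics.median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative price data; not taken from a real data set.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

With three values per bin, the first bin [4, 8, 15] becomes [9, 9, 9] under smoothing by means and [4, 4, 15] under smoothing by boundaries, which shows how each value is pulled toward its neighbors.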
Binning is also used as a discretization technique.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
Data Cleaning as a process:
1) Discrepancy detection: Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, data entry errors, deliberate errors, data decay, and inconsistency in data representations.
To detect discrepancies, metadata (knowledge that is already known regarding properties of the data) can be used. For example, what are the domain, data type, and acceptable values of each attribute? The descriptive data summaries are useful here for grasping data trends and identifying anomalies.
Field overloading is another source of error that typically results when developers squeeze new
attribute definitions into unused (bit) portions of already defined attributes (e.g., using an unused bit of an attribute whose value range uses only, say, 31 out of 32 bits).
The data should also be examined regarding unique rules (each value of the attribute must differ from all other values of that attribute), consecutive rules (there are no missing values between the lowest and highest values), and null rules (how blanks, question marks, or other null indicators are used and handled).
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools. For example, they may employ statistical analysis to find correlations, or clustering to identify outliers.
Some data inconsistencies may be corrected manually using external references.
2) Data Transformation: Most errors, however, will require data transformations. This is the second step in data cleaning as a process. That is, once we find discrepancies, we typically need to define and apply (a series of) transformations to correct them.
The two-step process of discrepancy detection and data transformation iterates. Transformations are
often done as a batch process. Only after the transformation is complete can the user go back and check that no new anomalies have been created by mistake. Thus, the entire data cleaning process suffers from a lack of interactivity. New approaches to data cleaning emphasize increased interactivity. Also, it is important to keep updating the metadata to reflect what has been learned about the data. This will help speed up data cleaning on future versions of the same data store.
Q.5. Explain Data Integration and Data Transformation.
Ans.
Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
Issues to consider during Data Integration:
Entity Identification Problem: Schema integration and object matching can be tricky. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? Metadata can be used to help avoid errors in schema integration and help transform the data.
Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. We can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient. This is
\[ r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} \]
where N is the number of tuples, \(a_i\) and \(b_i\) are the respective values of A and B in tuple i, \(\bar{A}\) and \(\bar{B}\) are the respective mean values of A and B, and \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B. The value of \(r_{A,B}\) always lies between -1 and +1.
If \(r_{A,B} > 0\), then A and B are positively correlated. A higher value may indicate that A (or B) may be removed as a redundancy.
If \(r_{A,B} = 0\), then A and B are uncorrelated: there is no linear relationship between them.
If \(r_{A,B} < 0\), then A and B are negatively correlated.
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a \(\chi^2\) (chi-square) test.
Duplication should also be detected at the tuple level. Also, the use of denormalized tables is another source of data redundancy.
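The correlation coefficient above can be computed directly from its definition. The sketch below is an illustration only: the function name and the monthly and annual revenue figures are hypothetical, and the standard deviations divide by N so that they match the formula.

```python
import math


def correlation(a, b):
    """Correlation coefficient r_{A,B} of two equally long lists of numbers."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by N), as in the formula.
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * sigma_a * sigma_b)


# Hypothetical attributes: annual revenue is derived from monthly revenue,
# so the two attributes are perfectly correlated and one of them is redundant.
monthly_revenue = [10.0, 12.5, 9.0, 20.0, 15.5]
annual_revenue = [12 * m for m in monthly_revenue]

print(correlation(monthly_revenue, annual_revenue))
```

A result very close to +1 or -1 suggests that one attribute can be derived from the other. A result near 0 only rules out a linear relationship, so it should not be read as proof that the attributes are unrelated.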
Inconsistencies often arise between various duplicates.
Detection and resolution of data value conflicts: For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
When matching attributes from one database to another during integration, special attention must be paid to the structure of the data.
Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent mining process.
Data Transformation:
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to
compute monthly total amounts. This step is typically used in constructing a data cube.
Generalization of the data, where low-level data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering.
Different types of normalization methods:
Min-max normalization: performs a linear transformation on the original data. Suppose that \(\min_A\) and \(\max_A\) are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v, of A to v' in the range \([\mathit{new\_min}_A, \mathit{new\_max}_A]\) by computing
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
z-score normalization (or zero-mean normalization): the values for an attribute, A, are normalized based on the mean and standard
deviation of A. A value, v, of A is normalized to v' by computing
\[ v' = \frac{v - \bar{A}}{\sigma_A} \]
where \(\bar{A}\) and \(\sigma_A\) are the mean and standard deviation, respectively, of attribute A.
Example (z-score normalization): Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000)/16,000 = 1.225.
Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value, v, of A is normalized to v' by computing v' = v/10^j, where j is the smallest integer such that max(|v'|) < 1.
Example (min-max normalization): Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to ((73,600 - 12,000)/(98,000 - 12,000))(1.0 - 0) + 0 = 0.716.
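The three normalization methods can be sketched in a few lines. The function names are chosen for this illustration only. The income figures are the ones used in the examples above, while the list passed to decimal_scaling is a made-up example.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into the range [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """z-score normalization of v given the attribute mean and standard deviation."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Income examples from the text.
print(round(min_max(73600, 12000, 98000), 3))   # min-max into [0.0, 1.0]
print(round(z_score(73600, 54000, 16000), 3))   # z-score

# Illustrative values for decimal scaling: the largest absolute value is 986,
# so every value is divided by 10^3.
scaled, j = decimal_scaling([-986, 917])
print(scaled, j)
```

Min-max normalization needs the attribute's minimum and maximum in advance, so a later value outside that range would map outside the target interval. z-score normalization avoids this but does not bound the result.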
Normalization can change the original data quite a bit, especially z-score normalization and normalization by decimal scaling.
Attribute construction (or feature construction), where new attributes are constructed from the given attributes and added in order to help improve the accuracy and understanding of structure in high-dimensional data. For example, we may wish to add the attribute area based on the attributes height and width. By combining attributes, attribute construction can discover missing information about the relationships between data attributes.
Q.6. Explain the Data Cleaning, Data Integration and Transformation in detail.
Ans.
(Explain the main points of the above two questions.)
Q.7. Explain Data Reduction.
Ans.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Mining on the
reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
Q.8. Explain different Data Reduction Techniques.
Ans.
Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube. For example, suppose the sales data are known per quarter, but the analysis is concerned with annual sales rather than the total per quarter. The data can then be aggregated so that the resulting data summarize the total sales per year instead of per quarter. The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
Attribute Subset Selection: Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes also reduces the number of attributes appearing in the discovered patterns, which helps to make the patterns easier to understand.
Heuristic methods are commonly used for attribute subset selection. These methods are typically greedy: at each step they make the choice that looks best at the time.
Basic heuristic methods are:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure where each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. When decision tree induction is used for attribute subset selection, the attributes that appear in the tree form the reduced subset, and those that do not appear are assumed to be irrelevant.
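Stepwise forward selection, the first of the heuristics above, can be sketched as a greedy loop. The notes do not say how the "best" attribute is measured or when to stop, so both are assumptions here: the caller supplies a score function (higher is better), and the loop stops as soon as no remaining attribute improves the score. The attribute names and usefulness values are invented for the example.

```python
def forward_selection(attributes, score, max_attributes=None):
    """Greedy stepwise forward selection.

    score(subset) is a caller-supplied function returning a number, higher
    being better. Stopping when the score no longer improves is an assumption
    of this sketch, not part of the method as described in the notes.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and (max_attributes is None or len(selected) < max_attributes):
        # Pick the remaining attribute that gives the best score when added.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the reduced set
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness of each attribute, with a small cost per attribute
# so that weakly relevant attributes are not worth adding.
usefulness = {"A1": 0.40, "A2": 0.02, "A3": 0.01, "A4": 0.30, "A5": 0.03, "A6": 0.20}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.05 * len(subset)


reduced = forward_selection(usefulness, toy_score)
print(reduced)
```

In practice the score would come from the mining task itself, for example a statistical significance test or the accuracy of a classifier trained on the candidate subset. Stepwise backward elimination follows the same pattern, starting from the full set and removing the worst attribute at each step.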
Dimensionality Reduction: In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
Wavelet Transforms:
Wavelet transforms are used here as a lossy compression technique. The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients. The wavelet-transformed data can be truncated. Thus, a compressed approximation of the data can be
retained by storing only a small fraction of the strongest of the wavelet coefficients. An approximation of the original data can be constructed by applying the inverse of the DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression.
The method for applying a DWT is a pyramid algorithm that halves the data at each iteration, resulting in fast computational speed:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L >= n, where n is the original length).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements \((x_{2i}, x_{2i+1})\). This results in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
Major features:
The computational complexity of a fast DWT is O(n) for an input vector of length n.
Wavelet transforms can be applied to multidimensional data, such as a data cube.
They give good results on sparse or skewed data and on data with ordered attributes.
Lossy compression by wavelets is reportedly better than JPEG compression.
Principal Component Analysis:
PCA (also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
PCA “combines” the essence of attributes by creating an alternative, smaller set of variables.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These vectors are referred to as the principal components.
3. The principal components are sorted in order of decreasing “significance” or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. For example, Figure 2.17 shows the first two principal components, Y1 and Y2, for the given set of data originally mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.
4. Because the components are sorted according to decreasing order of “significance,” the size of the data can be reduced by eliminating the weaker components.
Features:
PCA is computationally inexpensive,
can be applied to ordered and unordered attributes,
and can handle sparse data and skewed data.
Multidimensional data of more than two dimensions can be handled.
In comparison with wavelet transforms, PCA tends to be better at handling sparse data.
Numerosity Reduction:
Numerosity reduction techniques can be applied to reduce the data volume by choosing alternative, smaller forms of data representation.
These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data. Log-linear models, which estimate discrete multidimensional probability distributions, are an example.
Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Different techniques used are:
Regression and Log-Linear Models:
Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line, y = wx + b, where the variance of y is assumed to be constant. The coefficients, w and b (called regression
coefficients), specify the slope of the line and the Y-intercept, respectively.
Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
Log-linear models approximate discrete multidimensional probability distributions. We can consider each tuple from a set of given n-tuples as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction and data smoothing.
Features:
Regression and log-linear models can both be used on sparse data.
Regression handles skewed data well.
Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
Histograms:
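Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
There are several partitioning rules for determining the buckets and the attribute values:
Equal-width: The width of each bucket range is uniform.
Equal-frequency (or equidepth): The buckets are created so that, roughly, each bucket contains the same number of contiguous data samples.
V-Optimal: Of all the possible histograms for a given number of buckets, the one with the least variance is chosen.
MaxDiff: Bucket boundaries are placed between the pairs of adjacent values that have the largest differences.
Features:
Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data.
Multidimensional histograms can capture dependencies between attributes.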
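The equal-width and equal-frequency rules can be sketched as follows. Each bucket is stored as a (low, high, count) triple, which is the reduced representation that replaces the raw values. The function names and the value list are assumptions made for this illustration, and the V-Optimal and MaxDiff rules are not covered.

```python
def equal_width_buckets(values, num_buckets):
    """Buckets whose ranges all have the same width; returns (low, high, count)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [(lo, hi, len(values))]  # all values identical: a single bucket
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls on the last boundary, so clamp its index.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]


def equal_frequency_buckets(values, num_buckets):
    """Buckets holding roughly the same number of contiguous sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        chunk = ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


# Illustrative, deliberately skewed values; not taken from a real data set.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30]

for low, high, count in equal_width_buckets(values, 3):
    print(f"{low:6.2f} .. {high:6.2f}: {count}")
print(equal_frequency_buckets(values, 3))
```

On this skewed list the equal-width rule puts ten of the twelve values into the first bucket, while the equal-frequency rule keeps four values per bucket at the cost of a very wide last bucket.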
Clustering:
Clustering partitions the objects (data tuples) into groups or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters.
Similarity is commonly defined in terms of how “close” the objects are in space, based on a distance function. The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster, or by its centroid distance, the average distance of each cluster object from the cluster centroid.
The cluster representations of the data are used to replace the actual data. The effectiveness of this technique depends on the nature of the data. It is much more effective for data that can be organized into distinct clusters than for smeared data.
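The two cluster quality measures mentioned above can be computed as shown in this small sketch. It assumes Euclidean distance as the distance function and points given as coordinate tuples. The function names and the sample cluster are illustrative.

```python
import math
from itertools import combinations


def diameter(points):
    """Maximum distance between any two objects in the cluster."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def centroid(points):
    """Coordinate-wise mean of the cluster's points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distance(points):
    """Average distance of each cluster object from the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Illustrative cluster: the four corners of a 2 x 2 square.
cluster = [(0, 0), (0, 2), (2, 0), (2, 2)]

print(diameter(cluster))           # length of the square's diagonal
print(centroid(cluster))           # (1.0, 1.0)
print(centroid_distance(cluster))  # half the diagonal
```

When clusters are used for data reduction, a compact summary such as the centroid together with one of these measures can stand in for the individual tuples of each cluster.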
Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple is the same.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
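The two simple random sampling schemes map directly onto Python's random module. This is a minimal sketch: the data set D is stood in for by a list of hypothetical tuple identifiers, and a seeded generator is used only so that repeated runs draw the same sample.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: each tuple appears at most once."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn more than once."""
    return rng.choices(data, k=s)


# Hypothetical data set D of N = 100 tuple identifiers.
D = [f"T{i}" for i in range(1, 101)]
rng = random.Random(42)  # seeded only to make the example repeatable

without_replacement = srswor(D, 8, rng)
with_replacement = srswr(D, 8, rng)

print(without_replacement)
print(with_replacement)
```

With SRSWOR the sample size cannot exceed N, whereas SRSWR has no such limit because every draw is made from the full data set.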
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For example, in a spatial database, we may choose to define clusters geographically based on how closely different areas are located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
Features: The cost of obtaining a sample is proportional to the size of the sample, s, not N.
Sampling is most commonly used to estimate the answer to an aggregate query. Sampling is a natural choice for the progressive refinement of a reduced data set.
Q.9. Explain Numerosity Reduction in data preprocessing.
Ans. Refer to the above question.
Q.10. Describe Data Discretization and Data summarization with an example.
Ans. Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing numerous values of a continuous attribute by a small number of interval labels reduces and simplifies the original data.
Discretization techniques can be categorized based on how the discretization is performed, such as top-down vs. bottom-up.
If the discretization process uses class information, then we say it is supervised discretization; otherwise it is unsupervised.
Top-down discretization or splitting: The process starts by first finding points (split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up discretization or merging: The process starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.
Concept hierarchies are useful for mining at multiple levels of abstraction. A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret. Mining the generalized data is also more efficient than mining the original, larger data set.
Discretization techniques and concept hierarchies are typically applied as a preprocessing step.
Discretization and Concept Hierarchy Generation for Numerical Data
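As a small example of the generalization described above, the sketch below replaces numerical age values with the labels youth, middle-aged, and senior, and then summarizes the data by counting the tuples under each label. The cut points (30 and 60) and the sample ages are assumptions made for illustration; the notes name the labels but not their boundaries.

```python
from collections import Counter


def generalize_age(age):
    """Map a numerical age to a higher-level concept.

    The boundaries 30 and 60 are hypothetical cut points.
    """
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"


# Hypothetical values of the attribute age.
ages = [13, 22, 25, 35, 41, 47, 52, 61, 70]

labels = [generalize_age(a) for a in ages]  # discretized attribute
summary = Counter(labels)                   # one count per interval label

print(labels)
print(summary)
```

Nine distinct values are reduced to three interval labels, and the per-label counts give a compact summary of the attribute.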
Binning
Binning is a top-down splitting technique based on a specified number of bins. Attribute values can be discretized, for example, by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies. Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
Histogram Analysis
Histogram analysis is an unsupervised discretization technique that partitions the values for an attribute into disjoint ranges called buckets. In an equal-width histogram, for example, the values are partitioned into equal-sized partitions or ranges. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multi-level concept hierarchy. A minimum interval size can also be used per level to control the recursive procedure. Histograms can also be partitioned based on cluster analysis of the data distribution.
Entropy-Based Discretization
Entropy is one of the most commonly used discretization measures. Entropy-based discretization is a supervised, top-down splitting technique.
Let D consist of data tuples defined by a set of attributes and a class-label attribute. The basic method is as follows:
1. Each value of attribute A can be considered as a potential interval boundary or split-point to partition the range of A, creating a data discretization.
2. The expected information requirement for classifying a tuple in D based on partitioning by A is given by
$$\mathrm{Info}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2)$$
where D1 and D2 correspond to the tuples in D satisfying the conditions A <= split point and A > split point, respectively, and |D| is the number of tuples in D. The entropy of D1 is
$$\mathrm{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where p_i is the probability of class C_i in D1 and m is the number of classes. The split-point that minimizes Info_A(D) is selected.
Entropy-based discretization uses class information and can reduce data size and improve accuracy.
Segmentation by Natural Partitioning:
The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals. It partitions a given range into 3, 4, or 5 equi-width intervals recursively, level by level, based on the value range at the most significant digit.
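Returning to the entropy-based method, the following minimal sketch evaluates Info_A(D) for every candidate split-point of a small data set. It assumes the usual selection criterion, namely that the split-point with the smallest Info_A(D) is chosen; the function names and the (value, class) pairs are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_a(pairs, split):
    """Expected information requirement after splitting at `split`.

    D1 holds tuples with A <= split, D2 holds tuples with A > split.
    """
    d1 = [label for value, label in pairs if value <= split]
    d2 = [label for value, label in pairs if value > split]
    n = len(pairs)
    return (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)


def best_split(pairs):
    """Candidate split-point with the smallest Info_A(D).

    The largest value is skipped because splitting there leaves D2 empty.
    """
    candidates = sorted({value for value, _ in pairs})[:-1]
    return min(candidates, key=lambda s: info_a(pairs, s))


# Hypothetical (attribute value, class label) tuples.
D = [(21, "no"), (25, "no"), (30, "no"), (42, "yes"), (47, "yes"), (55, "yes")]

split = best_split(D)
print(split, info_a(D, split))
```

In this toy data set the classes separate cleanly, so one split yields two pure partitions and an expected information requirement of zero. Applying `best_split` again to each partition would continue the top-down discretization.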
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7).
If an interval covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals.
If an interval covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals.
Concept Hierarchy Generation for Categorical Data
Categorical data are discrete data. There are several methods for the generation of concept hierarchies for categorical data.
Specification of a partial ordering of attributes explicitly at the schema level by users or experts: For example, street < city < province or state < country.
Specification of a portion of a hierarchy by explicit data grouping: For example, {Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada.
Specification of a set of attributes, but not of their partial ordering.
Specification of only a partial set of attributes: For example, instead of including all of the hierarchically relevant attributes for location, the user may have specified only street and city.
Q.11. Write short note on Linear Regression.
Ans.
1. Regression can be used to approximate the given data.
2. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (response variable), can be modeled as a linear function of another random variable, x (predictor variable), with the equation y = wx + b, where the variance of y is assumed to be constant.
3. In the context of data mining, x and y are numerical database attributes.
4. The coefficients, w and b (regression coefficients), specify the slope of the line and the Y-intercept, respectively.
5. These coefficients can be solved for by the method of least squares, which minimizes the sum of squared errors between the actual data values and the values estimated by the line.
Multiple linear regression is an extension of linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
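For simple linear regression, the least-squares coefficients have a closed form: w is the sum of (x_i - mean(x)) * (y_i - mean(y)) divided by the sum of (x_i - mean(x))^2, and b = mean(y) - w * mean(x). A minimal sketch follows; the function name and the sample data are hypothetical, and the x values are assumed not to be all equal.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope w and intercept b for y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```

Once w and b are known, the original (x, y) pairs can be replaced by the two coefficients, which is how regression serves as a data reduction technique.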