VTU 7TH SEM CSE DATA WAREHOUSING AND DATA MINING SOLVED PAPERS OF DEC2013 JUNE2014 DEC2014 AND JUNE2015

For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/
VTU NOTES BY SRI
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013

1a. What is an operational data store (ODS)? Explain with a neat diagram. (08 Marks)
Ans: ODS (OPERATIONAL DATA STORE)
• An ODS is defined as a subject-oriented, integrated, volatile, current-valued data store, containing only corporate detailed data.
  → ODS is subject-oriented, i.e. it is organized around the main data-subjects of a company.
  → ODS is integrated, i.e. it is a collection of data from a variety of systems.
  → ODS is volatile, i.e. data changes frequently, as new information refreshes the ODS.
  → ODS is current-valued, i.e. it is up-to-date & reflects the current status of information.
  → ODS is detailed, i.e. it is detailed enough to serve the needs of managers.
ODS DESIGN & IMPLEMENTATION
• The extraction of information from source-databases should be efficient.
• The quality of data should be maintained (Figure 8.1).
• Suitable checks are required to ensure the quality of data after each refresh.
• The ODS is required to
  → satisfy integrity constraints. Ex: existential-integrity, referential-integrity.
  → take appropriate actions to deal with null values.
• The ODS is a read-only database, i.e. users shouldn't be allowed to update information.
• Populating an ODS involves an acquisition process of extracting, transforming & loading data from source systems. This process is called ETL (Extraction, Transformation and Loading).
• Before an ODS can go online, the following 2 tasks must be completed:
  i) Checking for anomalies & ii) Testing for performance.
• Why should an ODS be separate from the operational databases?
  Ans: Because, from time to time, complex queries are likely to degrade the performance of OLTP systems. The OLTP systems have to provide a quick response to operational users. A business cannot afford to have response time suffer while a manager is running a complex query.
1b. What is ETL? Explain the steps in ETL. (07 Marks)
Ans: ETL (EXTRACTION, TRANSFORMATION & LOADING)
• The ETL process consists of
  → data-extraction from source systems,
  → data-transformation, which includes data-cleaning, &
  → data-loading into the ODS or the data-warehouse.
• Data-cleaning deals with detecting & removing errors/inconsistencies from the data.
• Most often, the data is sourced from a variety of systems.
PROBLEMS TO BE SOLVED FOR BUILDING AN INTEGRATED DATABASE
1) Instance Identity Problem
• The same customer may be represented slightly differently in different source-systems.
2) Data Errors
• Different types of data-errors include:
  i) There may be some missing attribute-values.
  ii) There may be duplicate records.
3) Record Linkage Problem
• This deals with the problem of linking information from different databases that relates to the same customer.
4) Semantic Integration Problem
• This deals with the integration of information found in heterogeneous OLTP & legacy sources. For example,
  → Some of the sources may be relational.
  → Some sources may be text documents.
  → Some data may be character strings or integers.
5) Data Integrity Problem
• This deals with issues like i) referential integrity, ii) null values & iii) domain of values.
STEPS IN DATA CLEANING
1) Parsing
• This involves
  → identifying the various components of the source-files and
  → establishing the relationships b/w i) components of the source-files & ii) fields in the target-files.
• For ex: identifying the various components of a person's name and address.
2) Correcting
• Correcting the identified components is based on sophisticated techniques using mathematical algorithms.
• Correcting may involve the use of other related information that may be available in the company.
3) Standardizing
• Business rules of the company are used to transform the data to a standard form.
• For ex: there might be rules on how name and address are to be represented.
4) Matching
• Much of the data extracted from a number of source-systems is likely to be related. Such data needs to be matched.
5) Consolidating
• All corrected, standardized and matched data can now be consolidated to build a single version of the company-data.
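The five cleaning steps above can be sketched roughly in Python. This is only an illustration: the record layout, field names, and the upper-case business rule are all hypothetical.

```python
# Hypothetical sketch of parsing, standardizing, matching and
# consolidating customer names; field names are assumptions.
def parse_name(raw):
    """Parsing: split a raw 'Last, First' string into components."""
    last, _, first = raw.partition(",")
    return {"first": first.strip(), "last": last.strip()}

def standardize(record):
    """Standardizing: apply a business rule -- names in upper case."""
    return {k: v.upper() for k, v in record.items()}

def consolidate(records):
    """Matching + consolidating: group records for the same person."""
    groups = {}
    for r in records:
        groups.setdefault((r["first"], r["last"]), []).append(r)
    return groups

rows = ["Smith, John", "smith,   john", "Doe, Jane"]
consolidated = consolidate([standardize(parse_name(r)) for r in rows])
print(len(consolidated))  # 2 -- the two Smith variants match up
```

Note how the instance-identity problem (the same customer written two ways) disappears once parsing and standardizing have run.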
1c. What are the guidelines for implementing the data-warehouse? (05 Marks)
Ans: DW IMPLEMENTATION GUIDELINES
Build Incrementally
• Firstly, a data-mart is built.
• Then, data-marts for a number of other sections of the company are built.
• Then, the company data-warehouse is implemented in an iterative manner.
• Finally, all data-marts extract information from the data-warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable research into the following: i) the expected costs & ii) the benefits of the project.
• The projects require inputs from many departments in the company.
• Therefore, the projects must be driven by someone who is capable of interacting with people in the company.
Senior Management Support
• The project calls for a sustained commitment from senior management due to
  i) The resource-intensive nature of the projects.
  ii) The time the projects can take to implement.
Ensure Quality
• The data-warehouse should be loaded with i) only cleaned data & ii) only quality data.
Corporate Strategy
• The project must fit with i) the corporate strategy & ii) the business objectives.
Business Plan
• All stakeholders must have a clear understanding of i) the project plan, ii) the financial costs & iii) the expected benefits.
Training
• The users must be trained to i) use the data-warehouse & ii) understand the capabilities of the data-warehouse.
Adaptability
• The project should have built-in adaptability, so that changes may be made to the DW as & when required.
Joint Management
• The project must be managed by both i) IT professionals of the software company & ii) business professionals of the company.

2a. Distinguish between OLTP and OLAP. (04 Marks)
Ans:
2b. Explain the operations on a data-cube with suitable examples. (08 Marks)
Ans:
ROLL-UP
• This is like zooming out on the data-cube (Figure 2.1a).
• This is required when the user needs further abstraction or less detail.
• Initially, the location-hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location-hierarchy from the level of city to the level of country.
Figure 2.1a: Roll-up operation
DRILL-DOWN
• This is like zooming in on the data (Figure 2.1b).
• This is the reverse of roll-up.
• This is an appropriate operation
  → when the user needs further details or
  → when the user wants to partition more finely or
  → when the user wants to focus on some particular values of certain dimensions.
• This adds more detail to the data.
• Initially, the time-hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
Figure 2.1b: Drill-down operation
PIVOT (OR ROTATE)
• This is used when the user wishes to re-orient the view of the data-cube (Figure 2.1c).
• This may involve
  → swapping the rows and columns or
  → moving one of the row-dimensions into the column-dimension.
Figure 2.1c: Pivot operation
SLICE & DICE
• These are operations for browsing the data in the cube.
• These operations allow the user to look at information from different viewpoints.
• A slice is a subset of the cube corresponding to a single value for 1 or more members of the dimensions (Figure 2.1d).
• A dice operation is done by performing a selection on 2 or more dimensions (Figure 2.1e).
Figure 2.1d: Slice operation
Figure 2.1e: Dice operation
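The roll-up and slice operations above can be sketched in plain Python on a tiny made-up fact table; the cities, quarters, and sales figures are invented for illustration.

```python
# A minimal sketch of roll-up and slice on a toy cube:
# (city, quarter, product) -> sales.
from collections import defaultdict

cube = {
    ("Delhi", "Q1", "TV"): 10, ("Delhi", "Q2", "TV"): 15,
    ("Mumbai", "Q1", "TV"): 20, ("Mumbai", "Q1", "PC"): 5,
}
city_to_country = {"Delhi": "India", "Mumbai": "India"}

def roll_up(cube):
    """Roll-up: aggregate the location dimension from city to country."""
    out = defaultdict(int)
    for (city, quarter, product), sales in cube.items():
        out[(city_to_country[city], quarter, product)] += sales
    return dict(out)

def slice_(cube, quarter):
    """Slice: fix a single value of the time dimension."""
    return {k: v for k, v in cube.items() if k[1] == quarter}

print(roll_up(cube)[("India", "Q1", "TV")])  # 30
print(len(slice_(cube, "Q1")))               # 3
```

Drill-down would be the inverse of `roll_up`, which is why the cube normally stores data at the finest level and aggregates upward.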
2c. Write short notes on: (08 Marks)
i) ROLAP ii) MOLAP iii) FASMI iv) Data-cube
Ans: (i) For the answer, refer Solved Paper June-2014 Q.No.2b.
(ii) For the answer, refer Solved Paper June-2014 Q.No.2b.
(iii) For the answer, refer Solved Paper June-2015 Q.No.2a.
(iv) For the answer, refer Solved Paper Dec-2014 Q.No.2a.

3a. Discuss the tasks of data-mining with suitable examples. (10 Marks)
Ans: DATA-MINING
• Data-mining is the process of automatically discovering useful information in large data-repositories.
DATA-MINING TASKS
1) Predictive Modeling
• This refers to the task of building a model for the target-variable as a function of the explanatory-variables.
• The goal is to learn a model that minimizes the error between
  i) the predicted values of the target-variable and
  ii) the true values of the target-variable (Figure 3.1).
• There are 2 types:
  i) Classification: used for discrete target-variables. Ex: predicting whether a web user will make a purchase at an online bookstore is a classification task.
  ii) Regression: used for continuous target-variables. Ex: forecasting the future price of a stock is a regression task.
Figure 3.1: Four core tasks of data-mining
2) Association Analysis
• This is used to find groups of data that have related functionality.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis. We may discover the rule {Diapers} -> {Milk}, which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
  i) Finding groups of genes that have related functionality.
  ii) Identifying web pages that are accessed together.
3) Cluster Analysis
• This seeks to find groups of closely related observations, so that observations belonging to the same cluster are more similar to each other than to observations belonging to other clusters.
• Useful applications:
  i) To group sets of related customers.
  ii) To find areas of the ocean that have a significant impact on Earth's climate.
• For example, the collection of news articles in Table 1.2 shows
  → the first 4 rows speak about the economy &
  → the last 2 rows speak about the health sector.
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies.
• The goal is to
  i) Discover the real anomalies &
  ii) Avoid falsely labeling normal objects as anomalous.
• Useful applications: i) Detection of fraud & ii) Network intrusions.
3b. Explain briefly any five data pre-processing approaches. (10 Marks)
Ans: DATA PRE-PROCESSING
• Data pre-processing is a data-mining technique that involves transforming raw data into an understandable format.
Q: Why is data pre-processing required?
• Data is often collected for unspecified applications.
• Data may have quality problems that need to be addressed before applying a DM technique. For example: 1) Noise & outliers, 2) Missing values & 3) Duplicate data.
• Therefore, pre-processing may be needed to make the data more suitable for data-mining.
DATA PRE-PROCESSING APPROACHES
1. Aggregation
2. Dimensionality reduction
3. Variable transformation
4. Sampling
5. Feature subset selection
6. Discretization & binarization
7. Feature creation
1) AGGREGATION
• This refers to combining 2 or more attributes into a single attribute. For example, merging daily sales-figures to obtain monthly sales-figures.
• Purpose:
  1) Data reduction: smaller data-sets require less processing time & less memory.
  2) Aggregation can act as a change of scale by providing a high-level view of the data instead of a low-level view. E.g. cities aggregated into districts, states, countries, etc.
  3) More "stable" data: aggregated data tends to have less variability.
• Disadvantage: the potential loss of interesting details.
2) DIMENSIONALITY REDUCTION
• Key benefit: many DM algorithms work better if the dimensionality is lower.
Curse of Dimensionality
• Data-analysis becomes much harder as the dimensionality of the data increases.
• As a result, we get i) reduced classification accuracy & ii) poor-quality clusters.
Purpose
• Avoid the curse of dimensionality.
• May help to i) eliminate irrelevant features & ii) reduce noise.
• Allow the data to be more easily visualized.
• Reduce the amount of time and memory required by DM algorithms.
3) VARIABLE TRANSFORMATION
• This refers to a transformation that is applied to all the values of a variable.
  Ex: converting a floating-point value to an absolute value.
• Two types are:
  1) Simple Functions
  • A simple mathematical function is applied to each value individually.
  • For ex: if x is a variable, then transformations may be e^x, 1/x, log(x).
  2) Normalization (or Standardization)
  • The goal is to make an entire set of values have a particular property.
  • If x̄ is the mean of the attribute-values and s_x is their standard deviation, then the transformation x' = (x - x̄)/s_x creates a new variable that has a mean of 0 and a standard deviation of 1.
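The z-score normalization x' = (x - x̄)/s_x described above can be sketched directly; the sample values are made up.

```python
# Sketch of z-score normalization: subtract the mean, divide by the
# (population) standard deviation.
import math

def normalize(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

x = [2.0, 4.0, 6.0, 8.0]
xn = normalize(x)
print(round(sum(xn) / len(xn), 10))  # 0.0 -- the new mean
```

After the transformation the values have mean 0 and standard deviation 1, exactly the property the text states.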
4) SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is used for i) preliminary investigation of the data & ii) final data analysis.
• Q: Why sampling? Ans: Obtaining & processing the entire set of "data of interest" is too expensive or time-consuming.
• Three sampling methods:
  i) Simple Random Sampling
  • There is an equal probability of selecting any particular object.
  • There are 2 types:
    a) Sampling without Replacement: as each object is selected, it is removed from the population.
    b) Sampling with Replacement: objects are not removed from the population as they are selected for the sample. The same object can be picked more than once.
  ii) Stratified Sampling
  • This starts with pre-specified groups of objects.
  • Equal numbers of objects are drawn from each group.
  iii) Progressive Sampling
  • This method starts with a small sample, and then increases the sample-size until a sample of sufficient size has been obtained.
5) FEATURE SUBSET SELECTION
• To reduce the dimensionality, use only a subset of the features.
• Two types of features to drop:
  1) Redundant features duplicate much or all of the information contained in one or more other attributes. For ex: the price of a product & the amount of sales tax paid.
  2) Irrelevant features contain almost no useful information for the DM task at hand. For ex: a student's USN is irrelevant to the task of predicting the student's marks.
• Three techniques:
  1) Embedded approaches: feature selection occurs naturally as part of the DM algorithm.
  2) Filter approaches: features are selected before the DM algorithm is run.
  3) Wrapper approaches: use the DM algorithm as a black box to find the best subset of attributes.
6) DISCRETIZATION AND BINARIZATION
• Classification algorithms require that the data be in the form of categorical attributes.
• Association-analysis algorithms require that the data be in the form of binary attributes.
• Transforming continuous attributes into categorical attributes is called discretization, and transforming continuous & discrete attributes into binary attributes is called binarization.
• The discretization process involves 2 subtasks:
  i) Deciding how many categories to have and
  ii) Determining how to map the values of the continuous attribute to the categories.
7) FEATURE CREATION
• This creates new attributes that can capture the important information in a data-set much more efficiently than the original attributes.
• Three general methods:
  1) Feature extraction: creation of a new set of features from the original raw data.
  2) Mapping data to a new space: a totally different view of the data can reveal important and interesting features.
  3) Feature construction: combining features to get better features than the original.
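The three simple-random and stratified sampling variants above map directly onto the standard library; the population and strata here are invented for illustration.

```python
# Sketch of sampling with/without replacement and stratified sampling,
# using only the stdlib random module.
import random

random.seed(42)                      # fixed seed for reproducibility
data = list(range(100))

# Simple random sampling without replacement: no repeats possible.
without_repl = random.sample(data, 10)
# Simple random sampling with replacement: objects can repeat.
with_repl = [random.choice(data) for _ in range(10)]

# Stratified sampling: equal numbers drawn from pre-specified groups.
strata = {"even": [x for x in data if x % 2 == 0],
          "odd":  [x for x in data if x % 2 == 1]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(len(without_repl), len(set(without_repl)))  # 10 10 -- all distinct
print(len(stratified))                            # 10
```

Progressive sampling would simply wrap one of these in a loop that grows the sample size until some quality criterion is met.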
4a. Develop the Apriori algorithm for generating frequent-itemsets. (08 Marks)
Ans: APRIORI ALGORITHM FOR GENERATING FREQUENT-ITEMSETS
• Let Ck = the set of candidate k-itemsets, and Fk = the set of frequent k-itemsets.
• The algorithm initially makes a single pass over the data-set to determine the support of each item. After this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 & 2).
• Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen.
• To count the support of the candidates, the algorithm needs to make an additional pass over the data-set (steps 6-10). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup (step 12).
• The algorithm terminates when no new frequent-itemsets are generated.

4b. What is association analysis? (04 Marks)
Ans: ASSOCIATION ANALYSIS
• This is used to find groups of data that have related functionality.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis. We may discover the rule {Diapers} -> {Milk}, which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
  i) Finding groups of genes that have related functionality.
  ii) Identifying web pages that are accessed together.
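The Apriori loop described in 4a can be sketched compactly. This is a simplified illustration: candidate generation is reduced to joining frequent (k-1)-itemsets (the subset-based prune step of apriori-gen is omitted), minsup is an absolute count, and the transactions are made up.

```python
# A compact sketch of Apriori: find F1, then repeatedly join, count
# support, and prune until no new frequent itemsets appear.
def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: frequent 1-itemsets.
    fk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = set(fk)
    k = 2
    while fk:
        # Step 5 (simplified apriori-gen): join frequent (k-1)-itemsets.
        candidates = {a | b for a in fk for b in fk if len(a | b) == k}
        # Steps 6-12: count supports and eliminate infrequent candidates.
        fk = {c for c in candidates if support(c) >= minsup}
        frequent |= fk
        k += 1
    return frequent

T = [{"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "diapers"}]
F = apriori(T, minsup=2)
print(frozenset({"milk", "bread"}) in F)  # True
```

Here {bread, diapers} appears in only one transaction, so it is pruned, and no 3-itemset survives, which terminates the loop.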
4c. Consider the transaction data-set. Construct the FP-tree, showing the tree separately after reading each transaction. (08 Marks)
Ans: Procedure:
1. A scan of T1 derives a list of frequent items, ⟨(a:8), (b:5), (c:3), (d:1), ...⟩, in which the items are ordered in descending order of frequency.
2. Then, the root of the tree is created and labeled "null". The FP-tree is constructed as follows:
  (a) The scan of the first transaction leads to the construction of the first branch of the tree: ⟨(a:1), (b:1)⟩ (Figure 6.24i). The frequent items in the transaction are listed according to the order in the list of frequent items.
  (b) For the third transaction (Figure 6.24iii),
    → since its (ordered) frequent-item list a, c, d, e shares a common prefix 'a' with the existing path a:b,
    → the count of each node along the prefix is incremented by 1, and
    → three new nodes (c:1), (d:1), (e:1) are created and linked as a chain of children under (a:2).
  (c) For the seventh transaction, since its frequent-item list contains only the one item 'a', which shares only the node 'a' with the a-prefix subtree, a's count is incremented by 1.
  (d) The above process is repeated for all the transactions.
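The three insertions described above can be mirrored in a minimal FP-tree sketch; the `Node` class and the three example transactions are hypothetical, chosen to match steps (a)-(c).

```python
# A minimal FP-tree insert: each transaction's frequency-ordered item
# list is merged into the tree, sharing prefixes and bumping counts.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_items):
    node = root
    for item in ordered_items:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)                   # the "null" root
insert(root, ["a", "b"])            # (a): first branch (a:1)-(b:1)
insert(root, ["a", "c", "d", "e"])  # (b): shares prefix 'a'; a's count -> 2
insert(root, ["a"])                 # (c): only 'a'; a's count -> 3
print(root.children["a"].count)             # 3
print(sorted(root.children["a"].children))  # ['b', 'c']
```

The shared-prefix behaviour is exactly what makes the FP-tree compact: the three transactions create only one 'a' node.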
5a. Explain Hunt's algorithm and illustrate its working. (08 Marks)
Ans: HUNT'S ALGORITHM
• A decision-tree is grown in a recursive fashion.
• Let Dt = the set of training-records associated with node t, and let y = {y1, y2, . . . , yc} be the class-labels.
• Hunt's algorithm is as follows:
Step 1:
• If all records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2:
• If Dt contains records that belong to more than one class, an attribute test-condition is selected to partition the records into smaller subsets.
• A child node is created for each outcome of the test-condition, and the records in Dt are distributed to the children based on the outcomes.
• The algorithm is then recursively applied to each child node.
EXPLANATION OF DECISION-TREE CONSTRUCTION
1) The initial tree for the classification problem contains a single node with class-label Defaulted=No (Fig 4.7(a)).
2) The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test-condition.
3) Hunt's algorithm is then applied recursively to each child of the root node.
4) The left child of the root is therefore a leaf node labeled Defaulted=No (Fig 4.7(b)).
5) For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class.
TWO DESIGN ISSUES OF DECISION-TREE INDUCTION
1. How should the training-records be split? The algorithm must provide
  i) a method for specifying the test-condition for different attribute-types.
  ii) an objective measure for evaluating the goodness of each test-condition.
2. How should the splitting procedure stop? A possible strategy is to continue expanding a node until either
  i) all the records belong to the same class, or
  ii) all the records have identical attribute values.
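The two steps of Hunt's recursion can be sketched as follows. This is a bare-bones illustration, not the book's implementation: the test-condition simply splits on the next attribute's values, and the tiny loan data-set is invented to echo the Home Owner example.

```python
# A bare-bones sketch of Hunt's algorithm on (attribute-dict, label) pairs.
def hunt(records, attrs):
    labels = {y for _, y in records}
    if len(labels) == 1:                      # Step 1: pure node -> leaf
        return labels.pop()
    if not attrs:                             # identical attributes: majority
        return max(labels, key=[y for _, y in records].count)
    a = attrs[0]                              # Step 2: pick a test-condition
    tree = {}
    for v in {x[a] for x, _ in records}:      # one child per outcome
        subset = [(x, y) for x, y in records if x[a] == v]
        tree[v] = hunt(subset, attrs[1:])     # recurse on each child
    return (a, tree)

# Toy loan data: (Home Owner, Marital Status) -> Defaulted
data = [({"home": "yes", "married": "no"}, "No"),
        ({"home": "no", "married": "yes"}, "No"),
        ({"home": "no", "married": "no"}, "Yes")]
tree = hunt(data, ["home", "married"])
print(tree[0])  # 'home' -- the root test-condition
```

As in Fig 4.7, the home="yes" branch becomes a pure leaf immediately, while the home="no" branch needs one more recursive split.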
5b. What is a rule-based classifier? Explain how a rule-based classifier works. (08 Marks)
Ans: RULE-BASED CLASSIFICATION
• A rule-based classifier is a technique for classifying records using a set of "if...then..." rules.
• The rule-set is represented as R = (r1 ∨ r2 ∨ . . . ∨ rk)
  where R = rule-set and the ri's = classification-rules.
• General format of a rule: ri: (condition_i) → yi
  where condition_i = a conjunction of attribute tests (A1 op v1) ∧ (A2 op v2) ∧ . . . ∧ (Ak op vk),
  yi = class-label,
  LHS = rule antecedent, containing the conjunction of attribute tests (Aj op vj),
  RHS = rule consequent, containing the predicted class yi,
  op = a logical operator such as =, !=, <, >, ≤, ≥.
• For ex, rule R1 is: R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
• Given a data-set D and a rule r: A → y, the quality of the rule can be evaluated using the following two measures:
  i) Coverage is the fraction of records in D that trigger the rule r:
     Coverage(r) = |A| / |D|
  ii) Accuracy is the fraction of records triggered by r whose class-labels are equal to y:
     Accuracy(r) = |A ∩ y| / |A|
  where |A| = no. of records that satisfy the rule antecedent,
  |A ∩ y| = no. of records that satisfy both the antecedent and the consequent, and
  |D| = total no. of records.
HOW A RULE-BASED CLASSIFIER WORKS
• A rule-based classifier classifies a test-record based on the rule(s) triggered by the record.
• Consider the rule-set given below:
  R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
  R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
  R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
  R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
  R5: (Live in Water = sometimes) → Amphibians
• Consider the vertebrates in the table below:
  → A lemur triggers rule R3, so it is classified as a mammal.
  → A turtle triggers the rules R4 and R5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), the conflict must be resolved.
  → None of the rules are applicable to a dogfish shark.
• Characteristics of a rule-based classifier:
1) Mutually Exclusive Rules
• A classifier contains mutually exclusive rules if the rules are independent of each other, i.e. every record is covered by at most one rule.
• In the above example,
  → the lemur triggers only one rule, R3.
  → the dogfish shark triggers no rule.
  → the turtle shows the rules are not mutually exclusive, as it triggers more than one rule (R4, R5).
2) Exhaustive Rules
• A classifier has exhaustive coverage if it accounts for every possible combination of attribute values, i.e. each record is covered by at least one rule.
• In the above example,
  → the lemur and the turtle are each covered by at least one rule.
  → the dogfish shark shows the rules are not exhaustive, as it does not trigger any rule.
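Rule triggering for R1-R5 can be sketched with rules as attribute-value dictionaries; the attribute names and record encodings are assumptions made for this illustration.

```python
# Sketch of rule triggering: a rule fires when every attribute test in
# its antecedent matches the record.
rules = [
    ("Birds",      {"gives_birth": "no",  "can_fly": "yes"}),
    ("Fishes",     {"gives_birth": "no",  "lives_in_water": "yes"}),
    ("Mammals",    {"gives_birth": "yes", "blood_type": "warm"}),
    ("Reptiles",   {"gives_birth": "no",  "can_fly": "no"}),
    ("Amphibians", {"lives_in_water": "sometimes"}),
]

def triggered(record, antecedent):
    return all(record.get(a) == v for a, v in antecedent.items())

lemur = {"gives_birth": "yes", "blood_type": "warm", "can_fly": "no"}
turtle = {"gives_birth": "no", "can_fly": "no", "lives_in_water": "sometimes"}

lemur_classes = [y for y, ante in rules if triggered(lemur, ante)]
turtle_classes = [y for y, ante in rules if triggered(turtle, ante)]
print(lemur_classes)   # ['Mammals'] -- exactly one rule fires
print(turtle_classes)  # ['Reptiles', 'Amphibians'] -- conflict to resolve
```

The turtle's two-class result is precisely the non-mutually-exclusive case discussed above, which an ordered rule-set or a voting scheme would resolve.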
5c. Write the algorithm for k-nearest neighbour classification. (04 Marks)
Ans:
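The answer is not reproduced on this slide; the standard k-nearest-neighbour procedure (compute distances to all training points, take the k closest, majority-vote their labels) can be sketched as:

```python
# A standard k-NN sketch: Euclidean distance, majority vote among the
# k nearest training points. Training data here is made up.
import math
from collections import Counter

def knn_classify(train, x, k):
    """train: list of (point, label) pairs; x: query point."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 1), k=3))  # 'A'
```

With k=3, the two "A" points outvote the single "B" neighbour, illustrating why k is usually chosen odd for two-class problems.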
6a. What is Bayes theorem? Show how it is used for classification. (06 Marks)
Ans: BAYES THEOREM
• Bayes theorem is a statistical principle for combining prior knowledge of the classes with new evidence gathered from data.
• Let X and Y be a pair of random-variables.
• A conditional probability P(X=x | Y=y) is the probability that a random-variable takes on a particular value given that the outcome of another random-variable is known.
• The Bayes theorem is given by
  P(Y | X) = P(X | Y) P(Y) / P(X)
• The Bayes theorem can be used to solve the prediction problem.
• Two implementations of Bayesian methods are used:
  1. Naive Bayes classifier & 2. Bayesian belief network.
NAIVE BAYES CLASSIFIER
• A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent.
• The conditional independence assumption can be formally stated as follows:
  P(X | Y=y) = Π (i=1 to d) P(Xi | Y=y)
  where each attribute-set X = {X1, X2, . . . , Xd} consists of d attributes.
Conditional Independence
• Let X, Y, and Z denote three sets of random-variables.
• The variables in X are said to be conditionally independent of Y, given Z, if the following condition holds:
  P(X | Y, Z) = P(X | Z)
HOW A NAIVE BAYES CLASSIFIER WORKS
• With the conditional independence assumption,
  → instead of computing the class-conditional probability for every combination of X,
  → we only have to estimate the conditional probability of each Xi given Y.
• This approach is more practical because it does not require a very large training-set to obtain a good estimate of the probabilities.
• To classify a test-record, the naive Bayes classifier computes the posterior probability for each class Y:
  P(Y | X) = P(Y) Π (i=1 to d) P(Xi | Y) / P(X)
Estimating Conditional Probabilities for Categorical Attributes
• For a categorical attribute Xi, the conditional probability P(Xi=xi | Y=y) is estimated as the fraction of training instances in class y that take on the particular attribute value xi.
Estimating Conditional Probabilities for Continuous Attributes
• There are 2 ways to estimate the class-conditional probabilities:
  1) We can discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval. This approach transforms the continuous attributes into ordinal attributes. The conditional probability P(Xi | Y=y) is then estimated by computing the fraction of training-records belonging to class y that fall within the corresponding interval for Xi.
  2) We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. A Gaussian distribution is usually chosen to represent the class-conditional probability for continuous attributes.
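The categorical estimate above, P(Y|X) ∝ P(Y) · Π P(Xi|Y) with each factor taken as a training-set fraction, can be sketched directly; the symptom/disease data is invented for illustration.

```python
# A hypothetical naive Bayes classifier for categorical attributes:
# pick the class maximizing prior * product of conditional fractions.
from collections import Counter

def naive_bayes(train, x):
    """train: list of (attribute-dict, label); x: attribute-dict."""
    n = len(train)
    classes = Counter(y for _, y in train)
    best, best_p = None, -1.0
    for y, ny in classes.items():
        p = ny / n                                   # prior P(Y=y)
        for attr, val in x.items():                  # product of P(Xi=xi|Y=y)
            match = sum(1 for r, ry in train if ry == y and r[attr] == val)
            p *= match / ny
        if p > best_p:
            best, best_p = y, p
    return best

train = [({"fever": "yes", "cough": "yes"}, "flu"),
         ({"fever": "yes", "cough": "no"},  "flu"),
         ({"fever": "no",  "cough": "no"},  "cold"),
         ({"fever": "no",  "cough": "yes"}, "cold")]
pred = naive_bayes(train, {"fever": "yes", "cough": "yes"})
print(pred)  # 'flu'
```

P(X) is dropped since it is the same for every class; a fuller version would also add Laplace smoothing so that a single zero fraction cannot wipe out a whole class.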
6b. Discuss methods for estimating the predictive accuracy of classification. (10 Marks)
Ans: PREDICTIVE ACCURACY
• This refers to the ability of the model to correctly predict the class-label of new or previously unseen data.
• A confusion-matrix that summarizes the no. of instances predicted correctly or incorrectly by a classification-model is shown in Table 5.6.
METHODS FOR ESTIMATING PREDICTIVE ACCURACY
1) Sensitivity 2) Specificity 3) Recall & 4) Precision
• Let
  True positives (TP) = no. of positive examples correctly predicted,
  False negatives (FN) = no. of positive examples wrongly predicted as negative,
  False positives (FP) = no. of negative examples wrongly predicted as positive,
  True negatives (TN) = no. of negative examples correctly predicted.
• The true positive rate (TPR) or sensitivity is the fraction of positive examples predicted correctly by the model:
  TPR = TP / (TP + FN)
  Similarly, the true negative rate (TNR) or specificity is the fraction of negative examples predicted correctly by the model:
  TNR = TN / (TN + FP)
• The false positive rate (FPR) is the fraction of negative examples predicted as the positive class:
  FPR = FP / (TN + FP)
  Similarly, the false negative rate (FNR) is the fraction of positive examples predicted as the negative class:
  FNR = FN / (TP + FN)
• Recall and precision are two widely used metrics in applications where the successful detection of one of the classes is considered more significant than the detection of the other classes:
  Precision p = TP / (TP + FP)
  Recall r = TP / (TP + FN)
• Precision determines the fraction of records that actually turn out to be positive in the group the classifier has declared a positive class.
• Recall measures the fraction of positive examples correctly predicted by the classifier.
• A weighted accuracy measure is defined by the following equation:
  Weighted accuracy = (w1·TP + w4·TN) / (w1·TP + w2·FP + w3·FN + w4·TN)

6c. What are the two approaches for extending binary classifiers to handle multi-class problems? (04 Marks)
Ans: For the answer, refer Solved Paper June-2014 Q.No.6b.
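The confusion-matrix metrics in 6b reduce to a few ratios over the four counts; the counts below are made up for illustration.

```python
# Sensitivity, specificity, precision and accuracy from raw
# confusion-matrix counts (hypothetical values).
TP, FN, FP, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)   # true positive rate (= recall)
specificity = TN / (TN + FP)   # true negative rate
precision   = TP / (TP + FP)
accuracy    = (TP + TN) / (TP + FN + FP + TN)

print(sensitivity, specificity)       # 0.8 0.9
print(round(precision, 3), accuracy)  # 0.889 0.85
```

Note how precision and sensitivity (recall) pull in different directions: raising the decision threshold usually trades FP for FN.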
7a. List and explain four distance measures to compute the distance between a pair of points, and find the distance between the two objects represented by attribute values (1,6,2,5,3) & (3,5,2,6,6) using any 2 of the distance measures. (08 Marks)
Ans:
1) EUCLIDEAN DISTANCE
• This metric is the most commonly used to compute distances.
• The largest-valued attribute may dominate the distance.
• Requirement: the attributes should be properly scaled.
• This metric is more appropriate when the data is not standardized.
  D(x,y) = sqrt( Σ (xi - yi)² )
2) MANHATTAN DISTANCE
• In most cases, the result obtained by this measure is similar to that obtained by using the Euclidean distance.
• The largest-valued attribute may dominate the distance.
  D(x,y) = Σ |xi - yi|
3) CHEBYCHEV DISTANCE
• This metric is based on the maximum attribute difference.
  D(x,y) = Max |xi - yi|
4) CATEGORICAL DATA DISTANCE
• This metric may be used if many attributes have categorical values with only a small number of values (e.g. binary values).
  D(x,y) = (number of attributes in which x and y differ) / N
  where N = total number of categorical attributes.
Solution: Given
  (x1,x2,x3,x4,x5) = (1, 6, 2, 5, 3)
  (y1,y2,y3,y4,y5) = (3, 5, 2, 6, 6)
Euclidean distance:
  D(x,y) = sqrt( (1-3)² + (6-5)² + (2-2)² + (5-6)² + (3-6)² ) = sqrt(15) = 3.872983
Manhattan distance:
  D(x,y) = |1-3| + |6-5| + |2-2| + |5-6| + |3-6| = 7
Chebychev distance:
  D(x,y) = Max( |1-3|, |6-5|, |2-2|, |5-6|, |3-6| ) = 3
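The worked solution above can be checked in a few lines of Python:

```python
# Verifying the three distances for x = (1,6,2,5,3), y = (3,5,2,6,6).
import math

x = (1, 6, 2, 5, 3)
y = (3, 5, 2, 6, 6)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))
chebychev = max(abs(a - b) for a, b in zip(x, y))

print(round(euclidean, 6), manhattan, chebychev)  # 3.872983 7 3
```

All three values agree with the hand computation: sqrt(15) ≈ 3.872983, 7 and 3.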
20. 20. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013 A-19
7b. Explain the cluster analysis methods briefly. (08 Marks)
Ans: CLUSTER ANALYSIS METHODS
Partitional Method
• The objects are divided into non-overlapping clusters (or partitions) such that each object is in exactly one cluster (Figure 4.1a).
• The method obtains a single-level partition of objects.
• The analyst has to specify i) the number of clusters (k) in advance and ii) the starting seeds of the clusters.
• The analyst has to use an iterative approach in which he
→ runs the method many times, specifying different numbers of clusters & different starting seeds, &
→ then selects the best solution.
• The method converges to a local minimum rather than the global minimum.
Figure 4.1a Figure 4.1b
Hierarchical Methods
• A set of nested clusters is organized as a hierarchical tree (Figure 4.1b).
• Two types:
1. Agglomerative: This starts with each object in an individual cluster & then tries to merge similar clusters into larger clusters.
2. Divisive: This starts with one cluster & then splits it into smaller clusters.
• Tentative clusters may be merged or split based on some criteria.
Density-based Methods
• A cluster is a dense region of points, which is separated from other regions of high density by low-density regions.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist within a given radius.
• The method can deal with arbitrary-shape clusters.
Grid-based Methods
• The object-space, rather than the data, is divided into a grid.
• This is based on characteristics of the data.
• The method can deal with non-numeric data more easily.
• The method is not affected by data-ordering.
Model-based Methods
• A model is assumed, perhaps based on a probability distribution.
• Essentially, the algorithm tries to build clusters with
→ a high level of similarity within them &
→ a low level of similarity between them.
• Similarity measurement is based on the mean values.
• The algorithm tries to minimize the squared error function.
21. 21. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013 A-20
7c. What are the features of cluster analysis? (04 Marks)
Ans: DESIRED FEATURES OF A CLUSTER ANALYSIS METHOD
Scalability
• Data-mining problems can be large.
• Therefore, a cluster-analysis method should be able to deal with large problems gracefully.
• The method should be able to deal with datasets in which the number of attributes is large.
Only one Scan of the Dataset
• For large problems, data must be stored on disk.
• So, the cost of disk I/O becomes significant in solving the problem.
• Therefore, the method should not require more than one scan of the disk.
Ability to Stop & Resume
• For a large dataset, cluster-analysis may require huge processor-time to complete the task.
• Therefore, the task should be able to be stopped & then resumed as & when required.
Minimal Input Parameters
• The method should not expect too much guidance from the data-mining analyst.
• Therefore, the analyst should not be expected
→ to have domain knowledge of the data and
→ to possess insight into the clusters.
Robustness
• Most data obtained from a variety of sources has errors.
• Therefore, the method should be able to deal with i) noise, ii) outliers & iii) missing values gracefully.
Ability to Discover Different Cluster-Shapes
• Clusters appear in different shapes and not all clusters are spherical.
• Therefore, the method should be able to discover cluster-shapes other than spherical.
Different Data Types
• Many problems have a mixture of data types, e.g. numerical, categorical & textual.
• Therefore, the method should be able to deal with i) numerical data, ii) Boolean data & iii) categorical data.
Result Independent of Data Input Order
• Irrespective of input-order, the result of cluster-analysis of the same data should be the same.
• Therefore, the method should not be sensitive to data input-order.
  22. 22. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013 A-21 8. Write short note on the following: (20 Marks) a. Text mining b. Spatial-databases mining c. Mining temporal databases d. Web content mining Ans (a): TEXT MINING • This is concerned with extraction of info implicitly contained in the collection of documents. • Text-collection lacks the imposed structure of a traditional database. • The text expresses a vast range of information. (DM = Data-mining, TM = Text-Mining). • The text encodes the information in a form that is difficult to decipher automatically. • Traditional DM techniques are designed to operate on structured-databases. • In structured-databases, it is easy to define the set of items and hence, it is easy to use the traditional DM techniques. In textual-database, identifying individual items (or terms) is a difficult task. • TM techniques have to be developed to process the unstructured textual-data. • The inherent nature of textual-data motivates the development of separate TM techniques. For ex, unstructured characteristics. • Two approaches for text-mining: 1) Impose a structure on the textual-database and use any of the known DM techniques meant for structured-databases. 2) Develop a very specific technique for mining that exploits the inherent characteristics of textual-databases. Ans (b): SPATIAL-DATABASES MINING • This refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial-databases. • Consider a map of the city of Mysore containing clusters of points. (Where each point marks the location of a particular house). We can mine varieties of information by identifying likely-relationships. For ex, "the land-value of cluster of residential area around „Mysore Palace‟ is high". Such information could be useful to investors, or prospective home buyers. SPATIAL MINING TASKS 1) Spatial-characteristic Rule • This is a general description of spatial-data. 
• For example, a rule may describe the general price-ranges of houses in various geographic regions. 2) Spatial-discriminant Rule • This is a general description of the features discriminating a class of spatial- data from other classes. • For example, the comparison of price-range of houses in different geographical regions. 3) Spatial Association Rules • These describe the association between spatially related objects. • We can associate spatial attributes with non-spatial attributes. • For example, "the monthly rental of houses around the market area is mostly Rs 500 per sq mt." 4) Attribute-oriented Induction • The concept hierarchies of spatial and non-spatial attributes can be used to determine relationships between different attributes. • For ex, one may be interested in a particular category of land-use patterns. A built-up area may be a recreational facility or a residential complex. Similarly, a recreational facility may be a cinema or a restaurant. 5) Aggregate Proximity Relationships • This problem is concerned with relationships between spatial-clusters based on spatial and non-spatial attributes. • Given „n‟ input clusters, we want to associate the clusters with classes of features. • For example, educational institutions which, in turn, may be comprised of secondary schools and junior colleges or higher institutions. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  23. 23. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013 A-22 Ans (c): MINING TEMPORAL DATABASES • This can be defined as non-trivial extraction of potentially-useful & previously-unrecorded information with an implicit/explicit temporal-content, from large quantities of data. • This has the capability to infer causal and temporal-proximity relationships. FOUR TYPES OF TEMPORAL-DATA 1) Static • Static-data are free of any temporal-reference. • Inferences derived from static-data are also free of any temporality. 2) Sequences (Ordered Sequences of Events) • There may not be any explicit reference to time. • There exists a temporal-relationship between data-items. • For example, market-basket transaction. 3) Timestamped • The temporal-information is explicit. • The relationship can be quantitative, in the sense that → we can say the exact temporal-distance between the data-elements & → we can say that one transaction occurred before another. • For example, census data, land-use data etc. • Inferences derived from this data can be temporal or non-temporal. 4) Fully Temporal • The validity of the data-element is time-dependent. • Inferences derived from this data are necessarily temporal. TEMPORAL DATA-MINING TASKS 1) Temporal Association • We attempt to discover temporal-associations b/w non-temporal itemsets. • For example, "70% of the readers who buy a DBMS book also buy a Data- mining book after a semester". 2) Temporal Classification • We can extend concept of decision-tree construction on temporal-attributes. • For example, a rule could be: "The first case of malaria is normally reported after the first pre-monsoon rain and during the months of May-August". 3) Trend Analysis • The analysis of one or more time series of continuous data may show similar trends i.e. similar shapes across the time axis. • For example, "The deployment of the Android OS is increasingly becoming popular in the Smartphone industry". 
Ans (d): WEB CONTENT MINING • This is the process of extracting useful information from the contents of web-documents. • In recent years, → government information are gradually being placed on the web. → users access digital libraries from the web. → users access web-applications through web-interfaces. • Some of the web-data are hidden-data, and some are generated dynamically. • The web-content consists of different types of data such as text, image, audio & video. • Most of the research on web-mining is focused on the text-contents. • The textual-parts of web-data consist of i) Unstructured-data. For ex: free texts ii) Semi structured-data. For ex: HTML documents iii) Structured-data. For ex: data in the tables • Much of the web-data is unstructured, free text-data. As a result, text-mining techniques can be directly employed for web-mining. • Issues addressed in text mining are: → topic discovery → extracting association patterns → clustering of web documents & → classification of Web Pages. • Research activities have drawn techniques of other disciplines such as i) IR and ii) NLP. (IR = Information Retrieval, NLP = Natural Language Processing). For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  26. 26. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-1 1a. Explain ODS and its structure with a neat figure. (07 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.1a. 1b. Explain the implementation steps for data-warehouse. (07 Marks) Ans: DW IMPLEMENTATION STEPS 1) Requirements Analysis & Capacity Planning • This step involves → defining needs of the company → defining architecture → carrying out capacity-planning & → selecting the hardware & software tools. • This step also involves consulting → with senior-management & → with the various stakeholders. 2) Hardware Integration • Both hardware and software need to be put together by integrating → servers → storage devices & → client software tools. 3) Modeling • This involves designing the warehouse schema and views. • This may involve using a modeling tool if the data-warehouse is complex. 4) Physical Modeling • This involves designing → data-warehouse organization → data placement → data partitioning & → deciding on access methods & indexing. 5) Sources • This involves identifying and connecting the sources using gateways. 6) ETL • This involves → identifying a suitable ETL tool vendor → purchasing the tool & → implementing the tool. • This may include customizing the tool to suit the needs of the company. 7) Populate DW • This involves testing the required ETL-tools using a staging-area. • Then, ETL-tools are used for populating the warehouse. 8) User Applications • This involves designing & implementing applications required by end-users. 9) Roll-out the DW and Applications 1c. Write the differences between OLTP and data-warehouse. (06 Marks) Ans: For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
27. 27. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-2
2a. Explain characteristics of OLAP & write comparison of OLTP & OLAP. (12 Marks)
Ans: CHARACTERISTICS OF OLAP SYSTEMS
1) Users
• OLTP systems are designed for many office-workers, say 100-1000 users. Whereas, OLAP systems are designed for a few decision-makers.
2) Functions
• OLTP systems are mission-critical. They support the company's day-to-day operations. They are mostly performance-driven.
Whereas, OLAP systems are management-critical. They support the company's decision-functions using analytical-investigations.
3) Nature
• OLTP systems are designed to process one record at a time, for ex, a record related to the customer.
Whereas, OLAP systems
→ involve queries that deal with many records at a time &
→ provide aggregate data to a manager.
4) Design
• OLTP systems are designed to be application-oriented. Whereas, OLAP systems are designed to be subject-oriented.
• OLTP systems view the operational-data as a collection of tables. Whereas, OLAP systems view operational-information as a multidimensional model.
5) Data
• OLTP systems deal only with the current-status of information.
• The old information
→ may have been archived &
→ may not be accessible online.
Whereas, OLAP systems require historical-data over several years.
6) Kind of use
• OLTP systems are used for read & write operations. Whereas, OLAP systems normally do not update the data.
COMPARISON OF OLTP & OLAP
Property     | OLTP                               | OLAP
Users        | many office-workers (100-1000)     | few decision-makers
Function     | day-to-day operations              | decision support & analysis
Nature       | one record at a time               | many records at a time (aggregates)
Design       | application-oriented               | subject-oriented, multidimensional
Data         | current, up-to-date                | historical, over several years
Kind of use  | read & write                       | read-mostly, no updates
  28. 28. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-3 2b. Explain ROLAP & MOLAP. (08 Marks) Ans: ROLAP • This uses relational or extended-relational DBMS to store & manage data of warehouse. • This can be considered a bottom-up approach to OLAP. • This is based on using a data-warehouse which is designed using a star scheme. • Data-warehouse provides multidimensional capabilities. • In DW, data is represented in i) fact-table & ii) dimension-table. • The fact-table contains → one column for each dimension & → one column for each measure. • Every row of the fact-table provides one fact. • An OLAP tool is used to manipulate the data in the DW tables. • OLAP tool → groups the fact-table to find aggregates & → uses some of the aggregates already computed to find new aggregates. • Advantages: 1) More easily used with existing relational DBMS. 2) Data can be stored efficiently using tables. 3) Greater scalability. • Disadvantage: 1) Poor query-performance. • Some products are i) Oracle OLAP mode & ii) OLAP Discoverer. MOLAP • This is based on using a multidimensional DBMS. • The multidimensional DBMS is used to store & access data. • This can be considered as a top-down approach to OLAP. • This does not have a standard approach to storing and maintaining the data. • This uses special-purpose file-indexes. • The file-indexes store pre-computation of all aggregations in the data-cube. • Advantages: 1) Implementation is efficient. 2) Easier to use and therefore more suitable for inexperienced users. 3) Fast indexing to pre-computed summarized-data. • Disadvantages: 1) More expensive than ROLAP. 2) Data is not always current. 3) Difficult to scale a MOLAP system for very large problems. 4) Storage-utilization may be low if the data-set is sparse. • Some products are i) Hyperion Essbase & ii) Applix iTM1. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
29. 29. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-4
3a. Explain 4 types of attributes with statistical operations & examples. (06 Marks)
Ans:
1) Nominal Attribute
• The values are just different names, providing only enough information to distinguish one object from another (=, ≠).
• Statistical operations: mode, entropy, contingency correlation.
• Examples: zip codes, employee ID numbers, eye colour.
2) Ordinal Attribute
• The values provide enough information to order objects (<, >).
• Statistical operations: median, percentiles, rank correlation.
• Examples: grades, hardness of minerals, street numbers.
3) Interval Attribute
• The differences between values are meaningful (+, −).
• Statistical operations: mean, standard deviation, Pearson's correlation.
• Examples: calendar dates, temperature in Celsius or Fahrenheit.
4) Ratio Attribute
• Both differences and ratios between values are meaningful (×, /).
• Statistical operations: geometric mean, harmonic mean, percent variation.
• Examples: temperature in Kelvin, length, counts, monetary quantities.
3b. Explain the steps applied in data pre-processing. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b.
30. 30. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-5
3c. Two binary vectors are given below: (04 Marks)
X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
Calculate (i) SMC (ii) Jaccard similarity coefficient and (iii) Hamming distance.
Ans: Solution:
Let f01 = no. of positions where x=0 and y=1 = 2
f10 = no. of positions where x=1 and y=0 = 1
f00 = no. of positions where x=0 and y=0 = 7
f11 = no. of positions where x=1 and y=1 = 0
(i) Simple matching coefficient is given by
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
(ii) Jaccard similarity coefficient is given by
J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
(iii) Hamming distance is given by
D(x,y) = |x1-y1|+|x2-y2|+|x3-y3|+|x4-y4|+|x5-y5|+|x6-y6|+|x7-y7|+|x8-y8|+|x9-y9|+|x10-y10| = 3
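The same three quantities can be computed mechanically from the match counts; this sketch uses the two vectors from 3c:

```python
# Sketch: SMC, Jaccard and Hamming for the two binary vectors in 3c.
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # both 1
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))  # both 0
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)  # counts 0-0 matches too
jaccard = f11 / (f11 + f10 + f01)            # ignores 0-0 matches
hamming = f10 + f01                          # positions that differ

print(smc, jaccard, hamming)  # 0.7 0.0 3
```

The design difference is visible in the results: SMC rewards the many shared zeros, while Jaccard, which ignores 0-0 matches, is 0 here because the vectors share no 1s.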
  31. 31. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-6 4a. Consider the following transaction data-set 'D' shows 9 transactions and list of items using Apriori algorithm frequent-itemset minimum support = 2 (10 Marks) Ans: Step 1: Generating 1-itemset frequent-pattern. Step 2: Generating 2-itemset frequent-pattern. Step 3: Generating 3-itemset frequent-pattern. Step 4: Generating 4-itemset frequent-pattern. • The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {{I2, I3, I5}} is not frequent. • Thus, C4 = φ, and algorithm terminates. 4b. For the following transaction data-set table, construct an FP tree and explain stepwise for all the transaction. (10 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.4b. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
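The four steps of question 4a can be sketched as code. The transaction table's image is not reproduced above, so the dataset below is the standard 9-transaction textbook example this question appears to be based on (an assumption); min support = 2, as stated:

```python
from itertools import combinations

# Sketch of Apriori level-wise search. The transactions below are assumed to be
# the standard textbook dataset (the question's table image is missing above).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_sup = 2

def support(itemset):
    """Count transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent, k = [], 1
level = [frozenset([i]) for i in items]
while level:
    level = [c for c in level if support(c) >= min_sup]  # keep frequent k-itemsets
    frequent.extend(level)
    # Candidate generation: join step, then prune candidates with an infrequent subset.
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    level = [c for c in candidates
             if all(frozenset(s) in frequent for s in combinations(c, k))]
    k += 1

print(sorted(sorted(f) for f in frequent if len(f) == 3))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

With this dataset the search terminates exactly as the answer describes: the only 4-itemset candidate {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent, so C4 is empty.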
32. 32. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-7
5a. Define classification. Draw a neat figure and explain the general approach for solving a classification-model. (06 Marks)
Ans: CLASSIFICATION
• Classification is the task of learning a target-function f that maps each attribute-set x to one of the predefined class-labels y.
• The target-function is also known informally as a classification-model.
• A classification-model is useful for the following purposes:
1) Descriptive Modeling
• A classification-model can serve as an explanatory-tool to distinguish between objects of different classes.
• For example, it is useful for biologists to have a descriptive model.
2) Predictive Modeling
• A classification-model can be used to predict the class-label of unknown-records.
GENERAL APPROACH TO SOLVING A CLASSIFICATION PROBLEM
• First, a training-set consisting of records whose class-labels are known must be provided.
• The training-set is used to build a classification-model.
• The classification-model is applied to the test-set.
• The test-set consists of records with unknown class-labels (Figure 4.3).
• Evaluation of the performance of a classification-model is based on the counts of test-records correctly and incorrectly predicted by the model. These counts are tabulated in a confusion-matrix (Table 4.2).
• Each entry fij in the matrix denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1.
• Accuracy is defined as:
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
• Error rate is defined as:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
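The two definitions above amount to one line of arithmetic each; the confusion-matrix counts below are illustrative, not from any paper in these notes:

```python
# Sketch: accuracy and error rate from the binary confusion-matrix entries
# f11, f10, f01, f00 (rows: actual class, columns: predicted class).
f11, f10, f01, f00 = 40, 10, 5, 45   # illustrative counts

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total       # fraction predicted correctly
error_rate = (f10 + f01) / total     # fraction predicted wrongly

print(accuracy, error_rate)  # 0.85 0.15
```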
33. 33. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-8
5b. Mention the three impurity measures for selecting best splits. (04 Marks)
Ans:
Entropy(t) = − Σi p(i|t) log2 p(i|t)
Gini(t) = 1 − Σi [p(i|t)]²
Classification error(t) = 1 − maxi [p(i|t)]
where c = no. of classes (the sums run over i = 0, ..., c−1)
p(i|t) = fraction of records belonging to class i at node t
5c. Consider a training-set that contains 60 +ve examples and 100 -ve examples. For each of the following candidate rules:
Rule r1: Covers 50 +ve examples and 5 -ve examples.
Rule r2: Covers 2 +ve examples and no -ve examples.
Determine which is the best and worst candidate rule according to
i) Rule accuracy ii) Likelihood ratio statistic iii) Laplace measure. (10 Marks)
Ans: (i) Rule accuracy is given by
Rule accuracy = f+ / n
where n = no. of examples covered by the rule
f+ = no. of positive-examples covered by the rule
For r1: Given f+ = 50, n = 55, rule accuracy is 50/55 = 90.9%.
For r2: Given f+ = 2, n = 2, rule accuracy is 2/2 = 100%.
Therefore, r2 is the best candidate and r1 is the worst candidate according to rule accuracy.
(ii) Likelihood ratio statistic is given by
R = 2 Σi fi log2(fi / ei), for i = 1, ..., k
where k = no. of classes
fi = observed frequency of class-i examples covered by the rule
ei = expected frequency of a rule that makes random predictions
For r1: Expected frequency for the positive-class is e+ = 55 × 60/160 = 20.625
Expected frequency for the negative class is e− = 55 × 100/160 = 34.375
Therefore, the likelihood ratio is
R(r1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9.
For r2: The expected frequency for the positive-class is e+ = 2 × 60/160 = 0.75 and the expected frequency for the negative class is e− = 2 × 100/160 = 1.25.
Therefore, the likelihood ratio is
R(r2) = 2 × [2 × log2(2/0.75) + 0 × log2(0/1.25)] = 5.66.
Therefore, r1 is the best candidate and r2 is the worst candidate according to the likelihood ratio statistic.
(iii) Laplace measure is given by
Laplace = (f+ + 1) / (n + k)
where n = no. of examples covered by the rule
f+ = no. of positive-examples covered by the rule
k = total number of classes
For r1: Laplace measure is (50+1)/(55+2) = 51/57 = 89.47%.
For r2: Laplace measure is (2+1)/(2+2) = 75%.
Therefore, r1 is the best candidate and r2 is the worst candidate according to the Laplace measure.
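The worked numbers for question 5c can be reproduced with a short sketch of the three rule-evaluation measures (the 0 × log2(0) term is taken as 0, as in the answer):

```python
import math

# Sketch: the three rule-evaluation measures from question 5c.
P, N = 60, 100          # positive / negative examples in the training set
total = P + N
k = 2                   # number of classes

def accuracy(fpos, n):
    return fpos / n

def likelihood_ratio(fpos, n):
    fneg = n - fpos
    e_pos = n * P / total          # expected frequency under random predictions
    e_neg = n * N / total
    r = fpos * math.log2(fpos / e_pos) if fpos else 0.0  # 0*log(0) taken as 0
    r += fneg * math.log2(fneg / e_neg) if fneg else 0.0
    return 2 * r

def laplace(fpos, n):
    return (fpos + 1) / (n + k)

print(round(likelihood_ratio(50, 55), 1))  # r1: 99.9
print(round(likelihood_ratio(2, 2), 2))    # r2: 5.66
print(round(laplace(50, 55), 4))           # r1: 0.8947
```

The sketch makes the trade-off concrete: r2 wins on raw accuracy (100% vs 90.9%) only because it covers just two examples, while the likelihood ratio and Laplace measures both penalize that tiny coverage and prefer r1.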
34. 34. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-9
6a. For the given confusion-matrix below for 3 classes, find sensitivity & specificity metrics to estimate predictive accuracy of classification methods. (10 Marks)
Table 6.1: Confusion-matrix for three classes
Ans: Solution:
Let True positive (TP) = no. of positive-examples correctly predicted.
False negative (FN) = no. of positive-examples wrongly predicted as negative.
False positive (FP) = no. of negative-examples wrongly predicted as positive.
True negative (TN) = no. of negative-examples correctly predicted.
True positive rate (TPR) or sensitivity is given by TPR = TP / (TP + FN)
True negative rate (TNR) or specificity is given by TNR = TN / (TN + FP)
Actual class \ Predicted | Class 1 = yes | Class 1 = no
Class 1 = yes            | 8 (TP)        | 1 (FN)
Class 1 = no             | 2 (FP)        | 19 (TN)
Sensitivity is TPR = 8/(8+1) = 88.89%; Specificity is TNR = 19/(19+2) = 90.48%
Table 1: Confusion-matrix for Class-1
Actual class \ Predicted | Class 2 = yes | Class 2 = no
Class 2 = yes            | 2 (TP)        | 9 (FN)
Class 2 = no             | 8 (FP)        | 11 (TN)
Sensitivity is TPR = 2/(2+9) = 18.18%; Specificity is TNR = 11/(11+8) = 57.89%
Table 2: Confusion-matrix for Class-2
Actual class \ Predicted | Class 3 = yes | Class 3 = no
Class 3 = yes            | 0 (TP)        | 0 (FN)
Class 3 = no             | 10 (FP)       | 20 (TN)
Sensitivity is TPR = 0/(0+0), undefined (taken as 0); Specificity is TNR = 20/(20+10) = 66.67%
Table 3: Confusion-matrix for Class-3
35. 35. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-10
6b. Explain with example the two approaches for extending the binary-classifiers to handle the multiclass problem. (10 Marks)
Ans: TWO APPROACHES FOR EXTENDING THE BINARY-CLASSIFIERS
1) 1-r approach & 2) 1-1 approach
• Let Y = {y1, y2, . . . , yK} be the set of classes of the input-data.
1) 1-r (one-against-rest) Approach
• This approach decomposes the multiclass-problem into K binary-problems.
• For each class yi Є Y, a binary-problem is created. All instances that belong to yi are considered positive-examples. The remaining instances are considered negative-examples. A binary-classifier is then constructed to separate instances of class yi from the rest of the classes.
2) 1-1 (one-against-one) Approach
• This approach constructs K(K − 1)/2 binary-classifiers.
• Each classifier is used to distinguish between a pair of classes, (yi, yj).
• Instances that do not belong to either yi or yj are ignored when constructing the binary-classifier for (yi, yj).
• In both the (1-r) and (1-1) approaches, a test-instance is classified by combining the predictions made by the binary-classifiers.
• A voting-scheme is used to combine the predictions.
• The class that receives the highest number of votes is assigned to the test-instance.
• In the 1-r approach, if an instance is classified as negative, then all classes except for the positive-class receive a vote.
Example: Consider a multiclass-problem where Y = {y1, y2, y3, y4}.
• Suppose a test-instance is classified as (+,−,−,−) according to the (1-r) approach.
• In other words, the test-instance is classified as
→ positive when y1 is used as the positive-class &
→ negative when y2, y3, and y4 are used as the positive-class.
• Using a simple majority vote, notice that y1 receives the highest number of votes, which is four, while the remaining classes receive only two votes each.
• Therefore, the test-instance is classified as y1.
• Suppose the test-instance is classified as follows using the 1-1 approach:
• The first two rows in this table correspond to the pair of classes (yi, yj) chosen to build the classifier.
• The last row represents the predicted class for the test-instance.
• After combining the predictions,
→ y1 and y4 each receive two votes &
→ y2 and y3 each receive only one vote.
• Therefore, the test-instance is classified as either y1 or y4, depending on the tie-breaking procedure.
36. 36. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-11
7a. Explain K-means clustering method and algorithm. (10 Marks)
Ans: K-MEANS
• K-means is a partitional method of cluster analysis.
• The objects are divided into non-overlapping clusters (or partitions) such that each object is in exactly one cluster.
• The method obtains a single-level partition of objects.
• This method can only be used if the data-object is located in the main memory.
• The method is called K-means since each of the K clusters is represented by the mean of the objects (called the centroid) within it.
• The method is also called the centroid-method since
→ at each step, the centroid-point of each cluster is assumed to be known &
→ each of the remaining points is allocated to the cluster whose centroid is closest to it.
K-MEANS ALGORITHM
1) Select the number of clusters = k (Figure 7.1a).
2) Pick k seeds as centroids of the k clusters. The seeds may be picked randomly.
3) Compute the Euclidean distance of each object in the dataset from each of the centroids.
4) Allocate each object to the cluster it is nearest to.
5) Compute the centroids of the clusters.
6) Check if the stopping criterion has been met (i.e. cluster-membership is unchanged). If yes, go to step 7. If not, go to step 3.
7) One may decide
→ to stop at this stage or
→ to split a cluster or combine two clusters until a stopping criterion is met.
Figure 7.1a
LIMITATIONS OF K-MEANS
1) The results of the method depend strongly on the initial guesses of the seeds.
2) The method can be sensitive to outliers.
3) The method does not consider the size of the clusters.
4) The method does not deal with overlapping clusters.
5) Often, the local optimum is not as good as the global optimum.
6) The method implicitly assumes spherical probability distribution.
7) The method cannot be used with categorical data.
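The algorithm steps above can be sketched directly. The points and k below are illustrative, and the seeds are fixed (rather than random, as step 2 allows) so the run is reproducible:

```python
import math

# Minimal K-means sketch following steps 1-6 of the algorithm above.
# Data points, k, and the seed choice are illustrative assumptions.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
k = 2

def dist(a, b):
    """Step 3: Euclidean distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def centroid(cluster):
    """Step 5: mean of the points in a cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

seeds = [points[0], points[3]]            # step 2: pick k seeds (fixed here)
while True:
    clusters = [[] for _ in range(k)]
    for p in points:                      # steps 3-4: allocate to nearest centroid
        clusters[min(range(k), key=lambda i: dist(p, seeds[i]))].append(p)
    new_seeds = [centroid(c) for c in clusters]   # step 5: recompute centroids
    if new_seeds == seeds:                # step 6: stop when membership is stable
        break
    seeds = new_seeds

print([len(c) for c in clusters])  # [2, 5]
```

Re-running with different seeds can give a different local minimum, which is exactly limitation 1 above.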
37. 37. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-12
7b. What is the hierarchical clustering method? Explain the algorithms for computing distances between clusters. (10 Marks)
Ans: HIERARCHICAL METHODS
• A set of nested clusters is organized as a hierarchical tree (Figure 7.1b).
• This approach allows clusters to be found at different levels of granularity.
Figure 7.1b
• Two types of hierarchical approaches are: 1) Agglomerative & 2) Divisive
1) AGGLOMERATIVE APPROACH
• This method is basically a bottom-up approach.
• Each object at the start is a cluster by itself.
• The nearby clusters are repeatedly merged, resulting in larger clusters, until all the objects are merged into a single large cluster (Figure 7.1c).
Figure 7.1c
AGGLOMERATIVE ALGORITHM
1) Allocate each point to a cluster of its own. Thus, we start with n clusters for n objects.
2) Create a distance-matrix by computing distances between all pairs of clusters (either using the single-link metric or the complete-link metric). Sort these distances in ascending order.
3) Find the 2 clusters that have the smallest distance between them.
4) Remove the pair of objects and merge them.
5) If there is only one cluster left then stop.
6) Compute all distances from the new cluster, update the distance-matrix after the merger, and go to step 3.
2) DIVISIVE APPROACH
• This method is basically a top-down approach.
• This method
→ starts with the whole dataset as one cluster,
→ then proceeds to recursively divide the cluster into two sub-clusters, and
→ continues until each cluster has only one object (Figure 7.1d).
• Two types are:
1) Monothetic: This splits a cluster using only one attribute at a time. An attribute that has the most variation could be selected.
2) Polythetic: This splits a cluster using all of the attributes together. Two clusters far apart could be built based on the distance between objects.
  38. 38. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014 B-13 DIVISIVE ALGORITHM 1) Decide on a method of measuring the distance between 2 objects. Also, decide a threshold distance. 2) Create a distance-matrix by computing distances between all pairs of objects within the cluster. Sort these distances in ascending order. 3) Find the 2 objects that have the largest distance between them. They are the most dissimilar objects. 4) If the distance between the 2 objects is smaller than the pre-specified threshold and there is no other cluster that needs to be divided then stop, otherwise continue. 5) Use the pair of objects as seeds of a K-means method to create 2 new clusters. 6) If there is only one object in each cluster then stop otherwise continue with step 2. Figure 7.1d 8. Write short notes on the following: a. Web content mining b. Text mining c. Spatial-data-mining d. Spatio-temporal data-mining (20 Marks) Ans (a): For answer, refer Solved Paper Dec-2013 Q.No.8d. Ans (b): For answer, refer Solved Paper Dec-2013 Q.No.8a. Ans (c): For answer, refer Solved Paper Dec-2013 Q.No.8b. Ans (d): SPATIO TEMPORAL DATA-MINING • A spatiotemporal database is a database that manages both space and time information. • For example: Tracking of moving objects which occupies only a single position at a given time. • Spatio-temporal data-mining is an emerging research area. • This is dedicated to the development of computational techniques for the analysis of spatio- temporal databases. • This encompasses techniques for discovering useful spatial and temporal relationships that are not explicitly stored in spatio-temporal datasets. • Both the temporal and spatial dimensions add substantial complexity to data-mining tasks. • Classical data-mining techniques perform poorly when applied to spatio-temporal data-sets because: i) Spatial-data is embedded in a continuous space. Whereas, classical datasets are in discrete notions like transactions. 
ii) Since spatial-data are highly auto-correlated. A common assumption about independence of data samples in classical statistical analysis is generally false. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  40. 40. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-1 1a. What is ODS? How does it differ from data-warehouse? Explain. (08 Marks) Ans: ODS (OPERATIONAL DATA STORE) • ODS is defined as a subject-oriented, integrated, volatile, current-valued data store, containing only corporate-detailed data. → ODS is subject-oriented i.e it is organized around main data-subjects of the company → ODS is integrated i.e. it is a collection of data from a variety of systems. → ODS is volatile i.e. data changes frequently, as new information refreshes ODS. → ODS is current-valued i.e it is up-to-date & reflects the current-status of information. → ODS is detailed i.e. it is detailed enough to serve needs of manager. 1b. Explain the guidelines for data-warehouse implementation. (08 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.1c. 1c. What is ETL? List steps of ETL. (04 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.1b. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  41. 41. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-2 2a. Why multidimensional views of data and data-cubes are used? With a neat diagram, explain data-cube implementations. (10 Marks) Ans: DATA-CUBE • Data-cube refers to multi-dimensional array of data. • The data-cube is used to represent data along some measure-of-interest. • Data-cubes allow us to look at complex data in a simple format. • For ex (Fig 2.1a): A company might summarize financial-data to compare sales i) by product ii) by date & iii) by country. Figure 2.1a: Data-cube of sales DATA-CUBE IMPLEMENTATION 1) Pre-compute and Store All • Millions of aggregates are computed and stored in data-cube. • Advantage: This is the best solution, as far as query response-time is concerned. • Disadvantages: i) This solution is impractical for a large data-cube. ii) Indexing large amounts of data is expensive. 2) Pre-compute (and Store) None • The aggregates are computed on-the-fly using raw data whenever a query is posed. • Advantage: This does not require additional space for storing the cube. • Disadvantage: The query response-time is very poor for large data-cubes. 3) Pre-compute and Store Some • Pre-compute and store the most frequently-queried aggregates and compute other aggregates as the need arises. • The remaining aggregates can be derived using the pre-computed aggregates. • The more aggregates we are able to pre-compute, the better the query-performance. • Data-cube products use following methods for pre-computing aggregates: i) ROLAP (relational OLAP) ii) MOLAP (multidimensional OLAP) For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
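The "pre-compute and store some" strategy above can be sketched in Python. This is a minimal illustration, not a real OLAP engine: a frequently-queried aggregate is materialized once, and a coarser aggregate is then derived from it instead of from the raw fact table (the products, countries and sales values below are made up):

```python
from collections import defaultdict

# Toy fact table: (product, date, country, sales) — hypothetical values.
facts = [
    ("TV",  "2015-01", "India", 100),
    ("TV",  "2015-01", "USA",   250),
    ("VCR", "2015-01", "India",  80),
    ("VCR", "2015-02", "USA",   120),
]

# "Pre-compute some": materialize the (product, country) aggregate once.
by_product_country = defaultdict(int)
for product, date, country, sales in facts:
    by_product_country[(product, country)] += sales

# Coarser aggregates are derived from the stored one, not from raw data.
by_product = defaultdict(int)
for (product, country), total in by_product_country.items():
    by_product[product] += total

print(by_product_country[("TV", "India")])  # 100
print(by_product["TV"])                     # 350
```

This mirrors the ROLAP/MOLAP trade-off: the more aggregates stored up front, the less work per query, at the cost of space and refresh time.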
  42. 42. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-3 2b. What are data-cube operations? Explain. (10 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.2b. 3a. What is data-mining? Explain various data-mining tasks. (06 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.3a. 3b. Why data preprocessing is required in DM? Explain various steps in data preprocessing (06 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b. 3c. Write a short note on data-mining applications. (04 Marks) Ans: DATA-MINING APPLICATIONS Prediction & Description • Data-mining may be used to answer questions like i) "Would this customer buy a product" or ii) "Is this customer likely to leave?” • DM techniques may also be used for sales forecasting and analysis. Relationship Marketing • Customers have a lifetime value, not just the value of a single sale. • Data-mining can helpful for i) Analyzing customer-profiles and improving direct marketing plans. ii) Identifying critical issues that determine client-loyalty. iii) Improving customer retention. Customer Profiling • This is the process of using the relevant- & available-information to i) Describe the characteristics of a group of customers. ii) Identify their discriminators from ordinary consumers. iii) Identify drivers for their purchasing decisions. • This can help the company identify its most valuable customers so that the company may differentiate their needs and values. Outliers Identification & Detecting Fraud • For this, examples include: i) Identifying unusual expense claims by staff. ii) Identifying anomalies in expenditure b/w similar units of the company. iii) Identifying fraud involving credit-cards. Customer Segmentation • This is a way to assess & view individuals in market based on their status & needs. • Data-mining may be used to i) Understand & predict customer behavior & profitability. ii) Develop new products & services. iii) Effectively market new offerings. 
Web site Design & Promotion • Web mining can be used to discover how users navigate a web-site and the results can help in improving the site-design. • Web mining can be used in cross-selling by suggesting to a web-customer, items that he may be interested in. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  43. 43. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-4 4a. Explain FP - growth algorithm for discovering frequent-item sets. What are its limitations? (08 Marks) Ans: FP - GROWTH ALGORITHM • This algorithm → encodes the data-set using a compact data-structure called a FP-tree & → extracts frequent-itemsets directly from this structure (Figure 6.24). • This finds all the frequent-itemsets ending with a particular suffix. • This employs a divide-and-conquer strategy to split the problem into smaller subproblems. • For example, suppose we are interested in finding all frequent-itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider subproblem of finding frequent-itemsets ending in de, followed by ce, be, and ae. • In turn, each of these subproblems is further decomposed into smaller subproblems. • By merging the solutions obtained from the subproblems, all the frequent-itemsets ending in e can be found (Figure 6.27). LIMITATIONS OF FP - GROWTH ALGORITHM • The run-time performance of FP-growth depends on the compaction factor of the data-set. • If the resulting conditional FP-trees are very bushy, then the performance of the algorithm degrades significantly. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
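As a rough illustration of the first phase of FP-growth, the sketch below builds the compact FP-tree: one pass to count item supports, then a second pass inserting each transaction with its frequent items sorted by descending support (so common prefixes share tree paths). The transactions and min_support value are made up, and the conditional-tree extraction of frequent itemsets is omitted:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item supports and keep only frequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    # Pass 2: insert each transaction, items ordered by descending support
    # (ties broken alphabetically) so shared prefixes compress the tree.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

tree = build_fp_tree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b"}],
                     min_support=2)
print(tree.children["b"].count)  # 4  (b is most frequent; all paths share it)
```

The compaction factor mentioned above shows up directly here: the more transactions share ordered prefixes, the fewer nodes the tree needs.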
  44. 44. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-5 4b. What is Apriori algorithm? How is it used to find frequent-itemsets? Explain briefly. (08 Marks) Ans: APRIORI ALGORITHM • The Apriori principle states: “If an itemset is frequent, then all of its subsets must also be frequent.” • Consider the following example. Suppose {c, d, e} is a frequent-itemset; then any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d} and {e} (Figure 6.3). As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent. Frequent-itemset Generation in the Apriori Algorithm • We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3 (Table 6.1). • The Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. • Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6. • Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. • The remaining 4 candidates are frequent, and thus will be used to generate candidate 3-itemsets. • With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are all frequent (Figure 6.5). • The only candidate that has this property is {Bread, Diapers, Milk}.
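The generate-and-prune loop described above can be sketched as follows. This is a minimal, unoptimized version; the item names follow the Table 6.1 example, but the exact transactions are an assumption since the table itself is not reproduced in these notes:

```python
from itertools import combinations

# Assumed market-basket data in the style of Table 6.1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
minsup = 3  # minimum support count

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
freq = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
level, k = freq, 1
while level:
    k += 1
    # Candidate generation: join frequent (k-1)-itemsets, then prune by the
    # Apriori principle — every (k-1)-subset must itself be frequent.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(level)
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= minsup]
    freq += level
print(sorted(tuple(sorted(f)) for f in freq))  # 4 frequent 1- and 4 frequent 2-itemsets
```

With this data, {Beer, Bread} and {Beer, Milk} fall below the support count of 3, exactly as described above, and the lone candidate 3-itemset {Bread, Diapers, Milk} turns out infrequent as well.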
  45. 45. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-6 4c. List the measures used for evaluating association patterns. (04 Marks) Ans: 5a. How are decision-trees used for classification? Explain decision-tree induction algorithm for classification. (10 Marks) Ans: HUNT'S ALGORITHM • A decision-tree is grown in a recursive fashion. • Let Dt = set of training-records associated with node t. Let y = {y1, y2, . . . , yc} be the set of class-labels. • Hunt's algorithm is as follows. Step 1: • If all records in Dt belong to the same class yt, then t is a leaf node labeled as yt. Step 2: • If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets.
  46. 46. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-7 DECISION-TREE ALGORITHM: TREEGROWTH • The input to the algorithm consists of i) training-records E and ii) attribute-set F. • The algorithm works by i) Recursively selecting the best attribute to split the data (Step 7) and ii) Expanding leaf nodes of tree (Steps 11 & 12) until stopping criterion is met (Step 1). The details of this algorithm are explained below: 1. The createNode() function extends the decision-tree by creating a new node. A node in the decision-tree has either a test condition, denoted as node.test_cond, or a class-label, denoted as node.label. 2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training-records. The choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, Gini index. 3. The Classify() function determines the class-label to be assigned to a leaf node. 4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all records have either i)same class-label or ii)same attribute values. 5. After building the decision-tree, a tree-pruning step can be performed to reduce the size of the decision-tree. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
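The role of find_best_split() can be illustrated with a small sketch that scores each candidate attribute by the weighted Gini impurity of the partitions it induces, picking the lowest (the records, attributes and labels below are made-up toy data):

```python
def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, labels, attributes):
    # Score each attribute by the weighted impurity of its partitions
    # (the job of find_best_split() in the TreeGrowth algorithm above).
    best = None
    for attr in attributes:
        parts = {}
        for rec, y in zip(records, labels):
            parts.setdefault(rec[attr], []).append(y)
        score = sum(len(p) / len(labels) * gini(p) for p in parts.values())
        if best is None or score < best[1]:
            best = (attr, score)
    return best

records = [{"home": "yes", "married": "no"},
           {"home": "no",  "married": "yes"},
           {"home": "no",  "married": "no"},
           {"home": "yes", "married": "yes"}]
labels = ["no", "yes", "yes", "no"]
print(best_split(records, labels, ["home", "married"]))  # ('home', 0.0)
```

Here splitting on "home" yields two pure partitions (impurity 0), so TreeGrowth would choose it as the test condition; entropy could be substituted for Gini without changing the structure of the search.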
  47. 47. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-8 5b. How to improve accuracy of classification? Explain. (05 Marks) Ans: BAGGING • Bagging is also known as bootstrap aggregating. • Bagging is a technique that repeatedly samples (with replacement) from a data-set according to a uniform probability distribution. • Each bootstrap sample has the same size as the original data. • Because the sampling is done with replacement, some instances may appear several times in the same training-set & other instances may be omitted from the training-set. BOOSTING • Boosting is an iterative procedure used to adaptively change the distribution of training examples so that the base classifiers will focus on examples that are hard to classify. • Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round. • The weights assigned to the training examples can be used in following ways: 1. They can be used as a sampling distribution to draw a set of bootstrap samples from the original data. 2. They can be used by the base classifier to learn a model that is biased toward higher-weight examples. 5c. Explain importance of evaluation criteria for classification methods. (05 Marks) Ans: • Predictive Accuracy: refers to the ability of the model to correctly predict the class-label of new or previously unseen data. • Speed: refers to the computation costs involved in generating and using the model. Speed involves not just the time or computation cost of constructing a model (e.g. a decision-tree), it also includes the time required to learn to use the model. • Robustness: is the ability of the model to make correct predictions given noisy data or data with missing values. Most data obtained from a variety of sources has errors. Therefore, the method should be able to deal with noise, outlier & missing values gracefully. 
• Scalability: refers to ability to construct the model efficiently given large amount of data. Data-mining problems can be large and therefore the method should be able to deal with large problems gracefully. • Interpretability: refers to level of understanding & insight that is provided by the model. An important task of a DM professional is to ensure that the results of data-mining are explained to the decision-makers. It is therefore desirable that the end-user be able to understand and gain insight from the results produced by the classification-method. • Goodness of the model: For a model to be effective, it needs to fit the problem that is being solved. For example, in a decision-tree classification, it is desirable to find a decision-tree of the right size and compactness with high accuracy. 6a. What are Baysian classifiers? Explain Baye's theorem. (10 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.6a. 6b. How rule based classifiers are used for classification? Explain. (10 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.5b. 7a. Explain K-means clustering algorithm. What are its limitations? (10 Marks) Ans: For answer, refer Solved Paper June-2014 Q.No.7a. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
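The bootstrap sampling that bagging (5b above) relies on can be sketched as follows. Because sampling is done with replacement, each bootstrap sample is expected to contain roughly 63.2% of the distinct original records, the rest being repeats (the data and seed below are arbitrary):

```python
import random

def bootstrap_sample(data, rng):
    # Draw n records with replacement — same size as the original data-set;
    # some records appear several times, others are omitted.
    return [rng.choice(data) for _ in data]

rng = random.Random(42)
data = list(range(1000))
sample = bootstrap_sample(data, rng)
distinct = len(set(sample)) / len(data)
print(f"{distinct:.1%} of the original records appear in the sample")
```

The omitted records are exactly the ones the bootstrap method (6a) later uses as a test-set.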
  48. 48. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-9 7b. How density based methods are used for clustering? Explain. (10 Marks) Ans: DENSITY-BASED METHODS • A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. • Typically, for each data-point in a cluster, at least a minimum number of points must exist within a given radius. • Data that is not within such high-density clusters is regarded as outliers or noise. • For example: DBSCAN (Density Based Spatial Clustering of Applications with Noise). DBSCAN • It requires 2 input parameters: 1) Size of the neighborhood (R) & 2) Minimum points in the neighborhood (N). • The point-parameter N → determines the density of acceptable-clusters & → determines which objects will be labeled outliers or noise. • The size-parameter R determines the size of the clusters found. • If R is big enough, there will be one big cluster and no outliers. If R is small, there will be small dense clusters and there might be many outliers. • We define a number of terms (Figure 7.2): 1. Neighborhood: The neighborhood of an object y is defined as all the objects that are within the radius R from y. 2. Core-object: An object y is called a core-object if there are N objects within its neighborhood. 3. Proximity: Two objects are defined to be in proximity to each other if they belong to the same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied: i) The objects are close enough to each other, i.e. within a distance of R. ii) x2 is a core object. 4. Connectivity: Two objects x1 and xn are connected if there is a chain of objects x1,x2. . . .xn from x1 to xn such that each xi+1 is in proximity to object xi. DBSCAN ALGORITHM 1. Select values of R and N. 2. Arbitrarily select an object p. 3. Retrieve all objects that are connected to p, given R and N. 4. If p is a core object, a cluster is formed. 5. 
If p is a border object, no objects are in its proximity. Choose another object. Go to step 3. 6. Continue the process until all of the objects have been processed. Figure 7.2:DBSCAN For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
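A minimal sketch of the DBSCAN procedure above, using 2-D points and Euclidean distance (the points and the R, N parameter values are made up; the neighborhood here includes the point itself):

```python
def region_query(points, idx, R):
    # All points within radius R of points[idx] (including itself).
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= R * R]

def dbscan(points, R, N):
    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, R)
        if len(neighbors) < N:
            labels[i] = -1             # tentatively noise
            continue
        cluster += 1                   # i is a core object: start a cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border object reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region_query(points, j, R)
            if len(nb) >= N:           # j is also a core: keep expanding
                queue.extend(nb)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(pts, R=1.5, N=2))  # [1, 1, 1, 2, 2, 2, -1]
```

The two dense groups become clusters 1 and 2, while the isolated point (5, 5) has fewer than N neighbors and is labeled noise, matching the connectivity definition above.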
  49. 49. DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014 C-10 8a. What is web content mining? Explain. (08 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.8d. 8b. Write a short note on following: (12 Marks) i) Text mining ii) Temporal database mining iii) Text clustering . Ans (i): For answer, refer Solved Paper Dec-2013 Q.No.8a. Ans (ii): For answer, refer Solved Paper Dec-2013 Q.No.8c. Ans (iii): TEXT CLUSTERING • Once the features of an unstructured-text are identified, text-clustering can be done. • Text-clustering can be done by using any clustering technique. For ex: ward's minimum variance method. • Ward‟s method is an agglomerative hierarchical clustering technique. • Ward‟s method tends to generate very compact clusters. • Following measure of the dissimilarities between feature vectors can be used i) Euclidean metric or ii) Hamming distance • The clustering method begins with „n‟ clusters, one for each text. • At any stage, 2 clusters are merged to generate a new cluster based on the following criterion: where xk is mean value of the dissimilarity for cluster Ck and nk is the no. of elements in cluster. SCATTER/GATHER • It is a method of grouping the documents based on the overall similarities in their content. • Scatter/gather is so named because → it allows the user to scatter documents into groups(or clusters) → then gather a subset of these groups and → re-scatter them to form new groups. • Each cluster is represented by a list of topical terms. • Topical terms are a list of words that attempt to give the user an idea of what the documents in the cluster are about. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  52. 52. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-1 1a. Explain the characteristics of ODS. (06 Marks) Ans: ODS (OPERATIONAL DATA STORE) • ODS is defined as a subject-oriented, integrated, volatile, current-valued data store, containing only corporate-detailed data. → ODS is subject-oriented i.e it is organized around main data-subjects of the company → ODS is integrated i.e. it is a collection of data from a variety of systems. → ODS is volatile i.e. data changes frequently, as new information refreshes ODS. → ODS is current-valued i.e it is up-to-date & reflects the current-status of information. → ODS is detailed i.e. it is detailed enough to serve needs of manager. • Benefits of ODS to the company: 1) ODS is the unified-operational view of the company. ODS provides the managers improved access to important operational-data. This view assists in better understanding of i) business & ii) customer. 2) ODS is more effective in generating current-reports without accessing OLTP. 3) ODS can shorten time required to implement a data-warehouse system. • Different types of ODS: 1) The ODS can be used as a reporting-tool for administrative purposes. The ODS is usually updated daily. 2) The ODS can be used to track more complex-information such as product-code & location-code. The ODS is usually updated hourly. 3) The ODS can be used to support CRM (Customer Relationship Management). 1b. List the major steps involved in the ETL process. (06 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.1b. 1c. Based on oracle, what are difference b/w OLTP & DW systems. (08 Marks) Ans: For answer, refer Solved Paper June-2014 Q.No.1c. 2a. Discuss the FASMI characteristics of OLAP. (05 Marks) Ans: FASMI CHARACTERISTICS OF OLAP SYSTEMS 1) Fast • Most queries must be answered very quickly, perhaps within seconds. • The performance of the system must be like a search-engine. • The data-structures must be efficient. 
• The hardware must be powerful to support i) Large amount of data & ii) Large number of users. • One of the approaches to speed-up the system is → pre-compute the most commonly queried aggregates & → compute the remaining aggregates on-the-fly. 2) Analytic • The system must provide rich analytic-functionality. • Most queries must be answered without any programming. • System must be able to manage any relevant queries for application & user. 3) Shared • The system is → accessed by few business-analysts & → used by thousands of users. • Being a shared system, the OLAP software must provide adequate security for i) confidentiality & ii) integrity. • Concurrency-control is required if users are updating data in the database. 4) Multidimensional • This is the basic requirement. • OLAP software must provide a multidimensional conceptual-view of the data. • A dimension has hierarchies that show parent/child relationships between the members of dimensions. • The multidimensional structure must allow hierarchies of parent/child relationships. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  53. 53. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-2 5) Information • The system should be able to handle a large amount of input-data. • Two important critical factors: i) The capacity of system to handle information & ii) Integration of information with the data-warehouse. 2b. Explain Codd's OLAP rules. (10 Marks) Ans: CODD'S OLAP CHARACTERISTICS Multidimensional Conceptual View • This is the central characteristics. • Because of multidimensional-view, data-cube operations like slice and dice can be performed. Accessibility (OLAP as a Mediator) • The OLAP software should be sitting b/w i) Data-sources & ii) OLAP front-end. Batch Extraction vs. Interpretive • In large multidimensional databases, the system should provide → multidimensional-data staging plus → partial pre-calculation of aggregates. Multi-user Support • Being a shared -system, the OLAP software should provide normal database operations including retrieval, update, integrity and security. Storing results of OLAP • OLAP results-data should be kept separate from source-data. • Read-write applications should not be implemented directly on live transaction-data if source-systems are supplying information to the system directly. Extraction of Missing Values • The system should distinguish missing-values from zero-values. • If a distinction is not made, then the aggregates are computed incorrectly. Uniform Reporting Performance • Increasing the number of dimensions (or database-size) should not degrade the reporting performance of the system. 2c. Describe the difference between ROLAP & MOLAP. (05 Marks) Ans: 3a. What is data preprocessing? Explain various pre-processing tasks. (14 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  54. 54. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-3 3b. Explain the following: (06 Marks) i) Euclidean distance ii) Simple matching coefficient iii) Jaccard coefficient. Ans (i): EUCLIDEAN DISTANCE • The Euclidean distance (D) between two points x and y is given by: where xi and yi are respectively the ith attributes of x & y Example: Find the distance between 2 objects represented by attribute values: x = (1, 6, 2, 5, 3) & y = (3, 5, 2, 6, 6) Solution: Let (x1, x2, x3, x4, x5) = (1, 6, 2, 5, 3) (y1, y2, y3, y4, y5) = (3, 5, 2, 6, 6) Euclidean Distance is calculated as follows: Ans (ii): SIMPLE MATCHING COEFFICIENT • SMC is used as a similarity coefficient. • SMC is given by where f00= no. of attributes where x is 0 and y is 0 • This measure counts both presences and absences equally. Ans (iii): JACCARD COEFFICIENT • The Jaccard coefficient is used to handle objects consisting of asymmetric binary attributes. • The jaccard coefficient is given by: Example: Calculate SMC and Jaccard Similarity Coefficients for the following two binary vectors: x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) & y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1) Solution: 4a. Explain frequent-itemset generation in the apriori algorithm. (10 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.4a. 4b. What is FP - Growth algorithm? In what way it is used to find frequency itemsets? (03 Marks) Ans: For answer, refer Solved Paper Dec-2014 Q.No.4a. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
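The three measures in 3b, applied to the worked examples above, can be verified with a short sketch:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def smc(x, y):
    # Simple matching coefficient: (f11 + f00) / total attributes —
    # counts presences and absences equally.
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

def jaccard(x, y):
    # Jaccard: f11 / (f11 + f10 + f01) — ignores the 0-0 matches,
    # suited to asymmetric binary attributes.
    f11 = sum(xi == yi == 1 for xi, yi in zip(x, y))
    nonzero = sum(xi == 1 or yi == 1 for xi, yi in zip(x, y))
    return f11 / nonzero if nonzero else 0.0

print(euclidean((1, 6, 2, 5, 3), (3, 5, 2, 6, 6)))  # sqrt(15) ≈ 3.873
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1)
print(smc(x, y), jaccard(x, y))  # 0.7 0.0
```

For the binary vectors: f11 = 0, f10 = 1, f01 = 2, f00 = 7, so SMC = 7/10 = 0.7 while Jaccard = 0/3 = 0 — the 0-0 matches inflate SMC but are ignored by Jaccard.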
  55. 55. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-4 4c. Construct the FP tree for following data-set. Show the trees separately after reading each transaction. (07 Marks) Ans: For answer, refer Solved Paper Dec-2013 Q.No.4b. 5a. What is classification? Explain 2 classification-models with example. (06 Marks) Ans: CLASSIFICATION • Classification is the task of learning a target-function ‘f’ that maps each attribute-set x to one of the predefined class-labels y. • The target-function is also known as a classification-model. • A classification-model is useful for the following 2 purposes (Figure 4.3): 1) DESCRIPTIVE MODELING • A classification-model can serve as an explanatory-tool to distinguish between objects of different classes. • For example, it is useful for biologists to have a descriptive model. 2) PREDICTIVE MODELING • A classification-model can also be used to predict the class-label of unknown-records. • As a classification-model automatically assigns a class-label when presented with the attribute-set of an unknown-record. • Classification techniques are most suited for predicting or describing data-sets with binary- or nominal-categories. • They are less effective for ordinal categories because they do not consider the implicit order among the categories. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  56. 56. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-5 5b. Discuss the characteristics of decision-tree induction algorithms. (10 Marks) Ans: CHARACTERISTICS OF DT INDUCTION ALGORITHMS 1. Decision-tree induction is a non-parametric approach for building classification- models. 2. Finding an optimal tree is NP complete problem. Many DM algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space. 3. Techniques developed for constructing trees are computationally inexpensive i.e. it is possible to quickly construct models even when the training-set size is very large. Furthermore, once a tree has been built, classifying a test-record is extremely fast, with a worst-case complexity of O(w) where w = maximum depth of tree 4. Smaller-sized trees are relatively easy to interpret. 5. Trees provide an expressive representation for learning discrete valued functions. However, they do not generalize well to certain types of Boolean problems. 6. A subtree can be replicated multiple times in a tree (Figure 4.19). This makes the tree more complex than necessary and perhaps more difficult to interpret. The algorithm solves the sub-trees by using the divide and conquer algorithm to avoid complexity. 7. DT algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting, are employed. 8. The presence of redundant attributes does not affect the accuracy of trees. An attribute is redundant if it is strongly correlated with another attribute in data. 9. At the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes. This is known as the data fragmentation problem. Solution: Disallow further splitting when the number of records falls below a certain threshold. 10. 
The tree-growing procedure can be viewed as process of partitioning the attribute- space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  57. 57. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-6 5c. Explain sequential covering algorithm in rule-based classifier. (04 Marks) Ans: SEQUENTIAL COVERING ALGORITHM • This is used to extract rules directly from data. • This extracts the rules one class at a time for data-sets that contain more than 2 classes. • The criterion for deciding which class should be generated first depends on: i) Fraction of training-records that belong to a particular class or ii) Cost of misclassifying records from a given class. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
  58. 58. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-7 6a. List 5 criteria for evaluating classification methods. Explain briefly. (05 Marks) Ans: FIVE CRITERIA FOR EVALUATING CLASSIFICATION METHODS 1) Holdout method 2) Random Subsampling 3) Cross-Validation 4) Leave-one-out approach 5) Bootstrap 1) Holdout method • The original data is divided into 2 disjoint set: i) Training-set & ii) Test-set. • A classification-model is induced from the training-set. • Performance of classification-model is evaluated on the test-set. • The proportion of data is reserved as either i) 50% for training and 50% for testing or ii) 2/3 for training and 1/3 for testing. • The accuracy of the classifier can be estimated based on accuracy of induced model. 2) Random Subsampling • The holdout method can be repeated several times to improve the estimation of a classifier's performance. • Limitation: It has no control over the number of times each record is used for testing & training. 3) Cross-Validation • In K-fold cross-validation, the available data is randomly divided into k-disjoint subsets of approximately equal-size. • One of the subsets is then used as the test-set. Remaining (k – 1) sets are used for building the classifier. • The test-set is used to estimate the accuracy. • This is done repeatedly k times so that each subset is used as a test subset once. 4) Leave-one-out approach • A special case of k-fold cross-validation method sets k = N, the size of the data-set. • Each test-set contains only one record. • Advantages: 1) Utilizes as much data as possible for training. 2) Test-sets are mutually exclusive & they effectively cover entire data-set. • Two drawbacks: 1) Computationally expensive for large datasets. 2) Since each test-set contains only one record, the variance of the estimated performance metric tends to be high. 5) Bootstrap • The training-records are sampled with replacement; i.e. 
a record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. • A sample contains about 63.2% of the records in the original data. • Records that are not included in the bootstrap sample become part of the test-set. • The model induced from the training-set is applied to the test-set to obtain an estimate of the accuracy of the bootstrap sample, εi. • The sampling procedure is then repeated ‘b’ times to generate ‘b’ bootstrap samples. 6b. What is predictive accuracy of classification methods? Explain different types of estimating the accuracy of a method. (07 Marks) Ans: For answer, refer Solved Paper Dec-2014 Q.No.5c. For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/ VTU N O TESBYSR I
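The k-fold partitioning in method 3 can be sketched as follows. The round-robin fold assignment is one simple choice for illustration; practical implementations usually shuffle the records first:

```python
def kfold_indices(n, k):
    # Partition record indices 0..n-1 into k roughly equal disjoint folds;
    # fold i serves as the test-set while the remaining k-1 folds train
    # the classifier. Each record is tested exactly once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not test for j in f]
        yield train, test

for train, test in kfold_indices(10, 5):
    print(test)
```

Setting k = n gives the leave-one-out special case described above, with a single record per test-set.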
  59. 59. DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015 D-8 6c. Consider the following training-set for predicting the loan default problem: Find the conditional independence for the given training-set using Bayes theorem for classification. (08 Marks) Ans: • For each class yj, the class-conditional probability for a continuous attribute Xi is modeled with a Gaussian distribution: P(Xi = xi | yj) = (1 / √(2πσij²)) exp(−(xi − μij)² / (2σij²)), where μij is the sample mean and σij² is the sample variance of Xi over the records of class yj. • The sample mean and variance for the annual income attribute with respect to the class No are: • Given a test-record with taxable income equal to $120K, we can compute its class-conditional probability as follows: • Since there are three records that belong to the class Yes and seven records that belong to the class No, P(Yes) = 0.3 and P(No) = 0.7. • Using the information provided in Figure 5.10(b), the class-conditional probabilities can be computed as follows:
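The class-conditional computation above can be reproduced with a short sketch. Note that the annual-income values for class No are an assumption (the training table itself did not survive in these notes); they follow the standard loan-default example, giving mean 110 and sample variance 2975:

```python
import math

def gaussian(x, mean, var):
    # Class-conditional density for a continuous attribute in naive Bayes.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Annual incomes (in $K) of the seven class-No records — assumed values
# matching the standard loan-default example.
incomes_no = [125, 100, 70, 120, 60, 220, 75]
mean = sum(incomes_no) / len(incomes_no)                                 # 110.0
var = sum((x - mean) ** 2 for x in incomes_no) / (len(incomes_no) - 1)   # 2975.0
print(round(gaussian(120, mean, var), 4))  # ≈ 0.0072
```

This density (about 0.0072) is then multiplied with the class-conditional probabilities of the categorical attributes and the prior P(No) = 0.7 to score the class.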
