DATA MINING
DATA MINING Document Transcript

  • 1.
    NAME:                IDNO:

    BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
    I SEMESTER 2007-2008
    CS C415/IS C415 – DATA MINING
    Comprehensive Examination
    Weightage: 40%    05th December, 2007    Duration: 3 Hours

    PART A – CLOSED BOOK
    Multiple Choice Questions (38*0.5=19)

      • A question may have multiple correct answers. Credit will be given only when you mark all the correct options.
      • There is NO NEGATIVE MARKING.
      • ENCIRCLE the correct option(s) using ink.
      • Mark your answers in the question paper itself.

    1. A more appropriate name for Data Mining could be:
       a. Internet Mining
       b. Data Warehouse Mining
       c. Knowledge Mining
       d. Database Mining

    2. The most general form of distance is:
       a. Euclidean distance
       b. Manhattan distance
       c. Minkowski distance
       d. Supremum distance

    3. Pick the odd one out:
       a. SQL
       b. Data Warehouse
       c. Data Mining
       d. OLAP

    4. A Data Warehouse is a good source of data for downstream data mining applications because:
       a. It contains historical data
       b. It contains aggregated data
       c. It contains integrated data
       d. It contains preprocessed data

    5. Pick the right sequence:
       a. OLTP-DW-DM-OLAP
       b. OLTP-DW-OLAP-DM
       c. DW-OLTP-OLAP-DM
       d. OLAP-OLTP-DW-DM

    6. Scalable DM algorithms are those for which:
       a. Running time remains the same with increasing amount of data
       b. Running time increases exponentially with increasing amount of data
       c. Running time decreases with increasing amount of data
       d. Running time increases linearly with increasing amount of data

    7. Removing some irrelevant attributes from data sets is called:
       a. Data Pruning
       b. Normalization
       c. Dimensionality reduction
       d. Attribute subset selection

    8. Which is (are) true about noise & outliers in a data set:
       a. Noise represents erroneous values
       b. Noise represents unusual behavior
       c. Outliers may be there due to noise
       d. Noise may be there due to outliers
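As an illustration of the distance measures named in Question 2, the following is a minimal sketch of the Minkowski distance of order p. The function name `minkowski` and the sample points are assumptions made for this example; they are not part of the examination paper. Setting p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p → ∞ the supremum distance.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points.

    Hypothetical helper for illustration only.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: the largest per-attribute difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical sample points.
a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)         # |3| + |4| = 7.0
euclidean = minkowski(a, b, 2)         # sqrt(9 + 16) = 5.0
supremum = minkowski(a, b, math.inf)   # max(3, 4) = 4.0
```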
  • 2.
    9. Which two come nearest to each other:
       a. Association Rules & Classification
       b. Classification & Prediction
       c. Classification & Clustering
       d. Association Rules & Clustering

    10. Fraud detection can be done using:
       a. Temporal ARs
       b. Classification
       c. Clustering
       d. Prediction

    11. Association rules X⇒Y & Y⇒X both exist for a given min_sup and min_conf. Pick the correct statement(s):
       a. Both ARs have the same support & confidence
       b. Both ARs have different support & confidence
       c. Support is the same but not confidence
       d. Confidence is the same but not support

    12. The AR Bread, Butter ⇒ Jam is an example of a:
       a. Boolean, Quantitative AR
       b. Boolean, Multilevel AR
       c. Multidimensional, Multilevel AR
       d. Boolean, Single-dimensional AR

    13. In the sampling algorithm, if all the large itemsets are in the set of potentially large itemsets generated from the sample, then the number of database scans needed to find all large itemsets is:
       a. 2
       b. 3
       c. 1
       d. 0

    14. In market-basket analysis, for an association rule to have business value, it should have:
       a. Confidence
       b. Support
       c. Both
       d. None

    15. In the Apriori algorithm, if there are 50 large 1-itemsets, then the number of candidate 2-itemsets will be:
       a. 50
       b. 25
       c. 1230
       d. 50! (50 factorial)

    16. Pick the odd one out:
       a. Some patients tend to develop reactions after two months with this combination of drugs
       b. Any person who buys a car also buys a steering lock
       c. Flooding on the east coast occurs only during the monsoon
       d. A drop in atmospheric pressure precedes rainfall in 60% of the cases

    17. Pick the odd one out:
       a. Apriori Algorithm
       b. Sampling Algorithm
       c. Frequent-Pattern Growth Algorithm
       d. Partitioning Algorithm

    18. For the AR A⇒B, the confidence:
       a. Decreases with an increase in the frequency of B
       b. Increases with an increase in the frequency of B
       c. Is not affected by the frequency of B
       d. Is not affected by the frequency of A
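The support and confidence measures that Questions 11, 14 and 18 rely on can be sketched as follows. This is an illustrative sketch, not a reference solution: the helper names `count`, `support` and `confidence` are assumptions, and the data are the five transactions listed in Part B, Question 2 of this paper. Support is computed over the union of both sides of a rule, while confidence divides by the count of the antecedent only.

```python
# The five transactions from Part B, Question 2.
transactions = [
    {"Bread", "Jelly", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Milk", "Butter"},
    {"Coke", "Bread"},
    {"Coke", "Milk"},
]


def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)


def support(itemset, db):
    """Fraction of transactions containing the whole itemset."""
    return count(itemset, db) / len(db)


def confidence(antecedent, consequent, db):
    """Confidence of the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return count(both, db) / count(antecedent, db)


sup = support({"Bread", "Butter"}, transactions)                     # 3/5 = 0.6
conf_bread_butter = confidence({"Bread"}, {"Butter"}, transactions)  # 3/4 = 0.75
conf_butter_bread = confidence({"Butter"}, {"Bread"}, transactions)  # 3/3 = 1.0
```

In this data set the two rules share one support value because support depends only on the combined itemset, whereas the two confidences differ because the antecedent counts differ.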
  • 3.
    19. Pick the correct statement about decision tree based classification:
       a. Model underfitting & overfitting can happen together
       b. Model overfitting is a more serious problem
       c. Model underfitting is a more serious problem
       d. Model underfitting is due to the presence of noise

    20. Ensemble methods are used to:
       a. Evaluate classifier accuracy
       b. Compare the goodness of clustering algorithms
       c. Improve classifier accuracy
       d. Compare two classifiers

    21. Example(s) of ensemble methods:
       a. Boosting
       b. Bootstrapping
       c. Bagging
       d. K-fold cross validation

    22. Which type of classifier would you prefer? A classifier with:
       a. Low training error & high generalization error
       b. High training error & high generalization error
       c. High training error & low generalization error
       d. Zero training error & high generalization error

    23. As the complexity of a classifier model increases, the training error:
       a. Increases
       b. Decreases
       c. Remains the same
       d. Cannot say

    24. Mark the statement(s) that are true of distance-based classification algorithms like k-Nearest Neighbors:
       a. Sensitive to the choice of k
       b. Classifying unknown records is expensive
       c. Lazy learner
       d. Different levels of variation in different attributes can lead to wrong classification

    25. Model underfitting leads to:
       a. High training error & low generalization error
       b. Low training error & high generalization error
       c. Zero training error & high generalization error
       d. High training error & high generalization error

    26. Distance between clusters can be measured using:
       a. Single link
       b. Average link
       c. Centroid
       d. Medoid

    27. K-means & K-medoids are:
       a. Hierarchical agglomerative methods
       b. Partitioning methods
       c. Density-based methods
       d. Hierarchical divisive methods

    28. Hierarchical divisive methods for clustering:
       a. Single link
       b. Complete link
       c. Average link
       d. Minimum spanning tree
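The inter-cluster distance measures named in Questions 26 and 28 can be written down compactly. The sketch below is illustrative only: the function names and the two sample clusters are assumptions, and Euclidean distance is assumed as the point-to-point measure (via `math.dist`, available from Python 3.8).

```python
import math
from itertools import product


def single_link(c1, c2):
    """Smallest distance between any pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(c1, c2))


def complete_link(c1, c2):
    """Largest distance between any pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(c1, c2))


def average_link(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    total = sum(math.dist(p, q) for p, q in product(c1, c2))
    return total / (len(c1) * len(c2))


def centroid(cluster):
    """Attribute-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_link(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))


# Hypothetical clusters, chosen only to make the measures easy to check by hand.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (3.0, 2.0)]

d_single = single_link(left, right)      # 3.0
d_complete = complete_link(left, right)  # sqrt(13)
d_average = average_link(left, right)
d_centroid = centroid_link(left, right)  # 3.0
```

For any pair of clusters the average-link value lies between the single-link and complete-link values, which is one quick sanity check on an implementation.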
  • 4.
    29. The K-modes method for data clustering is:
       a. Similar to K-means
       b. Similar to K-medoids
       c. A variant of K-means for clustering categorical data
       d. An efficient version of K-means in terms of convergence

    30. The number of cost calculations in K-medoids for n data points and K clusters is:
       a. n!(n-K)!
       b. n(n-K) * number of iterations
       c. (n-K)!
       d. n*K

    31. Mark the statement(s) that are true for the K-means clustering algorithm:
       a. Running time is O(tkn), where t = no. of iterations, n = no. of data points, & k = no. of clusters
       b. Sensitive to initial seeds
       c. Sensitive to noise & outliers
       d. More efficient than K-medoids

    32. For the DBSCAN algorithm, pick the incorrect statement:
       a. Density reachability is asymmetric
       b. Density connectivity is symmetric
       c. Both density reachability and connectivity are symmetric
       d. Arbitrary shaped clusters can be found

    33. Multi-relational data mining is primarily concerned with mining:
       a. Multidimensional ARs
       b. Multidimensional data
       c. Data in multiple relations
       d. Multiple patterns

    34. Approaches to multi-relational data mining:
       a. Propositionalization
       b. Dynamic Programming
       c. Inductive Logic Programming
       d. Statistical Relational Learning

    35. OLE DB for DM is a:
       a. Industry standard process model for DM
       b. Microsoft DM solution
       c. Guideline for ISVs concerned with developing DM software
       d. Collection of APIs

    36. In CRISP-DM, the model assessment is done during the:
       a. Data preparation phase
       b. Modeling phase
       c. Evaluation phase
       d. Deployment phase

    37. In document representation, stemming means:
       a. Removal of stop words
       b. Removal of duplicate words
       c. Reducing words to their root form
       d. All of the above

    38. Web Usage Mining can be used for:
       a. Personalization
       b. Searching
       c. Site modification
       d. Target advertising
  • 5.
    PART B – OPEN BOOK
    Weightage: 21%

    1. A classifier is tested on a set of test data. The classifier output and the correct class are shown below. Draw the confusion matrix for the classifier. [2]

       Srl No.   Classifier Output   Correct Class
       1         C1                  C2
       2         C1                  C1
       3         C1                  C3
       4         C2                  C2
       5         C2                  C2
       6         C2                  C2
       7         C3                  C1
       8         C3                  C1
       9         C3                  C1

    2. Consider the 5 transactions given below. If minimum support is 30% and minimum confidence is 80%, determine the frequent itemsets and association rules using the Apriori algorithm. [3]

       Transaction   Items
       T1            Bread, Jelly, Butter
       T2            Bread, Butter
       T3            Bread, Milk, Butter
       T4            Coke, Bread
       T5            Coke, Milk

    3. Use the complete link algorithm to cluster the following data points, giving equal weightage to all the attributes. Find the number of natural clusters present in the data. Also solve the problem with the K-medoids (PAM) method and compare the results. (Use Euclidean distance wherever needed.) [1+1+1.5+2+0.5]

       Income ($ 000's)   Plot Size (000's sq. ft.)
       60                 18.4
       87                 23.6
       110.1              19.2
       75                 19.6
       64.8               17.2
       49.2               17.6

  • 6.
    4. The table below summarizes a dataset with three attributes A, B, & C and two class labels X & Y.

       A   B   C   Class X   Class Y
       T   T   T   5         0
       F   T   T   0         20
       T   F   T   20        0
       F   F   T   0         5
       T   T   F   0         0
       F   T   F   25        0
       T   F   F   0         0
       F   F   F   0         25

       a. Build a decision tree using ID3 and deduce the classification rules.
       b. What is the training error (as a percentage) of the decision tree of a.?
       c. Did model underfitting or overfitting take place? Justify your answer.
       d. For the given test data, find the generalization error of the classifier:

          A   B   C   Class
          F   T   T   X
          F   F   T   Y
          T   F   T   X
          F   F   T   Y
          T   T   T   X

       [3+1+1+1]

    5. Use Data Mining to build simple pattern recognition software. The software should be able to identify capital letters of the English alphabet from A to Z. Model the pattern recognition problem as a Data Mining problem. Give the schema of the data set that you will use, with some sample values. [5]
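As an illustration of how the confusion matrix asked for in Part B, Question 1 can be tabulated, here is a minimal sketch. It assumes that in the flattened table the first class column is the classifier output and the second is the correct class; the variable names are chosen for this example only. Rows of the resulting matrix are correct classes and columns are classifier outputs.

```python
from collections import Counter

# (classifier output, correct class) for serial numbers 1-9, under the
# column-order assumption stated above.
results = [
    ("C1", "C2"), ("C1", "C1"), ("C1", "C3"),
    ("C2", "C2"), ("C2", "C2"), ("C2", "C2"),
    ("C3", "C1"), ("C3", "C1"), ("C3", "C1"),
]
classes = ["C1", "C2", "C3"]

# Count each (correct, predicted) combination.
counts = Counter((actual, predicted) for predicted, actual in results)

# matrix[i][j] = number of records of correct class i classified as class j.
matrix = [[counts[(actual, predicted)] for predicted in classes]
          for actual in classes]

# Diagonal entries are the correctly classified records.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(results)

for label, row in zip(classes, matrix):
    print(label, row)
```

Under the stated assumption the diagonal holds four of the nine records. If the two class columns are read in the opposite order, the matrix is simply transposed and the accuracy is unchanged.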