Cover a number of topics on this subject. Goal: commentary; cover system integration and architectural aspects, shortcomings of existing systems, and future directions. Not a teaching lecture. Details of specific algorithms covered on demand. Trailer: details can be addressed later.
Insert graph of entropy?
2 major problems:
o How to find split points that define node tests
o How to partition the data having found split points
o CART, C4.5:
  - DFS
  - Repeated sorting at each node (continuous attributes)
o SLIQ:
  - SortOnce technique using attribute lists
Transcript
1.
Data Mining Algorithms
Prof. S. Sudarshan, CSE Dept, IIT Bombay
Most slides courtesy Prof. Sunita Sarawagi, School of IT, IIT Bombay
9.
Basic algorithm for tree building
Greedy top-down construction.
Gen_Tree (node, data)
    make node a leaf? If yes, stop
    find best attribute and best split on attribute (selection criteria)
    partition data on split condition
    for each child j of node
        Gen_Tree (node_j, data_j)
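The greedy recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the slides' implementation: it assumes numeric attributes, binary tests of the form value < threshold, and the Gini index as the selection criterion. The names `gini`, `best_split` and `gen_tree` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold) with the lowest weighted Gini,
    or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] < t]
            right = [y for r, y in zip(rows, labels) if r[a] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (a, t), score
    return best

def gen_tree(rows, labels):
    """Greedy top-down construction: stop, or split and recurse."""
    split = None if len(set(labels)) == 1 else best_split(rows, labels)
    if split is None:
        # Make the node a leaf labelled with the majority class.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {"attr": a, "threshold": t,
            "left": gen_tree(*zip(*left)),
            "right": gen_tree(*zip(*right))}

# Hypothetical toy data: one numeric attribute, two classes.
tree = gen_tree([(1,), (2,), (3,), (4,)], ["a", "a", "b", "b"])
```

Note that this naive `best_split` rescans the data for every candidate threshold; the attribute-list technique on the later slides exists to avoid exactly that cost.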
Evaluate this split-form for every value in an attribute list
To evaluate splits on attribute A for a given tree-node:
initialize class-histogram of left child to zeroes;
initialize class-histogram of right child to same as its parent;
for each record in the attribute list do
    evaluate splitting index for value(A) < record.value;
    using class label of the record, update class histograms;
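A minimal Python sketch of this single scan follows. It assumes the attribute list is already sorted by value and that the splitting index is the weighted Gini index. The function names and the six sample records are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def gini(hist):
    """Gini index of a class histogram (class -> count)."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def evaluate_continuous_splits(attr_list):
    """attr_list: (value, class_label) pairs sorted by value.
    Returns [(threshold, weighted Gini or None)] for tests value < threshold.
    None marks a split whose left child would be empty."""
    left = Counter()                                   # starts at zeroes
    right = Counter(label for _, label in attr_list)   # starts as the parent
    n = len(attr_list)
    results, prev = [], None
    for value, label in attr_list:
        if value != prev:  # evaluate once per distinct value
            n_left = sum(left.values())
            if n_left == 0:
                score = None
            else:
                score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
            results.append((value, score))
        # Move the record from the right histogram to the left one.
        left[label] += 1
        right[label] -= 1
        prev = value
    return results

# Illustrative (assumed) attribute list for Age, sorted by value.
ages = [(17, "High"), (20, "High"), (23, "High"),
        (32, "Low"), (43, "High"), (68, "Low")]
scores = dict(evaluate_continuous_splits(ages))
```

Because the histograms are updated incrementally, every candidate split on the attribute is scored in one pass over the list.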
25.
Finding Split Points: Continuous Attrib.
[Figure: attribute list with the position of the cursor in the scan, and the state of the left-child and right-child class histograms at each position. GINI index per cursor position: 0 (Age < 17): undef; 1 (Age < 20): 0.4; 3 (Age < 32): 0.222; 6 (end of list): undef.]
Consider splits of the form: value(A) in {x1, x2, ..., xn}
Example: CarType in {family, sports}
Evaluate this split-form for subsets of domain(A)
To evaluate splits on attribute A for a given tree node:
initialize class/value matrix of node to zeroes;
for each record in the attribute list do
    increment appropriate count in matrix;
evaluate splitting index for various subsets using the constructed matrix;
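The same idea for a categorical attribute, as a hedged Python sketch: one scan fills the class/value matrix, and candidate subsets are then scored from the matrix alone. It assumes a small domain so that exhaustive subset enumeration is affordable, and it uses the weighted Gini index. The names and sample records are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def gini(hist):
    """Gini index of a class histogram (class -> count)."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def evaluate_categorical_splits(attr_list):
    """attr_list: (value, class_label) pairs, in any order.
    Returns {subset tuple: weighted Gini} for tests value in subset."""
    matrix = defaultdict(Counter)          # value -> class -> count
    for value, label in attr_list:
        matrix[value][label] += 1
    total = Counter()
    for hist in matrix.values():
        total.update(hist)
    n = sum(total.values())
    values = sorted(matrix)
    results = {}
    # Every non-empty proper subset; a subset and its complement score the same.
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left
            n_left = sum(left.values())
            results[subset] = (n_left * gini(left)
                               + (n - n_left) * gini(right)) / n
    return results

# Illustrative (assumed) attribute list for CarType.
cars = [("family", "High"), ("sports", "High"), ("sports", "High"),
        ("family", "Low"), ("truck", "Low"), ("family", "High")]
car_scores = evaluate_categorical_splits(cars)
```

The number of subsets grows exponentially with the domain size, so for large domains some heuristic search over subsets would be needed instead of the full enumeration shown here.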
27.
Finding Split Points: Categorical Attrib.
[Figure: attribute list, the class/value matrix built from it, and the left-child and right-child class histograms for each candidate subset. GINI index: CarType in {family}: 0.444; CarType in {sports}: 0.333; CarType in {truck}: 0.267.]
When a node is split, its attribute lists must be divided between the two children
To split the attribute lists of a given node:
for the list of the attribute used to split this node do
    use the split test to divide the records;
    collect the record ids;
build a hashtable from the collected ids;
for the remaining attribute lists do
    use the hashtable to divide each list;
build class-histograms for each new leaf;
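A small Python sketch of this partitioning step follows, with a `dict` standing in for the hash table. It assumes each attribute-list entry is a (value, class label, record id) triple; the function name and sample records are illustrative assumptions.

```python
from collections import Counter

def split_attribute_lists(attr_lists, split_attr, test):
    """attr_lists: {attribute name: [(value, class_label, rid), ...]}.
    test: predicate on the splitting attribute's value; True sends a
    record to the left child. Returns (left lists, right lists, hash table)."""
    # Scan the splitting attribute's list and record each rid's side.
    side = {}
    for value, _label, rid in attr_lists[split_attr]:
        side[rid] = "Left" if test(value) else "Right"
    # Probe the hash table to divide every list; filtering in order
    # means each child's list stays sorted without re-sorting.
    left, right = {}, {}
    for name, entries in attr_lists.items():
        left[name] = [e for e in entries if side[e[2]] == "Left"]
        right[name] = [e for e in entries if side[e[2]] == "Right"]
    return left, right, side

# Illustrative (assumed) attribute lists; Age is sorted by value.
lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarType": [("family", "High", 0), ("sports", "High", 1),
                ("sports", "High", 2), ("family", "Low", 3),
                ("truck", "Low", 4), ("family", "High", 5)],
}
left, right, table = split_attribute_lists(lists, "Age", lambda v: v < 32)

# Class histograms for the two new leaves.
left_hist = Counter(label for _, label, _ in left["Age"])
right_hist = Counter(label for _, label, _ in right["Age"])
```

Only the splitting attribute's list is tested against the split condition; all other lists are divided purely by record id lookup.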
29.
Performing the Splits: Example
Split test: Age < 32
Hash table (record id: child):
0: Left
1: Left
2: Right
3: Right
4: Right
5: Left
Idea: a branch of the tree is over-fitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the branch itself occupies
Prune the branch if it is over-fitted
Philosophy: use the tree that minimizes the description length of the training data
Choose a subset of the original features using random projections or feature selection techniques
Transform the original features using statistical methods such as Principal Component Analysis
Define domain specific similarity measures: e.g. for images define features like number of objects, color histogram; for time series define shape based measures.
Define non-distance based (model-based) clustering methods: