The document discusses classification, which involves using a training dataset containing records with attributes and classes to build a model that can predict the class of previously unseen records. It describes dividing the dataset into training and test sets, with the training set used to build the model and test set to validate it. Decision trees are presented as a classification method that splits records recursively into partitions based on attribute values to optimize a metric like GINI impurity, allowing predictions to be made by following the tree branches. The key steps of building a decision tree classifier are outlined.
* Logistic regression, logistic loss (log loss)
* stochastic optimization
* adding new features, generalized linear model
* Kernel trick, intro to SVM
* Overfitting
* Decision trees for classification and regression
* Building trees greedily: Gini index, entropy
* Trees fighting with overfitting: pre-stopping and post-pruning
* Feature importances
This presentation covers Decision Tree as a supervised machine learning technique, talking about Information Gain method and Gini Index method with their related Algorithms.
* Logistic regression, logistic loss (log loss)
* stochastic optimization
* adding new features, generalized linear model
* Kernel trick, intro to SVM
* Overfitting
* Decision trees for classification and regression
* Building trees greedily: Gini index, entropy
* Trees fighting with overfitting: pre-stopping and post-pruning
* Feature importances
This presentation covers Decision Tree as a supervised machine learning technique, talking about Information Gain method and Gini Index method with their related Algorithms.
This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques
Classification: Basic Concepts and Decision Treessathish sak
Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Date: September 25, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
The KNN (K Nearest Neighbors) algorithm analyzes all available data points and classifies this data, then classifies new cases based on these established categories. It is useful for recognizing patterns and for estimating. The KNN Classification algorithm is useful in determining probable outcome and results, and in forecasting and predicting results, given the existence of multiple variables.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques.This course is all about the data mining that how we get the optimized results. it included with all types and how we use these techniques
Classification: Basic Concepts and Decision Treessathish sak
Given a collection of records (training set )
Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Date: September 25, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
The KNN (K Nearest Neighbors) algorithm analyzes all available data points and classifies this data, then classifies new cases based on these established categories. It is useful for recognizing patterns and for estimating. The KNN Classification algorithm is useful in determining probable outcome and results, and in forecasting and predicting results, given the existence of multiple variables.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
2. Given a collection of records (training set )
◦ Each record contains a set of attributes, one of the
attributes is the class.
Find a model for class attribute as a function of
the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
◦ A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
3. Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
categorical
categorical
continuous
class
Refund Marital
Status
Taxable
Income Cheat
No Single 75K ?
Yes Married 50K ?
No Married 150K ?
Yes Divorced 90K ?
No Single 40K ?
No Married 80K ?
10
Test
Set
Training
Set
Model
Learn
Classifier
4. Direct Marketing
◦ Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
◦ Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier
model.
From [Berry & Linoff] Data Mining Techniques, 1997
5. Fraud Detection
◦ Goal: Predict fraudulent cases in credit card
transactions.
◦ Approach:
Use credit card transactions and the information on its account-
holder as attributes.
When does a customer buy, what does he buy, how often he pays
on time, etc
Label past transactions as fraud or fair transactions. This forms
the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
6. Customer Attrition:
◦ Goal: To predict whether a customer is likely to be lost to
a competitor.
◦ Approach:
Use detailed record of transactions with each of the past and
present customers, to find attributes.
How often the customer calls, where he calls, what time-of-the
day he calls most, his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
7. Sky Survey Cataloging
◦ Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
◦ Approach:
Segment the image.
Measure image attributes (features) - 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars, some
of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
8. Early
Intermediate
Late
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Class:
• Stages of Formation
Attributes:
• Image features,
• Characteristics of light
waves received, etc.
Courtesy: http://aps.umn.edu
9. Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
18. Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
categorical
categorical
continuous
class
19. Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
categorical
categorical
continuous
class
Refund
MarSt
TaxInc
YESNO
NO
NO
Yes No
MarriedSingle, Divorced
< 80K > 80K
Splitting Attributes
Training Data Model: Decision Tree
20. Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
categorical
categorical
continuous
class
MarSt
Refund
TaxInc
YESNO
NO
NO
Yes No
Married
Single,
Divorced
< 80K > 80K
There could be more than one tree that fits
the same data!
27. Collection of data objects and
their attributes
An attribute is a property or
characteristic of an object
◦ Examples: eye color of a
person, temperature, etc.
◦ Attribute is also known as
variable, field, characteristic, or
feature
A collection of attributes
describe an object
◦ Object is also known as record,
point, case, sample, entity, or
instance
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Attributes
Objects
28. Attribute values are numbers or symbols assigned
to an attribute
Distinction between attributes and attribute values
◦ Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
◦ Different attributes can be mapped to the same set of
values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different
ID has no limit but age has a maximum and minimum value
29. There are different types of attributes
◦ Nominal
Examples: ID numbers, eye color, zip codes
◦ Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
◦ Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
◦ Ratio
Examples: temperature in Kelvin, length, time, counts
30. The type of an attribute depends on which of the
following properties it possesses:
◦ Distinctness: = ≠
◦ Order: < >
◦ Addition: + -
◦ Multiplication: * /
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & addition
◦ Ratio attribute: all 4 properties
31. Attribute
Type
Description Examples Operations
Nominal The values of a nominal
attribute are just different
names, i.e., nominal
attributes provide only
enough information to
distinguish one object from
another. (=, ≠)
zip codes,
employee ID
numbers, eye
color, sex: {male,
female}
mode, entropy,
contingency
correlation, χ2
test
Ordinal The values of an ordinal
attribute provide enough
information to order objects.
(<, >)
hardness of
minerals, {good,
better, best},
grades, street
numbers
median,
percentiles,
rank
correlation, run
tests, sign tests
Interval For interval attributes, the
differences between values
are meaningful, i.e., a unit
of measurement exists.
(+, - )
calendar dates,
temperature in
Celsius or
Fahrenheit
mean, standard
deviation,
Pearson's
correlation, t
and F tests
Ratio For ratio variables, both
differences and ratios are
meaningful. (*, /)
temperature in
Kelvin, monetary
quantities, counts,
age, mass, length,
electrical current
geometric
mean,
harmonic
mean, percent
variation
32. Attribute
Level
Transformation Comments
Nominal Any permutation of values If all employee ID numbers
were reassigned, would it
make any difference?
Ordinal An order preserving change of
values, i.e.,
new_value = f(old_value)
where f is a monotonic function.
An attribute encompassing
the notion of good, better
best can be represented
equally well by the values
{1, 2, 3} or by { 0.5, 1,
10}.
Interval new_value =a * old_value + b
where a and b are constants
Thus, the Fahrenheit and
Celsius temperature scales
differ in terms of where
their zero value is and the
size of a unit (degree).
Ratio new_value = a * old_value Length can be measured in
meters or feet.
33. Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of
documents
◦ Often represented as integer variables.
◦ Note: binary attributes are a special case of discrete attributes
Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight.
◦ Practically, real values can only be measured and represented using
a finite number of digits.
◦ Continuous attributes are typically represented as floating-point
variables.
34. Greedy strategy.
◦ Split the records based on an attribute test that optimizes
certain criterion.
Issues
◦ Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
◦ Determine when to stop splitting
35. Greedy strategy.
◦ Split the records based on an attribute test that optimizes
certain criterion.
Issues
◦ Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
◦ Determine when to stop splitting
36. Depends on attribute types
◦ Nominal
◦ Ordinal
◦ Continuous
Depends on number of ways to split
◦ 2-way split
◦ Multi-way split
37. Multi-way split: Use as many partitions as distinct
values.
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
CarType
Family
Sports
Luxury
CarType
{Family,
Luxury} {Sports}
CarType
{Sports,
Luxury} {Family} OR
38. Multi-way split: Use as many partitions as distinct
values.
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
What about this split?
Size
Small
Medium
Large
Size
{Medium,
Large} {Small}
Size
{Small,
Medium} {Large}
OR
Size
{Small,
Large} {Medium}
39. Different ways of handling
◦ Discretization to form an ordinal categorical attribute
Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
◦ Binary Decision: (A < v) or (A ≥ v)
consider all possible splits and finds the best cut
can be more compute intensive
40.
41. Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the
quality of split is computed as,
where, ni = number of records at child i,
n = number of records at node p.
∑=
=
k
i
i
split iGINI
n
n
GINI
1
)(
43. For each distinct value, gather counts for each class in
the dataset
Use the count matrix to make decisions
CarType
{Sports,
Luxury}
{Family}
C1 3 1
C2 2 4
Gini 0.400
CarType
{Sports}
{Family,
Luxury}
C1 2 2
C2 1 5
Gini 0.419
CarType
Family Sports Luxury
C1 1 2 1
C2 4 1 1
Gini 0.393
Multi-way split Two-way split
(find best partition of values)
44. Use Binary Decisions based on one
value
Several Choices for the splitting value
◦ Number of possible splitting values
= Number of distinct values
Each splitting value has a count matrix
associated with it
◦ Class counts in each of the
partitions, A < v and A ≥ v
Simple method to choose best v
◦ For each v, scan the database to
gather count matrix and compute its
Gini index
◦ Computationally Inefficient!
Repetition of work.
45. For efficient computation: for each attribute,
◦ Sort the attribute on values
◦ Linearly scan these values, each time updating the count matrix and
computing gini index
◦ Choose the split position that has the least gini index
Cheat No No No Yes Yes Yes No No No No
Taxable Income
60 70 75 85 90 95 100 120 125 220
55 65 72 80 87 92 97 110 122 172 230
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Split Positions
Sorted Values
46. Entropy at a given node t:
(NOTE: p( j | t) is the relative frequency of class j at node t).
◦ Measures homogeneity of a node.
Maximum (log nc) when records are equally distributed among
all classes implying least information
Minimum (0.0) when all records belong to one class, implying
most information
◦ Entropy based computations are similar to the GINI index
computations
∑−= j
tjptjptEntropy )|(log)|()(
47. Many Algorithms:
◦ Hunt’s Algorithm (one of the earliest)
◦ CART
◦ ID3, C4.5
◦ SLIQ,SPRINT