The document discusses decision trees, a predictive modeling technique that can be used for segmentation. It gives examples of segmenting a population of customers into subgroups based on attributes such as employment status and income. The key aspects covered include how a tree is constructed from the root node down to the leaf nodes, the different algorithms for building decision trees, measures such as information gain for choosing the best attribute to split on, and techniques for validating and pruning trees to avoid overfitting.
2. What is the need of segmentation?
Problem:
• 10,000 customers: we know their age, city, income, employment status, and designation
• You have to sell 100 Blackberry phones (each costs $1000) to the people in this group, and you have a maximum of 7 days
• If you give a demo to each individual, 10,000 demos will take more than one year. How will you sell the maximum number of phones with the minimum number of demos?
DataAnalysisCourse
VenkatReddy
2
3. What is the need of segmentation?
Solution
• Divide the whole population into two groups: employed / unemployed
• Further divide the employed population into two groups: high / low salary
• Further divide the high-salary group into two groups: high / low designation
10,000 customers
  Unemployed: 3,000
  Employed: 7,000
    Low salary: 5,000
    High salary: 2,000
      Low designation: 1,800
      High designation: 200
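The segmentation above can be sketched in code. This is only an illustration: the tuple layout `(label, count, children)` and the names `SEGMENTS` and `leaf_segments` are assumptions made for this example, while the labels and counts are the ones shown on the slide.

```python
# Hypothetical encoding of the segmentation tree from the slide:
# each node is (label, count, children).
SEGMENTS = (
    "All customers", 10_000, [
        ("Unemployed", 3_000, []),
        ("Employed", 7_000, [
            ("Low salary", 5_000, []),
            ("High salary", 2_000, [
                ("Low designation", 1_800, []),
                ("High designation", 200, []),
            ]),
        ]),
    ],
)


def leaf_segments(node, path=()):
    """Yield (path, count) for every leaf segment of the tree."""
    label, count, children = node
    path = path + (label,)
    if not children:
        yield path, count
        return
    # Sanity check: the child groups must partition the parent group.
    assert sum(child[1] for child in children) == count
    for child in children:
        yield from leaf_segments(child, path)


if __name__ == "__main__":
    for path, count in leaf_segments(SEGMENTS):
        print(" > ".join(path), count)
```

Each printed line is one final segment with its size, so the demo effort can be compared across segments instead of across 10,000 individuals.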
4. Decision Trees
Decision Tree Vocabulary
• Drawn top-to-bottom or left-to-right
• Top (or left-most) node = Root Node
• Descendant node(s) = Child Node(s)
• Bottom (or right-most) node(s) = Leaf Node(s)
• Unique path from root to each leaf = Rule
[Figure: example tree with a root node at the top, child nodes below it, and leaf nodes at the ends of the branches]
Decision Tree Types
• Binary trees – only two choices in each split. Can be non-uniform (uneven) in depth
• N-way trees (for example, ternary trees) – three or more choices in at least one of the splits (3-way, 4-way, etc.)
5. Decision Tree Algorithms
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3
• C4.5
• SLIQ
• SPRINT
• CHAID
6. Decision Trees Algorithm – Answers?
(1) Which attribute to start with?
(2) Which split to consider?
(3) Which attribute to proceed with?
(4) When to stop / come to a conclusion?
7. Example: Splitting with respect to an attribute
• Example: We want to sell some apartments. The population contains 67 persons. We want to test the response based on the splits given by two attributes: 1) owning a car 2) gender
Split With Respect to ‘Owning a car’
Total population: 67 [28+, 39-]
  Yes: 29 [25+, 4-]
  No: 38 [3+, 35-]
Split With Respect to ‘Gender’
Total population: 67 [28+, 39-]
  Male: 40 [19+, 21-]
  Female: 27 [9+, 18-]
• In this example there are 25 positive responses from people who own a car and 3 positive responses from people who do not own a car
8. Example: Splitting with respect to an attribute
Split With Respect to ‘Owning a car’
Total population: 67 [28+, 39-]
  Yes: 29 [25+, 4-]
  No: 38 [3+, 35-]
Split With Respect to ‘Marital status’
Total population: 67 [28+, 39-]
  Yes (married): 40 [25+, 15-]
  No (unmarried): 27 [3+, 24-]
• Which is the best split attribute? Owning a car / Gender / Marital status?
• The one that removes the most impurity
9. Best Splitting attribute
• The splitting is always done with respect to the binary objective variable (0/1 type)
• The best split at a root (or child) node is the one that does the best job of separating the data into groups where a single class (either 0 or 1) predominates in each group
• The measure used to evaluate a potential split is purity
• The best split is the one that increases the purity of the subsets by the greatest amount
10. Purity (Diversity) Measures:
• Entropy: Characterizes the impurity/diversity of segment (an arbitrary collection
of observations)
• Measure of uncertainty/Impurity
• Expected number of bits to resolve uncertainty
• Entropy measures the information amount in a message
• S is a sample of training examples, $p_+$ is the proportion of positive examples, $p_-$ is the proportion of negative examples
• $\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
• General formula: $\mathrm{Entropy}(S) = -\sum_j p_j \log_2 p_j$
• Entropy is maximum when p=0.5
• Chi-square measure of association
• Gini Index: $\mathrm{Gini}(T) = 1 - \sum_j p_j^2$
• Information Gain Ratio
• Misclassification error
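The three impurity measures that have closed forms on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the choice to pass a list of class proportions are assumptions, not part of the original slides.

```python
from math import log2

def entropy(proportions):
    """Entropy(S) = -sum_j p_j * log2(p_j); terms with p_j = 0 contribute 0."""
    return -sum(p * log2(p) for p in proportions if p > 0)

def gini(proportions):
    """Gini(T) = 1 - sum_j p_j^2."""
    return 1 - sum(p * p for p in proportions)

def misclassification_error(proportions):
    """Error made by always predicting the majority class: 1 - max_j p_j."""
    return 1 - max(proportions)

# A 50/50 node is maximally impure under all three measures.
balanced = [0.5, 0.5]
print(entropy(balanced), gini(balanced), misclassification_error(balanced))
```

All three measures are 0 for a pure node (one class has proportion 1) and largest for a binary node when p = 0.5.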
12. Deciding the best split
Using Entropy
• Entropy([28+,39-]) Overall = -28/67 log2 28/67 – 39/67 log2 39/67 = 98% (Impurity)
• Entropy([25+,4-]) Owning a car = 57%
• Entropy([3+,35-]) No car = 40%
• Entropy([19+,21-]) Male = 99%
• Entropy([9+,18-]) Female = 91%
• Entropy([25+,15-]) Married = 95%
• Entropy([3+,24-]) Unmarried = 50%
• Information Gain = entropyBeforeSplit – entropyAfterSplit
• An easy way to understand information gain: (overall entropy) – (sum of weighted entropy at each node)
• The attribute with maximum information gain is the best split attribute
Using Chi Square Measure for association/Degree of independence
• Chi-square for owning a car = 2.71
• Chi-square for gender = 0.09
• Chi-square for marital status = 1.19
• The attribute with maximum chi-square is the best split attribute
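The information-gain comparison for the three candidate attributes can be worked through in code. The sketch below is a minimal illustration that takes the [positive, negative] counts from the example slides; the helper names `entropy` and `information_gain` and the `splits` dictionary are assumptions introduced here.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a node holding `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n > 0)

def information_gain(parent, children):
    """Overall entropy minus the population-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

parent = (28, 39)  # total population: 67 [28+, 39-]
splits = {
    "owning a car": [(25, 4), (3, 35)],
    "gender": [(19, 21), (9, 18)],
    "marital status": [(25, 15), (3, 24)],
}

gains = {name: information_gain(parent, kids) for name, kids in splits.items()}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best split attribute:", best)
```

With these counts, owning a car gives by far the largest gain, marital status comes second, and gender is close to zero, so the first split would be on car ownership.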
13. The Decision tree algorithm
Until stopped:
1. Select a leaf node
2. Select one of the unused attributes
• Partition the node population and calculate information gain.
• Find the split with maximum information gain for this attribute
3. Repeat this for all attributes
• Find the best splitting attribute along with the best split rule
4. Split the node using that attribute
5. Go to each child node and repeat steps 2 to 4
Stopping criteria:
• Each leaf-node contains examples of one type
• Algorithm ran out of attributes
• No further significant information gain
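The loop and stopping criteria above can be sketched as a short recursive function. This is an ID3-style illustration under stated assumptions, not the exact procedure of any algorithm named in the slides: attributes are categorical, each row is a dict, the tree is a nested dict, and the names `build_tree` and `min_gain` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Group the labels by the value each row takes on `attr`.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

def build_tree(rows, labels, attrs, min_gain=1e-9):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: the node is pure, or the algorithm ran out of attributes.
    if len(set(labels)) == 1 or not attrs:
        return {"label": majority}
    # Steps 2-3: evaluate every unused attribute and keep the best one.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Stop: no further significant information gain.
    if information_gain(rows, labels, best) < min_gain:
        return {"label": majority}
    # Step 4: split the node; step 5: recurse into each child.
    node = {"attr": best, "majority": majority, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining, min_gain
        )
    return node

# Tiny made-up data set in which car ownership fully determines the response.
rows = [
    {"car": "yes", "gender": "m"},
    {"car": "yes", "gender": "f"},
    {"car": "no", "gender": "m"},
    {"car": "no", "gender": "f"},
]
labels = ["+", "+", "-", "-"]
tree = build_tree(rows, labels, ["car", "gender"])
print(tree)
```

The sketch uses multiway splits on categorical values; algorithms such as CART restrict themselves to binary splits and would search over split rules as well as attributes.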
14. Decision Trees Algorithm – Answers?
(1) Which attribute to start with?
(2) Which split to consider?
(3) Which attribute to proceed with?
(4) When to stop / come to a conclusion?
15. Tree validation
• Confusion Matrix:
                    PREDICTED CLASS
                    Class=Yes   Class=No
ACTUAL  Class=Yes   a (TP)      b (FN)
CLASS   Class=No    c (FP)      d (TN)
• $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
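As an illustration of how the four cells are filled in, the sketch below counts TP, FN, FP and TN from paired lists of actual and predicted labels and then computes accuracy. The function names and the toy label lists are assumptions made for this example.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1  # actual Yes, predicted No
        elif p == positive:
            fp += 1  # actual No, predicted Yes
        else:
            tn += 1
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Made-up labels, for illustration only.
actual = ["Yes", "Yes", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "Yes"]
counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```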
16. Tree validation
• Sometimes cost of misclassification is not equal for both good
and bad.
• We use a cost matrix along with confusion matrix
• C(i|j): Cost of misclassifying class j example as class i
                    PREDICTED CLASS
        C(i|j)      Class=Yes    Class=No
ACTUAL  Class=Yes   C(Yes|Yes)   C(No|Yes)
CLASS   Class=No    C(Yes|No)    C(No|No)
17. Tree Validation
• Which of the two models is better, Model M1 or Model M2?
Cost Matrix
                 PREDICTED CLASS
        C(i|j)   +      -
ACTUAL  +        -1     100
CLASS   -        1      0

Model M1
                 PREDICTED CLASS
                 +      -
ACTUAL  +        150    40
CLASS   -        60     250
Accuracy = 80%
Cost = 3910

Model M2
                 PREDICTED CLASS
                 +      -
ACTUAL  +        250    45
CLASS   -        5      200
Accuracy = 90%
Cost = 4255
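The accuracy and cost figures for the two models follow directly from the matrices: accuracy is the diagonal total over the grand total, and cost is the sum of each confusion-matrix cell multiplied by the matching cost-matrix cell. A minimal sketch, with matrices stored as nested lists (rows are the actual class, columns the predicted class) and with function names chosen for this example:

```python
def accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def total_cost(confusion, cost):
    """Sum over all cells of count * cost, cell by cell."""
    return sum(
        n * c
        for conf_row, cost_row in zip(confusion, cost)
        for n, c in zip(conf_row, cost_row)
    )

cost = [[-1, 100], [1, 0]]  # rows: actual +, -; columns: predicted +, -
m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]

for name, model in (("M1", m1), ("M2", m2)):
    print(name, accuracy(model), total_cost(model, cost))
```

M2 has the higher accuracy but also the higher cost, because each false negative costs 100 and M2 makes 45 of them against 40 for M1. Which model is better therefore depends on whether accuracy or cost is the criterion.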
18. Validation - Example
Total population: 67 [28+, 39-]
  Yes: 29 [25+, 4-]
  No: 38 [3+, 35-]
If having a car is the criterion for buying an apartment, then:
                    PREDICTED CLASS
                    Class=Yes   Class=No
ACTUAL  Class=Yes   25 (TP)     3 (FN)
CLASS   Class=No    4 (FP)      35 (TN)
$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{60}{67} \approx 90\%$
19. CHAID Segmentation
• CHAID- Chi-Squared Automatic Interaction Detector
• CHAID is a non-binary decision tree.
• The decision or split made at each node is still based on a single
variable, but can result in multiple branches.
• The split search algorithm is designed for categorical variables.
• Continuous variables must be grouped into a finite number of bins
to create categories.
• A reasonable number of “equal population bins” can be created for
use with CHAID.
• ex. If there are 1000 samples, creating 10 equal population bins
would result in 10 bins, each containing 100 samples.
• A Chi-square value is computed for each variable and used to
determine the best variable to split on.
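Equal-population binning of a continuous variable can be sketched as follows. This is a plain-Python illustration with a hypothetical function name; it assigns bins by rank, so tied values may fall into different bins, which a production implementation might handle differently.

```python
from collections import Counter

def equal_population_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so that bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# 1000 samples split into 10 bins of 100 samples each, as in the example above.
incomes = [(i * 37) % 1000 for i in range(1000)]  # made-up values, all distinct
bins = equal_population_bins(incomes, 10)
print(Counter(bins))
```

The resulting bin index can then be treated as a categorical variable in the CHAID split search.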
20. CHAID Algorithm
Until stopped:
1. Select a node
2. Select one of the unused attributes
• Partition the node population and calculate Chi square value
• Find the split with maximum Chi square for this attribute
3. Repeat this for all attributes
• Find the best splitting attribute along with the best split rule
4. Split the node using that attribute
5. Go to each child node and repeat steps 2 to 4
Stopping criteria:
• Each leaf-node contains examples of one type
• Algorithm ran out of attributes
• No further significant information gain
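The "calculate Chi square value" step can be illustrated with the standard Pearson statistic on a contingency table of split branch versus response. This is a sketch under assumptions: the slides do not say which chi-square variant or scaling they use, so the absolute values produced here need not match figures quoted elsewhere in the deck; only the ranking of attributes is being illustrated. The function name is hypothetical, and the counts are taken from the apartment example.

```python
def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows are the branches of the split, columns are [positive, negative] responses.
tables = {
    "owning a car": [[25, 4], [3, 35]],
    "gender": [[19, 21], [9, 18]],
    "marital status": [[25, 15], [3, 24]],
}
scores = {name: chi_square(t) for name, t in tables.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under this statistic the attributes rank as owning a car, then marital status, then gender, which is the same order as in the earlier comparison of the three splits.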
21. Overfitting
• Model is too complicated
• Model works well on training data and performs very badly on test data
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well
the tree will perform on previously unseen records
• Need new ways for estimating errors
22. Avoiding Overfitting: Pruning
• Pre-Pruning (Early Stopping Rule)
  • Stop the algorithm before it becomes a fully-grown tree
  • Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
  • More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold
    • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
• Post-pruning
  • Grow the decision tree to its entirety
  • Trim the nodes of the decision tree in a bottom-up fashion
  • If generalization error improves after trimming, replace the sub-tree by a leaf node.
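Post-pruning can be sketched as a bottom-up pass that estimates generalization error on a held-out validation set (often called reduced-error pruning). Everything in this example is an assumption for illustration: the nested-dict tree format (`attr`, `children`, `majority`, `label`), the function names, and the use of validation-set error as the estimate of generalization error.

```python
def predict(node, row):
    """Walk from the root to a leaf; fall back to the majority class on unseen values."""
    while "attr" in node:
        child = node["children"].get(row[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]

def errors(node, rows, labels):
    return sum(predict(node, r) != y for r, y in zip(rows, labels))

def prune(node, rows, labels):
    """Return a pruned copy of the decision, using validation rows that reach this node."""
    if "attr" not in node:
        return node
    # Bottom-up: prune each child first, on the validation rows routed to it.
    for value, child in list(node["children"].items()):
        idx = [i for i, r in enumerate(rows) if r[node["attr"]] == value]
        node["children"][value] = prune(
            child, [rows[i] for i in idx], [labels[i] for i in idx]
        )
    # Replace the sub-tree by a leaf only if the validation error strictly improves.
    leaf = {"label": node["majority"]}
    if errors(leaf, rows, labels) < errors(node, rows, labels):
        return leaf
    return node

# Hand-built tree whose gender split under car=yes does not hold up on validation data.
tree = {
    "attr": "car", "majority": "-",
    "children": {
        "yes": {
            "attr": "gender", "majority": "+",
            "children": {"m": {"label": "+"}, "f": {"label": "-"}},
        },
        "no": {"label": "-"},
    },
}
val_rows = [
    {"car": "yes", "gender": "f"},
    {"car": "yes", "gender": "f"},
    {"car": "yes", "gender": "m"},
    {"car": "no", "gender": "m"},
]
val_labels = ["+", "+", "+", "-"]
pruned = prune(tree, val_rows, val_labels)
print(pruned)
```

In this toy case the gender sub-tree is replaced by a "+" leaf because it misclassifies two validation rows, while the root split on car ownership is kept because it still reduces validation error.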