Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
This presentation was prepared as part of the curriculum studies for CSCI-659 Topics in Artificial Intelligence Course - Machine Learning in Computational Linguistics.
It was prepared under guidance of Prof. Sandra Kubler.
This presentation was prepared as part of the curriculum studies for CSCI-659 Topics in Artificial Intelligence Course - Machine Learning in Computational Linguistics.
It was prepared under guidance of Prof. Sandra Kubler.
Slide explaining the distinction between bagging and boosting while understanding the bias variance trade-off. Followed by some lesser known scope of supervised learning. understanding the effect of tree split metric in deciding feature importance. Then understanding the effect of threshold on classification accuracy. Additionally, how to adjust model threshold for classification in supervised learning.
Note: Limitation of Accuracy metric (baseline accuracy), alternative metrics, their use case and their advantage and limitations were briefly discussed.
Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results.
Ensemble learning usually produces more accurate solutions than a single model would.
Welcome to the Supervised Machine Learning and Data Sciences.
Algorithms for building models. Support Vector Machines.
Classification algorithm explanation and code in Python ( SVM ) .
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine leaning, pruning, rules, suprisal
Machine Learning With Logistic RegressionKnoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being programmed. Logistic Regression is a type of classification algorithm, based on linear regression to evaluate output and to minimize the error.
DBScan stands for Density-Based Spatial Clustering of Applications with Noise.
DBScan Concepts
DBScan Parameters
DBScan Connectivity and Reachability
DBScan Algorithm , Flowchart and Example
Advantages and Disadvantages of DBScan
DBScan Complexity
Outliers related question and its solution.
K-Folds cross-validation is one method that attempts to maximize the use of the available data for training and then testing a model. It is particularly useful for assessing model performance, as it provides a range of accuracy scores across (somewhat) different data sets.
ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built.
Slide explaining the distinction between bagging and boosting while understanding the bias variance trade-off. Followed by some lesser known scope of supervised learning. understanding the effect of tree split metric in deciding feature importance. Then understanding the effect of threshold on classification accuracy. Additionally, how to adjust model threshold for classification in supervised learning.
Note: Limitation of Accuracy metric (baseline accuracy), alternative metrics, their use case and their advantage and limitations were briefly discussed.
Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results.
Ensemble learning usually produces more accurate solutions than a single model would.
Welcome to the Supervised Machine Learning and Data Sciences.
Algorithms for building models. Support Vector Machines.
Classification algorithm explanation and code in Python ( SVM ) .
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine leaning, pruning, rules, suprisal
Machine Learning With Logistic RegressionKnoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being programmed. Logistic Regression is a type of classification algorithm, based on linear regression to evaluate output and to minimize the error.
DBScan stands for Density-Based Spatial Clustering of Applications with Noise.
DBScan Concepts
DBScan Parameters
DBScan Connectivity and Reachability
DBScan Algorithm , Flowchart and Example
Advantages and Disadvantages of DBScan
DBScan Complexity
Outliers related question and its solution.
K-Folds cross-validation is one method that attempts to maximize the use of the available data for training and then testing a model. It is particularly useful for assessing model performance, as it provides a range of accuracy scores across (somewhat) different data sets.
ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built.
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Maninda Edirisooriya
Decision Trees and Ensemble Methods is a different form of Machine Learning algorithm classes. This was one of the lectures of a full course I taught in University of Moratuwa, Sri Lanka on 2023 second half of the year.
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...zohebmusharraf
goog is a great place to work sincerely excellent communication skills self motivated ability to work sincerely excellent communication skills self motivated
Detailed discussion about decision tree regressor and the classifier with finding the right algorithm to split
Let me know if anything is required. Ping me at google #bobrupakroy
Get to know in detail the termonologies of Random Forest with their types of algorithms used in the workflow along with their advantages and disadvantages of their predecessors.
Thanks, for your time, if you enjoyed this short article there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfAdityaSoraut
Its all about Machine learning .Machine learning is a field of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit programming instructions. Instead, these algorithms learn from data, identifying patterns, and making decisions or predictions based on that data.
There are several types of machine learning approaches, including:
Supervised Learning: In this approach, the algorithm learns from labeled data, where each example is paired with a label or outcome. The algorithm aims to learn a mapping from inputs to outputs, such as classifying emails as spam or not spam.
Unsupervised Learning: Here, the algorithm learns from unlabeled data, seeking to find hidden patterns or structures within the data. Clustering algorithms, for instance, group similar data points together without any predefined labels.
Semi-Supervised Learning: This approach combines elements of supervised and unsupervised learning, typically by using a small amount of labeled data along with a large amount of unlabeled data to improve learning accuracy.
Reinforcement Learning: This paradigm involves an agent learning to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties, enabling it to learn the optimal behavior to maximize cumulative rewards over time.Machine learning algorithms can be applied to a wide range of tasks, including:
Classification: Assigning inputs to one of several categories. For example, classifying whether an email is spam or not.
Regression: Predicting a continuous value based on input features. For instance, predicting house prices based on features like square footage and location.
Clustering: Grouping similar data points together based on their characteristics.
Dimensionality Reduction: Reducing the number of input variables to simplify analysis and improve computational efficiency.
Recommendation Systems: Predicting user preferences and suggesting items or actions accordingly.
Natural Language Processing (NLP): Analyzing and generating human language text, enabling tasks like sentiment analysis, machine translation, and text summarization.
Machine learning has numerous applications across various domains, including healthcare, finance, marketing, cybersecurity, and more. It continues to be an area of active research and
A decision tree is a guide to the potential results of a progression of related choices. It permits an individual or association to gauge potential activities against each other dependent on their costs, probabilities, and advantages. They can be utilized either to drive casual conversation or to outline a calculation that predicts the most ideal decision scientifically.
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
Concepts include decision tree with its examples. Measures used for splitting in decision tree like gini index, entropy, information gain, pros and cons, validation. Basics of random forests with its example and uses.
Using InfluxDB for real-time monitoring in JmeterKnoldus Inc.
Explore the integration of InfluxDB with JMeter for real-time performance monitoring. This session will cover setting up InfluxDB to capture JMeter metrics, configuring JMeter to send data to InfluxDB, and visualizing the results using Grafana. Learn how to leverage this powerful combination to gain real-time insights into your application's performance, enabling proactive issue detection and faster resolution.
Intoduction to KubeVela Presentation (DevOps)Knoldus Inc.
KubeVela is an open-source platform for modern application delivery and operation on Kubernetes. It is designed to simplify the deployment and management of applications in a Kubernetes environment. KubeVela is a modern software delivery platform that makes deploying and operating applications across today's hybrid, multi-cloud environments easier, faster and more reliable. KubeVela is infrastructure agnostic, programmable, yet most importantly, application-centric. It allows you to build powerful software, and deliver them anywhere!
Stakeholder Management (Project Management) PresentationKnoldus Inc.
A stakeholder is someone who has an interest in or who is affected by your project and its outcome. This may include both internal and external entities such as the members of the project team, project sponsors, executives, customers, suppliers, partners and the government. Stakeholder management is the process of managing the expectations and the requirements of these stakeholders.
Introduction To Kaniko (DevOps) PresentationKnoldus Inc.
Kaniko is an open-source tool developed by Google that enables building container images from a Dockerfile inside a Kubernetes cluster without requiring a Docker daemon. Kaniko executes each command in the Dockerfile in the user space using an executor image, which runs inside a container, such as a Kubernetes pod. This allows building container images in environments where the user doesn’t have root access, like a Kubernetes cluster.
Efficient Test Environments with Infrastructure as Code (IaC)Knoldus Inc.
In the rapidly evolving landscape of software development, the need for efficient and scalable test environments has become more critical than ever. This session, "Streamlining Development: Unlocking Efficiency through Infrastructure as Code (IaC) in Test Environments," is designed to provide an in-depth exploration of how leveraging IaC can revolutionize your testing processes and enhance overall development productivity.
Exploring Terramate DevOps (Presentation)Knoldus Inc.
Terramate is a code generator and orchestrator for Terraform that enhances Terraform's capabilities by adding features such as code generation, stacks, orchestration, change detection, globals, and more . It's primarily designed to help manage Terraform code at scale more efficiently . Terramate is particularly useful for managing multiple Terraform stacks, providing support for change detection and code generation 2. It allows you to create relationships between stacks to improve your understanding and control over your infrastructure . One of the key features of Terramate is its ability to detect changes at both the stack and module level. This capability allows you to identify which stacks and resources have been altered and selectively determine where you should execute commands.
Clean Code in Test Automation Differentiating Between the Good and the BadKnoldus Inc.
This session focuses on the principles of writing clean, maintainable, and efficient code in the context of test automation. The session will highlight the characteristics that distinguish good test automation code from bad, ultimately leading to more reliable and scalable testing frameworks.
Integrating AI Capabilities in Test AutomationKnoldus Inc.
Explore the integration of artificial intelligence in test automation. Understand how AI can enhance test planning, execution, and analysis, leading to more efficient and reliable testing processes. Explore the cutting-edge integration of Artificial Intelligence (AI) capabilities in Test Automation, a transformative approach shaping the future of software testing. This session will delve into practical applications, benefits, and considerations associated with infusing AI into test automation workflows.
State Management with NGXS in Angular.pptxKnoldus Inc.
NGXS is a state management pattern and library for Angular. NGXS acts as a single source of truth for your application's state - providing simple rules for predictable state mutations. In this session we will go through the main for components of NGXS -Store, Actions, State, and Select.
Authentication in Svelte using cookies.pptxKnoldus Inc.
Svelte streamlines authentication with cookies, offering a secure and seamless user experience. Effortlessly manage sessions by storing tokens in cookies, ensuring persistent logins. With Svelte's simplicity, implement robust authentication mechanisms, enhancing user security and interaction.
OAuth2 Implementation Presentation (Java)Knoldus Inc.
The OAuth 2.0 authorization framework is a protocol that allows a user to grant a third-party web site or application access to the user's protected resources, without necessarily revealing their long-term credentials or even their identity. It is commonly used in scenarios such as user authentication in web and mobile applications and enables a more secure and user-friendly authorization process.
Supply chain security with Kubeclarity.pptxKnoldus Inc.
Kube clarity is a comprehensive solution designed to enhance supply chain security within Kubernetes environments. Kube clarity enables organizations to identify and mitigate potential security threats throughout the software development and deployment process.
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingKnoldus Inc.
In this session, we will delve into the world of web scraping with JSoup, an open-source Java library. Here we are going to learn how to parse HTML effectively, extract meaningful data, and navigate the Document Object Model (DOM) for powerful web scraping capabilities.
Akka gRPC Essentials A Hands-On IntroductionKnoldus Inc.
Dive into the fundamental aspects of Akka gRPC and learn to leverage its power in building compact and efficient distributed systems. This session aims to equip attendees with the essential skills and knowledge to leverage Akka and gRPC effectively in building robust, scalable, and distributed applications.
Entity Core with Core Microservices.pptxKnoldus Inc.
How Developers can use Entity framework(ORM) which provides a structured and consistent way for microservices to interact with their respective database, prompting independence, scaliblity and maintainiblity in a distributed system, and also provide a high-level abstraction for data access.
Introduction to Redis and its features.pptxKnoldus Inc.
Join us for an interactive session where we'll cover the fundamentals of Redis, practical use cases, and best practices for incorporating Redis into your projects. Whether you're a developer, architect, or system administrator, this session will equip you with the knowledge to harness the full potential of Redis for your applications. Get ready to elevate your understanding of in-memory data storage and revolutionize the way you handle data in your projects with Redis
GraphQL with .NET Core Microservices.pdfKnoldus Inc.
In this Webinar, will talk on GraphQL with .NET, that provides a modern and flexible approach to building APIs. It empowers developers to create efficient and tailored APIs that meet the specific needs of their applications and clients.
NuGet Packages Presentation (DoT NeT).pptxKnoldus Inc.
These packages and topics cover various aspects of .NET development, offering solutions for common needs in software development, including logging, database interaction, API communication, testing, security, and more. Depending on the requirements of your project, incorporating these packages can significantly enhance the development process.
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
Data Quality in Test Automation: Navigating the Path to Reliable Testing" delves into the crucial role of data quality within the realm of test automation. It explores strategies and methodologies for ensuring reliable testing outcomes by addressing challenges related to the accuracy, completeness, and consistency of test data. The discussion encompasses techniques for managing, validating, and optimizing data sets to enhance the effectiveness and efficiency of automated testing processes, ultimately fostering confidence in the reliability of software systems.
K8sGPTThe AI way to diagnose KubernetesKnoldus Inc.
K8sGPT is a tool for scanning your Kubernetes clusters, diagnosing and triaging issues in simple english. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI. In the session we will integrate OpenAI with K8sGPT to diagnose our Kubernetes Cluster.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
2. Agenda
● What are Decision trees?
● Appropriate problems for DTrees
● Information gain , Entropy
● Issues with DTrees
● Overfitting
● Pruning
● Demo
3. What are Decision trees?
● A decision tree is a tree in which each branch
node represents a choice between a number
of alternatives, and each leaf node represents
a decision.
● A type of supervised learning algorithm.
4. ● It is one of the most widely used and practical
methods for Inductive Inference.
6. Appropriate problems for DTrees
● Instances are represented by attribute-value pairs
● The target function has discrete output values
● The training data may contain errors
● The training data may have missing attribute values
7. How to select the deciding node?
Which is the best Classifier?
9. ➔ Less impure node requires less information to describe it.
➔ More impure node requires more information.
===> Information theory is a measure to
define this degree of disorganization in a
system known as Entropy.
So, we can conclude
10. Entropy - measuring
homogeneity of a learning set
● Entropy is a measure of the uncertainty about a source of
messages.
● Given a collection S, containing positive and negative
examples of some target concept, the entropy of S relative to
this classification.
where, pi is the proportion of S belonging to class i
11. ●
Entropy is 0 if all the members
of S belong to the same class.
● Entropy is 1 when the collection
contains an equal no. of +ve
and -ve examples.
●
Entropy is between 0 and 1 if
the collection contains unequal
no. of +ve and -ve examples.
12. Information gain
● Decides which attribute goes into a decision node.
● To minimize the decision tree depth, the attribute with the
most entropy reduction is the best choice!
13. Where:
● S is each value v of all possible values of attribute A
● Sv = subset of S for which attribute A has value v
● |Sv| = number of elements in Sv
● |S| = number of elements in S
The information gain, Gain(S,A) of an attribute A,
15. ISSUES
● How deeply to grow the decision tree?
● Handling continuous attributes
● Choosing an appropriate attribute selection
measure
● Handling training data with missing attribute
values
16. Overfitting in Decision Trees
● If a decision tree is fully grown, it may lose
some generalization capability.
● This is a phenomenon known as overfitting.
17. Overfitting
A hypothesis overfits the training examples if
some other hypothesis that fits the training
examples less well actually performs better
over the entire distribution of instances (i.e,
including instances beyond the training
examples)
19. Causes of Overfitting
● Overfitting Due to Presence of Noise -
Mislabeled instances may contradict the class
labels of other similar records.
● Overfitting Due to Lack of Representative
Instances - Lack of representative instances
in the training data can prevent refinement of
the learning algorithm.
24. Overfitting Due to Lack of
Samples
● Although the model’s training error
is zero, its error rate on the test set
is 30%.
● Humans, elephants, and dolphins
are misclassified because the
decision tree classifies all
warmblooded vertebrates that do not
hibernate as non-mammals.
25. A good model must not only fit the
training data well
but also accurately classify records
it has never seen.
26. Avoiding overfitting in decision
tree learning
● approaches that stop growing the tree earlier,
before it reaches the point where it perfectly
classifies the training data.
● approaches that allow the tree to overfit the
data, and then post-prune the tree.
27. Approaches to implement
● separate set of examples
● a statistical test: chi-square test
● a heuristic called the Minimum Description
Length principle - (explicit measure of the
complexity for encoding the training examples
and the decision tree, halting growth of the
tree when this encoding size is minimized.)
28. PRUNING
● Consider each of the decision nodes in the tree to be candidates for
pruning.
● Pruning a decision node consists of removing the subtree rooted at
that node, making it a leaf node, and assigning it the most common
classification of the training examples affiliated with that node.
● Nodes are removed only if the resulting pruned tree performs no
worse than the original over the validation set.
● Pruning of nodes continues until further pruning is harmful (i.e.,
decreases accuracy of the tree over the validation set).