This document provides an introduction to machine learning and decision trees. It defines key concepts such as artificial intelligence, machine learning, and deep learning, and surveys the main categories of machine learning, including supervised and unsupervised learning. It then explains how decision trees are built by choosing a feature to split on at each node based on metrics like entropy and information gain, and works through an example of calculating both to select the best feature for the root node.
This presentation covers the Decision Tree as a supervised machine learning technique, discussing the Information Gain and Gini Index methods and their related algorithms.
3. Before We Begin...
• Deep Learning (Subset of ML) – Uses Deep Neural Networks (a shallow network has one hidden layer, a deep network has more than one) to learn features of the data in a hierarchical manner (e.g. pixels from one layer recombine to form a line in the next layer)
– computer vision
– speech recognition
– natural language processing
• Artificial Intelligence – Basically a computer program doing something "smart"
– A bunch of if-then statements
– Machine Learning
• Machine Learning (Subset of AI) – A broad umbrella term for the technology that finds patterns in your existing data, and uses them to make predictions on new data points
– Fraud Detection
– Deep Learning
4. AI | ML | DL – Maybe a picture is better?
Great Resource: The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, by Pedro Domingos
5. Timeline Of Machine Learning
1950 – The Learning Machine (Alan Turing)
1952 – Machine Playing Checkers (Arthur Samuel)
1957 – Perceptron (Frank Rosenblatt)
1979 – Stanford Cart
1986 – Backpropagation (D. Rumelhart, G. Hinton, R. Williams)
1997 – Deep Blue beats Kasparov
2011 – Watson wins Jeopardy
2012 – Google NN recognizing a cat on YouTube
2014 – Facebook DeepFace, Amazon Echo, Turing Test passed
2016 – DeepMind wins Go
6. Explosion in AI and ML Use Cases
Image recognition and tagging for photo organization
Object detection, tracking and navigation for Autonomous Vehicles
Speech recognition & synthesis in Intelligent Voice Assistants
Algorithmic trading strategy performance improvement
Sentiment analysis for targeted advertisements
17. Supervised Learning – How Machines Learn
Human intervention and validation required (e.g. photo classification and tagging)
(Diagram: training data – inputs paired with labels such as "Labrador" – feeds a Machine Learning Algorithm; for a new input the model makes a prediction, e.g. "Cat"; the prediction is compared with the true label "Labrador" and the model is adjusted.)
18. Unsupervised Learning (learning without labels)
No human intervention required (e.g. customer segmentation)
(Diagram: unlabelled input feeds a Machine Learning Algorithm, which produces a prediction on its own.)
19. Machine Learning Use Cases
Supervised Learning
• Classification
– Spam detection
– Customer churn prediction
• Regression
– House price prediction
– Demand forecasting
Unsupervised Learning
• Clustering
– Customer segmentation
There are other types as well (Reinforcement Learning, for example), but these two are the primary areas today.
20. There are Lots of Machine Learning Algorithms
machinelearningmastery.com
22. Some Dataset
Color (input feature) | Size (input feature) | Fruit (target label)
Red | Big | Apple
Red | Small | Apple
Yellow | Small | Lemon
Red | Big | Apple
Green | Big | Apple
Yellow | Big | Lemon
Green | Small | Lemon
Red | Big | Apple
Yellow | Big | Lemon
Green | Big | Apple
23. Decision Tree might look like…
Color of the fruit? (Root; each split is called Splitting, each connection a Branch)
├─ Red → Apple (Leaf)
├─ Yellow → Lemon (Leaf)
└─ Green → Size of the fruit? (interior node)
    ├─ Big → Apple (Leaf)
    └─ Small → Lemon (Leaf)
26. But the question is… given a dataset, how can we build a tree like this? (Shows the same example tree as above.)
29. General DT structure
(Diagram: a Root node at the top branches into Interior nodes, which branch further until every path ends at a Leaf; the example fruit tree above follows this structure.)
30. Training flow of a Decision Tree
• Prepare the labelled data set
• Try to pick the best feature as the root node
• Grow the tree until a stopping criterion is met
• Pass the prediction query through the tree until we arrive at some leaf
• Once we get the leaf node, we have the prediction!! :)
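To make the last two steps concrete, here is a minimal Python sketch (my own illustration, not part of the original deck): the fruit tree from slide 23 represented as nested dictionaries, with a prediction query walked down to a leaf.

fruit_tree = {
    "feature": "Color",
    "branches": {
        "Red": "Apple",                      # leaf
        "Yellow": "Lemon",                   # leaf
        "Green": {                           # interior node: split again on Size
            "feature": "Size",
            "branches": {"Big": "Apple", "Small": "Lemon"},
        },
    },
}

def predict(node, query):
    # Walk down the tree until we arrive at a leaf (a plain string).
    while isinstance(node, dict):
        node = node["branches"][query[node["feature"]]]
    return node

print(predict(fruit_tree, {"Color": "Green", "Size": "Small"}))  # Lemon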
31–35. (Diagram sequence) Training data – where Feature 1 through Feature 4 and the Target Label are all known – is used to build the tree (Root → Interior nodes → Leaves). Prediction data has only Feature 1 to Feature 4 known; its Target Label is UNKNOWN (???). Send the query/inference down the trained tree, and get the prediction at a leaf.
37–40. Entropy
Entropy captures the notion of the impurity of the data – but what is this new term, "impurity of the data"? (Diagram: three sample sets, ranging from pure, to less pure, to impure.)
41–42. Entropy
H(x) = - ∑ P(k) * log2(P(k)), where k ranges from 1 through n
H(x) = Entropy of x
P(k) = Probability of random variable x when x = k
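The formula translates directly into a few lines of Python. A minimal sketch (my own, not from the slides):

import math

def entropy(probabilities):
    # H(x) = -sum over k of P(k) * log2(P(k)); terms with P(k) == 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -> a 50/50 split is maximally impure
print(entropy([1.0]))       # -0.0, i.e. zero -> a pure set has no impurity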
43. Dataset – D
Outlook | Temperature | Humidity | Windy | Play ball
Rainy | Hot | High | FALSE | No
Rainy | Hot | High | TRUE | No
Overcast | Hot | High | FALSE | Yes
Sunny | Mild | High | FALSE | Yes
Sunny | Cool | Normal | FALSE | Yes
Sunny | Cool | Normal | TRUE | No
Overcast | Cool | Normal | TRUE | Yes
Rainy | Mild | High | FALSE | No
Rainy | Cool | Normal | FALSE | Yes
Sunny | Mild | Normal | FALSE | Yes
Rainy | Mild | Normal | TRUE | Yes
Overcast | Mild | High | TRUE | Yes
Overcast | Hot | Normal | FALSE | Yes
Sunny | Mild | High | TRUE | No
(14 rows: 9 Yes, 5 No)
44–47. For X = "Play Ball" on dataset D:
P(k=Yes) => 9/14 = 0.64
P(k=No) => 5/14 = 0.36
log2(0.64) = -0.64
log2(0.36) = -1.47
so H_D("Play Ball") = -(0.64 × -0.64) - (0.36 × -1.47) = 0.41 + 0.53 ≈ 0.94
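A quick sanity check of that arithmetic (again my own snippet, not from the deck):

import math

# 9 "Yes" and 5 "No" out of 14 rows:
p_yes, p_no = 9 / 14, 5 / 14
h = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(h, 2))  # 0.94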
53. Information Gain (IG)
(Computed on dataset D – the "Play Ball" table above.)
54. Splitting dataset D on Outlook gives three sub-datasets:
Sub-Dataset – D1 (Outlook = Rainy, 5/14 of the rows):
Rainy | Hot | High | FALSE | No
Rainy | Hot | High | TRUE | No
Rainy | Mild | High | FALSE | No
Rainy | Cool | Normal | FALSE | Yes
Rainy | Mild | Normal | TRUE | Yes
H_D1("Play Ball") = 0.97
Sub-Dataset – D2 (Outlook = Overcast, 4/14 of the rows):
Overcast | Hot | High | FALSE | Yes
Overcast | Cool | Normal | TRUE | Yes
Overcast | Mild | High | TRUE | Yes
Overcast | Hot | Normal | FALSE | Yes
H_D2("Play Ball") = 0
Sub-Dataset – D3 (Outlook = Sunny, 5/14 of the rows):
Sunny | Mild | High | FALSE | Yes
Sunny | Cool | Normal | FALSE | Yes
Sunny | Cool | Normal | TRUE | No
Sunny | Mild | Normal | FALSE | Yes
Sunny | Mild | High | TRUE | No
H_D3("Play Ball") = 0.97
Weighted Entropy = (5/14) × 0.97 + (4/14) × 0 + (5/14) × 0.97 = 0.69
IG_Outlook = Entropy(D) - Weighted Entropy
= 0.94 - 0.69
= 0.25
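The same computation in code – a self-contained sketch of my own (the slides only show the numbers), reproducing the information gains for all four features:

import math
from collections import Counter

# The "Play Ball" dataset D from the table above.
rows = [
    ("Rainy","Hot","High",False,"No"), ("Rainy","Hot","High",True,"No"),
    ("Overcast","Hot","High",False,"Yes"), ("Sunny","Mild","High",False,"Yes"),
    ("Sunny","Cool","Normal",False,"Yes"), ("Sunny","Cool","Normal",True,"No"),
    ("Overcast","Cool","Normal",True,"Yes"), ("Rainy","Mild","High",False,"No"),
    ("Rainy","Cool","Normal",False,"Yes"), ("Sunny","Mild","Normal",False,"Yes"),
    ("Rainy","Mild","Normal",True,"Yes"), ("Overcast","Mild","High",True,"Yes"),
    ("Overcast","Hot","Normal",False,"Yes"), ("Sunny","Mild","High",True,"No"),
]
features = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, col):
    # Entropy before the split, minus the weighted entropy of the sub-datasets.
    values = {r[col] for r in rows}
    weighted = sum(
        len(subset) / len(rows) * entropy([r[-1] for r in subset])
        for subset in ([r for r in rows if r[col] == v] for v in values)
    )
    return entropy([r[-1] for r in rows]) - weighted

for name, col in features.items():
    print(f"IG_{name} = {information_gain(rows, col):.2f}")
# IG_Outlook = 0.25, IG_Temperature = 0.03, IG_Humidity = 0.15, IG_Windy = 0.05
# (the slide's 0.04 for Windy comes from rounding the weighted entropy to 0.90 first)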
61. IG_Outlook = H_D("Play Ball") - Weighted Entropy after splitting the dataset on Outlook
= 0.94 - 0.69 = 0.25
IG_Temperature = H_D("Play Ball") - Weighted Entropy after splitting the dataset on Temperature
= 0.94 - 0.91 = 0.03
IG_Humidity = H_D("Play Ball") - Weighted Entropy after splitting the dataset on Humidity
= 0.94 - 0.79 = 0.15
IG_Windy = H_D("Play Ball") - Weighted Entropy after splitting the dataset on Windy
= 0.94 - 0.90 = 0.04
62. Maximum IG? – Outlook. Outlook is therefore chosen as the root node.
66. Here are the algorithmic steps:
1. First, the entropy of the total dataset is calculated for the target label/class.
2. The dataset is then split on the different features:
a) The entropy for each branch is calculated, then added proportionally to get the total weighted entropy for the split.
b) The resulting entropy is subtracted from the entropy before the split.
c) The result is the Information Gain.
3. The feature that yields the largest IG is chosen for the decision node.
4. Repeat steps #2 and #3 for each subset of the data (for each internal node) until:
a) all the dependent features are exhausted, or
b) the stopping criteria are met.
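Put together, these steps become a short recursive procedure. A minimal ID3-style sketch (mine, not from the deck), reusing the rows, features, and information_gain helper from the earlier snippet:

from collections import Counter

def build_tree(rows, feature_cols):
    # Returns a label (leaf) or a nested dict (decision node).
    labels = [r[-1] for r in rows]
    # Stopping criteria: the node is pure, or no features are left to split on.
    if len(set(labels)) == 1 or not feature_cols:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the feature with the largest information gain.
    best = max(feature_cols, key=lambda c: information_gain(rows, c))
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [c for c in feature_cols if c != best]
        branches[value] = build_tree(subset, remaining)  # step 4: recurse
    return {"feature": best, "branches": branches}

tree = build_tree(rows, list(features.values()))
print(tree)  # the root splits on column 0, i.e. Outlook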
67. Thankfully, we do not have to do all this (calculating Entropy, IG, etc.) by hand – there are lots of libraries/packages available in Python which we can use to solve a problem with a decision tree.
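For example, with scikit-learn's DecisionTreeClassifier (a real scikit-learn API; the one-hot encoding step is my own assumption about how to feed it these categorical features), reusing the rows list from the earlier snippet:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The same "Play Ball" dataset, as a DataFrame.
df = pd.DataFrame(rows, columns=["Outlook", "Temperature", "Humidity", "Windy", "PlayBall"])

# scikit-learn trees need numeric inputs, so one-hot encode the categorical features.
X = pd.get_dummies(df.drop(columns="PlayBall"))
y = df["PlayBall"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in the slides
clf.fit(X, y)

query = pd.DataFrame([{"Outlook": "Sunny", "Temperature": "Cool",
                       "Humidity": "Normal", "Windy": False}])
print(clf.predict(pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)))
# ['Yes']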
69. AWS ML Stack – Broadest and most complete set of Machine Learning capabilities
(Diagram, three layers:)
AI Services – Vision: Amazon Rekognition; Text: Amazon Textract, Amazon Comprehend, Amazon Translate; Speech: Amazon Polly, Amazon Transcribe + Medical; Chatbots: Amazon Lex; Personalization: Amazon Personalize; Forecasting: Amazon Forecast
Amazon SageMaker – Ground Truth data labelling, ML Marketplace, SageMaker Studio IDE, SageMaker Notebooks, SageMaker Experiments, SageMaker Debugger, SageMaker Autopilot, SageMaker Model Monitor, model training, model tuning, model hosting, built-in algorithms, SageMaker Neo
ML Frameworks & Infrastructure – Deep Learning AMIs & Containers, GPUs and CPUs, Inferentia, Elastic Inference, FPGA