SlideShare a Scribd company logo
1
2
Understanding Decision Tree Algorithm
using R Programming Language
Presented By:
Akshat Sharma#1, Anuj Srivastava#2
#Computer Science Department, Integral
University Lucknow, Uttar Pradesh, India
1mailidakshat@gmail.com
2anuj@iul.ac.in
3
Abstract
Decision Tree is one of the most efficient technique to carry out
data mining, which can be easily implemented by using R, a powerful
statistical tool which is used by more than 2 million statisticians and
data scientists worldwide. Decision trees can be used in a variety of
disciplines, such as for predicting which patient characteristics are
associated with high risk of a disease. When used together, we can find
the relevant set of data, from the large data stored in the Enterprise
Data Warehouses (EDWs). This provides new opportunities for
organizations to derive new value from their most valuable and
abundantly available asset: information, to create a competitive edge.
4
Introduction
• The first three phases of Data Analytics Lifecycle- discovery, data
preparation, and model planning, involve various aspects of data
exploration. The success of a data analysis project requires a deep
understanding of the data, it requires a tool for data mining and presenting
the data. We can use decision tree as a tool for data mining and R for
presenting the data.
• Data Mining is an analytic process designed to explore data (usually large
amounts of data - also known as "big data") in search of consistent patterns
and/or systematic relationships between variables, and then to validate the
findings by applying the detected patterns to new subsets of data. The
ultimate goal of data mining is prediction. The most common type of data
mining is Predictive data mining and it is the one that has the most direct
business applications. [1] 5
• Fundamental learning methods that appear in applications related
to data mining are-
1. Clustering
2. Association Rules
3. Regression
4. Classification
5. Time Series Analysis
6. Text Analysis
6
Introduction to ‘R’
• The ‘R’ language is a project designed to create a free, open source
language which can be used as a replacement for the ‘S’ language, ‘R’ was
created by Ross Ihaka and Robert Gentleman at the University of Auckland,
New Zealand, and is currently developed by the R Development Core Team, of
which Chambers is a member. ‘R’ first appeared in 1993; 23 years ago.
• ‘R’ is named partly after the first names of the first two ‘R’ authors and
partly as a play on the name of ‘S’.
• ‘R’ is an open source implementation of ‘S’, and differs from S-plus largely in
its command-line only format.
• ‘S’ is a statistical programming language developed primarily by John
Chambers and Rick Becker and Allan Wilks of Bell Laboratories.
7
Figure 1. A schematic view of how ‘R’ works. [2]
8
Installing R
• R is freely downloadable from http://cran.r-project.org/ , as is a wide
range of documentation.
• Most users should download and install a binary version. This is a
version that has been translated (by compilers) into machine language
for execution on a particular type of computer with a particular
operating system. Installation on Microsoft Windows is
straightforward.
• A binary version is available for windows 98 or above from the web
page http://cran.r-project.org/bin/windows/base .
9
10
Packages in ‘R’
• The most important single innovation in ‘R’ is the package
system, which provides a cross-platform system for distributing
and testing code and data.
• The Comprehensive R Archive Network (http://cran.r-
project.org) distributes public packages, but packages are also
useful for internal distribution.
• A package consists of a directory with a DESCRIPTION file
and subdirectories with code, data, documentation, etc. The
Writing R Extensions manual documents the package system,
and package.skeleton() simplifies package creation. [3]
11
Package Description
base Base R functions
datasets Base R datasets
grDevices Graphics devices for base and grid graphics
graphics R functions for base graphics
methods Formally defined methods and classes for R
objects
stats R statistical functions
utils R utility functions
12
Table 1. ‘R’ packages, loaded on startup. [4]
The libraries shown in Table 1 are loaded on the ‘R’ startup.
Applications of ‘R’
1. Clinical Trials
2. Cluster Analysis
3. Computational Physics
4. Differential Equations
5. Econometrics
6. Environmental Studies
7. Experimental Design
8. Finance
9. Genetics
10. Graphical Models
13
11. Graphics and Visualizations
12. High Performance Computing
13. High Throughput Genomics
14. Machine Learning
15. Medical Imaging
16. Meta Analysis
17. Multivariate Statistics
18. Natural Language Processing
19. Official Statistics
20. Optimization
Prominent Users of ‘R’
14
INTRODUCTION TO DECISION TREE
• A decision tree is also called a prediction tree. A decision tree uses a
structure to specify sequences of decisions and consequences. Given
input X={X1, X2,..,Xn}, the goal is to predict a response or output
variable Y. [5] Each member of the set {X1,X2,…,Xn} is called an input
variable.
• A decision tree consists of nodes, and thus form a rooted tree, this
means that it is a directed tree with a node called root. There are no
incoming edges on root node, all other nodes in a decision tree have
exactly one incoming edge. An internal node is a node with an
incoming edge and outgoing edges, internal node is also known as
test node. Nodes with no outgoing edges are known as leaves or
terminal nodes.
15
• Figure 2 describes a decision tree
that reasons whether to send an
alert or to send a warning when an
error is triggered.
• In Figure 2, to depict the tree in
more clear way, we can represent
the internal nodes as circles, and
leaves as triangles, to tell them apart
in one glance. Each node is labeled
with the attribute it tests, and its
branches are labeled with its
corresponding values. The leaf nodes
represent class labels, i.e. the
outcome of all the prior decisions.
Figure 2
16
THE GENERAL DECISION TREE ALGORITHM
• The objective of a decision tree algorithm is to construct a tree ‘T’ from
a training set ‘S’. If all the records in ‘S’ belong to some class ‘C’, or if ‘S’
is sufficiently pure, then that node is considered a leaf node and
assigned the label ‘C’. The purity of a node is defined as its probability of
the corresponding class. [6]
• The algorithm constructs subtrees for the subsets of S recursively until
one of the following criteria is met: [7]
1. All the leaf nodes in the tree satisfy the minimum purity threshold.
2. The tree cannot be further split with the preset minimum purity
threshold.
3. Any other stopping criterion is satisfied (such as the maximum depth
of the tree).
17
DECISION TREES IN ‘R’
• In ‘R’, ‘rpart’ is for modelling decision trees, and an optional package
‘rpart.plot’ enables the plotting of a tree. [8] The rpart package can
be used for classification by decision trees and can also be used to
generate regression trees.
• To grow a tree, use: [9]
rpart (formula, data=, method=, control=)
Where,
18
• Entropy and information gain are common technical
terms/notations used in decision trees.
• Entropy is a measure of the number of random ways in which a
system may be arranged. For a data set ‘S’ containing ‘n’ records the
information entropy is defined as,[10]
• Let Hx is entropy of X defined as: [11]
(Here ‘Px’is the proportion of ‘S’)
19
Conclusion
• ‘R’ is the best at what it does- letting experts quickly and easily
interpret, interact with, and visualize data.
• The use of ‘R’ for implementing decision trees for classification in data
mining is highly beneficial over SAS and Matlab because:
i. ‘R’ is open source,
ii. ‘R’ has a large library support,
iii. ‘R’ supports visualization,
iv. all of this, while being inexpensive in comparison to both.
• Thanks to the worldwide community effort, ‘R’ keeps growing with
thousands of user-created packages built and added to enhance ‘R’
functionality.
20
References
[1] Data Mining Techniques. [Online]. Available:
•http://documents.software.dell.com/Statistics/Tex
tbook/Data-Mining-Techniques [Accessed 21
November 2015].
[2] Emmanuel Paradis, “R for Beginners”, CRAN.
Figure 1, pp. 8. [Online]. Available:
•https://cran.r-project.org/doc/contrib/Paradis-
rdebuts_en.pdf [Accessed 22 August 2015].
[3] Thomas Lumley, “R Fundamentals and
Programming Techniques,” in R Packages, January
2015, pg. 416. [Online].
[4] Adedin Culhane, Harvard School of Public
Health, BIO503, January 2013, “Introduction to
Programming and Statistical Modelling in R”.
[Online]. Available:
http://isites.harvard.edu/fs/docs/icb.topic1202070.fil
es/Bio503_winter13.pdf [Accessed 22 August 2015]. 21
[5] EMC Education Services, “Data Science and Big Data
Analytics,” in Advanced Analytical Theory and Methods:
Classification, January 2015, pp. 192.
[6] EMC Education Services, “Data Science and Big Data
Analytics,” in Advanced Analytical Theory and Methods:
Classification, January 2015, pg. 197.
[7] EMC Education Services, “Data Science and Big Data
Analytics,” in Advanced Analytical Theory and Methods:
Classification, January 2015, pg. 197.
[8] EMC Education Services, “Data Science and Big Data
Analytics,” in Advanced Analytical Theory and Methods:
Classification, January 2015, pg. 206.
[9] Data Mining Algorithms In R / Classification /
Decision Trees. [Online]. Available:
https://en.wikibooks.org/wiki/Data_Mining_Algo
rithms_In_R/Classification/Decision_Trees
[Accessed 27 November 2015].
[10] Venkatadri.M, Lokanatha C. Reddy. (2010, Apr.-
2010, Sept.). A comparative study on decision tree
classification algorithms in data mining.
International Journal of Computer Application in
Engineering, Technology and Sciences (IJ-CA-ETS).
[Online]. 1.1, pg. 1. Available:
https://www.academia.edu/
[11] Anuj Srivastava, Dr. Vinodini Katiyar, Navjot
Singh. (2015, June). AReview of Decision Tree
Algorithm: Big Data Analytics. International Journal
of Informative & Futuristic Research. ISSN (Online):
2347-1697, pg. 6.
22
23

More Related Content

What's hot

R programming
R programmingR programming
R programming
Dr. Vaibhav Kumar
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture Notes
FellowBuddy.com
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentation
Bhadra Gowdra
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
Victoria López
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
izahn
 
Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
GRajendra
 
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASESA PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
ijp2p
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Ii pu cs practical viva voce questions
Ii pu cs  practical viva voce questionsIi pu cs  practical viva voce questions
Ii pu cs practical viva voce questions
Prof. Dr. K. Adisesha
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
LakshmiSarvani6
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
iqbalphy1
 
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Chapter 1( intro & overview)
Chapter 1( intro & overview)Chapter 1( intro & overview)
Chapter 1( intro & overview)
MUHAMMAD AAMIR
 
Data structure
Data structureData structure
Data structureMohd Arif
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
Programming & Data Structure Lecture Notes
Programming & Data Structure Lecture NotesProgramming & Data Structure Lecture Notes
Programming & Data Structure Lecture Notes
FellowBuddy.com
 
Unit 2 linked list
Unit 2   linked listUnit 2   linked list
Unit 2 linked list
DrkhanchanaR
 
7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil
widespreadpromotion
 
R programming presentation
R programming presentationR programming presentation
R programming presentation
Akshat Sharma
 

What's hot (19)

R programming
R programmingR programming
R programming
 
Data and File Structure Lecture Notes
Data and File Structure Lecture NotesData and File Structure Lecture Notes
Data and File Structure Lecture Notes
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentation
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
 
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASESA PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Ii pu cs practical viva voce questions
Ii pu cs  practical viva voce questionsIi pu cs  practical viva voce questions
Ii pu cs practical viva voce questions
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
 
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
 
Chapter 1( intro & overview)
Chapter 1( intro & overview)Chapter 1( intro & overview)
Chapter 1( intro & overview)
 
Data structure
Data structureData structure
Data structure
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil
 
Programming & Data Structure Lecture Notes
Programming & Data Structure Lecture NotesProgramming & Data Structure Lecture Notes
Programming & Data Structure Lecture Notes
 
Unit 2 linked list
Unit 2   linked listUnit 2   linked list
Unit 2 linked list
 
7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil7. Tree - Data Structures using C++ by Varsha Patil
7. Tree - Data Structures using C++ by Varsha Patil
 
R programming presentation
R programming presentationR programming presentation
R programming presentation
 

Similar to Research paper presentation

Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdf
MDDidarulAlam15
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statistics
IBM
 
1_Introduction.pptx
1_Introduction.pptx1_Introduction.pptx
1_Introduction.pptx
ranapoonam1
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
ssuser3c3f88
 
Airline Data Analysis
Airline Data AnalysisAirline Data Analysis
Airline Data Analysis
Pedro Craggett
 
R tutorial
R tutorialR tutorial
R tutorial
Richard Vidgen
 
A Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using RA Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using R
Nicole Adams
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptx
salutiontechnology
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
Malla Reddy University
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
myworld93
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 
R basics for MBA Students[1].pptx
R basics for MBA Students[1].pptxR basics for MBA Students[1].pptx
R basics for MBA Students[1].pptx
rajalakshmi5921
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
NareshKarela1
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
Dr. Radhey Shyam
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
HaritikaChhatwal1
 
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali TirmiziFinancial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Dr. Muhammad Ali Tirmizi., Ph.D.
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
IRJET Journal
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul Singh
Ravi Basil
 
PPT - Introduction to R.pdf
PPT - Introduction to R.pdfPPT - Introduction to R.pdf
PPT - Introduction to R.pdf
ssuser65af26
 
Statistical Analysis and Data Analysis using R Programming Language: Efficien...
Statistical Analysis and Data Analysis using R Programming Language: Efficien...Statistical Analysis and Data Analysis using R Programming Language: Efficien...
Statistical Analysis and Data Analysis using R Programming Language: Efficien...
BRNSSPublicationHubI
 

Similar to Research paper presentation (20)

Unit1_Introduction to R.pdf
Unit1_Introduction to R.pdfUnit1_Introduction to R.pdf
Unit1_Introduction to R.pdf
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statistics
 
1_Introduction.pptx
1_Introduction.pptx1_Introduction.pptx
1_Introduction.pptx
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Airline Data Analysis
Airline Data AnalysisAirline Data Analysis
Airline Data Analysis
 
R tutorial
R tutorialR tutorial
R tutorial
 
A Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using RA Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using R
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptx
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
R basics for MBA Students[1].pptx
R basics for MBA Students[1].pptxR basics for MBA Students[1].pptx
R basics for MBA Students[1].pptx
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIASTBUSINESS ANALYTICS WITH R SOFTWARE DIAST
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
 
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali TirmiziFinancial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 4 by Dr. Syed Muhammad Ali Tirmizi
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul Singh
 
PPT - Introduction to R.pdf
PPT - Introduction to R.pdfPPT - Introduction to R.pdf
PPT - Introduction to R.pdf
 
Statistical Analysis and Data Analysis using R Programming Language: Efficien...
Statistical Analysis and Data Analysis using R Programming Language: Efficien...Statistical Analysis and Data Analysis using R Programming Language: Efficien...
Statistical Analysis and Data Analysis using R Programming Language: Efficien...
 

Recently uploaded

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Research paper presentation

  • 1. 1
  • 2. 2
  • 3. Understanding Decision Tree Algorithm using R Programming Language Presented By: Akshat Sharma#1, Anuj Srivastava#2 #Computer Science Department, Integral University Lucknow, Uttar Pradesh, India 1mailidakshat@gmail.com 2anuj@iul.ac.in 3
  • 4. Abstract Decision Tree is one of the most efficient technique to carry out data mining, which can be easily implemented by using R, a powerful statistical tool which is used by more than 2 million statisticians and data scientists worldwide. Decision trees can be used in a variety of disciplines, such as for predicting which patient characteristics are associated with high risk of a disease. When used together, we can find the relevant set of data, from the large data stored in the Enterprise Data Warehouses (EDWs). This provides new opportunities for organizations to derive new value from their most valuable and abundantly available asset: information, to create a competitive edge. 4
  • 5. Introduction • The first three phases of Data Analytics Lifecycle- discovery, data preparation, and model planning, involve various aspects of data exploration. The success of a data analysis project requires a deep understanding of the data, it requires a tool for data mining and presenting the data. We can use decision tree as a tool for data mining and R for presenting the data. • Data Mining is an analytic process designed to explore data (usually large amounts of data - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction. The most common type of data mining is Predictive data mining and it is the one that has the most direct business applications. [1] 5
  • 6. • Fundamental learning methods that appear in applications related to data mining are- 1. Clustering 2. Association Rules 3. Regression 4. Classification 5. Time Series Analysis 6. Text Analysis 6
  • 7. Introduction to ‘R’ • The ‘R’ language is a project designed to create a free, open source language which can be used as a replacement for the ‘S’ language, ‘R’ was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. ‘R’ first appeared in 1993; 23 years ago. • ‘R’ is named partly after the first names of the first two ‘R’ authors and partly as a play on the name of ‘S’. • ‘R’ is an open source implementation of ‘S’, and differs from S-plus largely in its command-line only format. • ‘S’ is a statistical programming language developed primarily by John Chambers and Rick Becker and Allan Wilks of Bell Laboratories. 7
  • 8. Figure 1. A schematic view of how ‘R’ works. [2] 8
  • 9. Installing R • R is freely downloadable from http://cran.r-project.org/ , as is a wide range of documentation. • Most users should download and install a binary version. This is a version that has been translated (by compilers) into machine language for execution on a particular type of computer with a particular operating system. Installation on Microsoft Windows is straightforward. • A binary version is available for windows 98 or above from the web page http://cran.r-project.org/bin/windows/base . 9
  • 10. 10
  • 11. Packages in ‘R’ • The most important single innovation in ‘R’ is the package system, which provides a cross-platform system for distributing and testing code and data. • The Comprehensive R Archive Network (http://cran.r- project.org) distributes public packages, but packages are also useful for internal distribution. • A package consists of a directory with a DESCRIPTION file and subdirectories with code, data, documentation, etc. The Writing R Extensions manual documents the package system, and package.skeleton() simplifies package creation. [3] 11
  • 12. Package Description base Base R functions datasets Base R datasets grDevices Graphics devices for base and grid graphics graphics R functions for base graphics methods Formally defined methods and classes for R objects stats R statistical functions utils R utility functions 12 Table 1. ‘R’ packages, loaded on startup. [4] The libraries shown in Table 1 are loaded on the ‘R’ startup.
  • 13. Applications of ‘R’ 1. Clinical Trials 2. Cluster Analysis 3. Computational Physics 4. Differential Equations 5. Econometrics 6. Environmental Studies 7. Experimental Design 8. Finance 9. Genetics 10. Graphical Models 13 11. Graphics and Visualizations 12. High Performance Computing 13. High Throughput Genomics 14. Machine Learning 15. Medical Imaging 16. Meta Analysis 17. Multivariate Statistics 18. Natural Language Processing 19. Official Statistics 20. Optimization
  • 14. Prominent Users of ‘R’ 14
  • 15. INTRODUCTION TO DECISION TREE • A decision tree is also called a prediction tree. A decision tree uses a structure to specify sequences of decisions and consequences. Given input X={X1, X2,..,Xn}, the goal is to predict a response or output variable Y. [5] Each member of the set {X1,X2,…,Xn} is called an input variable. • A decision tree consists of nodes, and thus form a rooted tree, this means that it is a directed tree with a node called root. There are no incoming edges on root node, all other nodes in a decision tree have exactly one incoming edge. An internal node is a node with an incoming edge and outgoing edges, internal node is also known as test node. Nodes with no outgoing edges are known as leaves or terminal nodes. 15
  • 16. • Figure 2 describes a decision tree that reasons whether to send an alert or to send a warning when an error is triggered. • In Figure 2, to depict the tree in more clear way, we can represent the internal nodes as circles, and leaves as triangles, to tell them apart in one glance. Each node is labeled with the attribute it tests, and its branches are labeled with its corresponding values. The leaf nodes represent class labels, i.e. the outcome of all the prior decisions. Figure 2 16
  • 17. THE GENERAL DECISION TREE ALGORITHM • The objective of a decision tree algorithm is to construct a tree ‘T’ from a training set ‘S’. If all the records in ‘S’ belong to some class ‘C’, or if ‘S’ is sufficiently pure, then that node is considered a leaf node and assigned the label ‘C’. The purity of a node is defined as its probability of the corresponding class. [6] • The algorithm constructs subtrees for the subsets of S recursively until one of the following criteria is met: [7] 1. All the leaf nodes in the tree satisfy the minimum purity threshold. 2. The tree cannot be further split with the preset minimum purity threshold. 3. Any other stopping criterion is satisfied (such as the maximum depth of the tree). 17
  • 18. DECISION TREES IN ‘R’ • In ‘R’, ‘rpart’ is for modelling decision trees, and an optional package ‘rpart.plot’ enables the plotting of a tree. [8] The rpart package can be used for classification by decision trees and can also be used to generate regression trees. • To grow a tree, use: [9] rpart (formula, data=, method=, control=) Where, 18
  • 19. • Entropy and information gain are common technical terms/notations used in decision trees. • Entropy is a measure of the number of random ways in which a system may be arranged. For a data set ‘S’ containing ‘n’ records the information entropy is defined as,[10] • Let Hx is entropy of X defined as: [11] (Here ‘Px’is the proportion of ‘S’) 19
  • 20. Conclusion • ‘R’ is the best at what it does- letting experts quickly and easily interpret, interact with, and visualize data. • The use of ‘R’ for implementing decision trees for classification in data mining is highly beneficial over SAS and Matlab because: i. ‘R’ is open source, ii. ‘R’ has a large library support, iii. ‘R’ supports visualization, iv. all of this, while being inexpensive in comparison to both. • Thanks to the worldwide community effort, ‘R’ keeps growing with thousands of user-created packages built and added to enhance ‘R’ functionality. 20
  • 21. References [1] Data Mining Techniques. [Online]. Available: •http://documents.software.dell.com/Statistics/Tex tbook/Data-Mining-Techniques [Accessed 21 November 2015]. [2] Emmanuel Paradis, “R for Beginners”, CRAN. Figure 1, pp. 8. [Online]. Available: •https://cran.r-project.org/doc/contrib/Paradis- rdebuts_en.pdf [Accessed 22 August 2015]. [3] Thomas Lumley, “R Fundamentals and Programming Techniques,” in R Packages, January 2015, pg. 416. [Online]. [4] Adedin Culhane, Harvard School of Public Health, BIO503, January 2013, “Introduction to Programming and Statistical Modelling in R”. [Online]. Available: http://isites.harvard.edu/fs/docs/icb.topic1202070.fil es/Bio503_winter13.pdf [Accessed 22 August 2015]. 21 [5] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pp. 192. [6] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 197. [7] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 197. [8] EMC Education Services, “Data Science and Big Data Analytics,” in Advanced Analytical Theory and Methods: Classification, January 2015, pg. 206.
  • 22. [9] Data Mining Algorithms In R / Classification / Decision Trees. [Online]. Available: https://en.wikibooks.org/wiki/Data_Mining_Algo rithms_In_R/Classification/Decision_Trees [Accessed 27 November 2015]. [10] Venkatadri.M, Lokanatha C. Reddy. (2010, Apr.- 2010, Sept.). A comparative study on decision tree classification algorithms in data mining. International Journal of Computer Application in Engineering, Technology and Sciences (IJ-CA-ETS). [Online]. 1.1, pg. 1. Available: https://www.academia.edu/ [11] Anuj Srivastava, Dr. Vinodini Katiyar, Navjot Singh. (2015, June). AReview of Decision Tree Algorithm: Big Data Analytics. International Journal of Informative & Futuristic Research. ISSN (Online): 2347-1697, pg. 6. 22
  • 23. 23