On distributed fuzzy decision trees for big data

CONTACT: PRAVEEN KUMAR. L (, +91 – 9791938249)
MAIL ID: sunsid1989@gmail.com, praveen@nexgenproject.com
Web: www.nexgenproject.com, www.finalyear-ieeeprojects.com
On Distributed Fuzzy Decision Trees for Big Data
Abstract:
Fuzzy decision trees (FDTs) have shown to be an effective solution in the framework of
fuzzy classification. The approaches proposed so far to FDT learning, however, have
generally neglected time and space requirements. In this paper, we propose a distributed FDT
learning scheme shaped according to the MapReduce programming model for generating
both binary and multi-way FDTs from big data. The scheme relies on a novel distributed
fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on
fuzzy information entropy. The fuzzy partitions are therefore used as input to the FDT
learning algorithm, which employs fuzzy information gain for selecting the attributes at the
decision nodes. We have implemented the FDT learning scheme on the Apache Spark
framework. We have used ten real-world publicly available big datasets for evaluating the
behavior of the scheme along three dimensions: i) performance in terms of classification
accuracy, model complexity and execution time, ii) scalability varying the number of
computing units and iii) ability to efficiently accommodate an increasing dataset size. We
have demonstrated that the proposed scheme turns out to be suitable for managing big
datasets even with modest commodity hardware support. Finally, we have used the
distributed decision tree learning algorithm implemented in the MLLib library and the Chi-
FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system,
for comparative analysis.
Existing System:
A decision tree can be generated efficiently from very large datasets. The various techniques
proposed in the literature can be roughly grouped into two categories, which are characterized
by performing a pre-sorting of the data or by adopting approximate representations of the
data such as samples and/or histograms [46]. While pre-sorting techniques are more accurate,
they cannot accommodate very large datasets or streaming data [46]. One of the oldest
approaches in the first category is SLIQ, proposed in [47]. SLIQ reduces decision tree
learning time without loss in accuracy by exploiting a pre-sorting technique in the tree-
growth phase. This technique is integrated with a breadth-first tree growing strategy to enable
classification of disk-resident datasets.
Proposed System:
we propose a distributed fuzzy discretizer and a distributed FDT (DFDT) learning scheme
upon the MapReduce programming model for managing big data. To the best of our
knowledge, in the context of big data, no distributed discretizer for generating fuzzy

partitions and no DFDT have been proposed in the literature. Our novel discretizer generates
a strong fuzzy partition for each continuous attribute by using a purposely adapted distributed
version of the well-known method proposed by Fayyad and Irani in [44]. The fuzzy partitions
computed by the discretizer are used as input to the DFDT learning algorithm. We adopt and
compare two different versions of the learning algorithm based on binary and multi-way
splits, respectively. Both the versions employ the information gain computed in terms of
fuzzy entropy for selecting the attribute to be adopted at each decision node.
Conclusions:
We have proposed a distributed fuzzy decision tree (FDT) learning scheme shaped according
to the MapReduce programming model for generating both binary (FBDT) and multiway
(FMDT) FDTs from big data. We have first introduced a novel distributed fuzzy discretizer,
which generates strong fuzzy partitions for each continuous attribute based on fuzzy
information entropy. Then, we have discussed a distributed implementation of an FDT
learning algorithm, which employs the fuzzy information gain for selecting the attributes to
be used in the decision nodes. We have implemented the FDT learning scheme on the Apache
Spark framework. we believe that the work presented in this paper is the first extensive study
on the application of FDTs to big data, considering both binary and multi-way splits. We
expect that the experimental results can be used as baseline for future research in this field.
REFERENCES
[1] R. Diao, K. Sun, V. Vittal, R. J. O’Keefe, M. R. Richardson, N. Bhatt, D. Stradford, and
S. K. Sarawgi, “Decision tree-based online voltage security assessment using PMU
measurements,” IEEE Transactions on Power Systems, vol. 24, no. 2, pp. 832–839, 2009.
[2] T. Goetz, The decision tree: Taking control of your health in the new era of personalized
medicine. Rodale Inc., 2010.
[3] Y. Zheng, L. Liu, L. Wang, and X. Xie, “Learning transportation mode from raw gps data
for geographic applications on the web,” in Proceedings of the 17th international conference
on World Wide Web, 2008, pp. 247–256.
[4] J. Han, M. Kamber, and J. Pei, Data mining: Concepts and techniques. Elsevier, 2011.
[5] L. Rokach and O. Maimon, Data mining with decision trees: Theory and applications.
World scientific, 2014.
[6] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106,
1986.

[7] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression
trees. CRC press, 1984.
[8] C. Z. Janikow, “Fuzzy decision trees: Issues and methods,” IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 28, no. 1, pp. 1–14, 1998.
[9] Y.-l. Chen, T. Wang, B.-s. Wang, and Z.-j. Li, “A survey of fuzzy decision tree
classifier,” Fuzzy Information and Engineering, vol. 1, no. 2, pp. 149–159, 2009.
[10] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 1993.
[11] X. Liu, X. Feng, and W. Pedrycz, “Extraction of fuzzy rules from fuzzy decision trees:
An axiomatic fuzzy sets (AFS) approach,” Data & Knowledge Engineering, vol. 84, pp. 1–
25, 2013.
[12] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, “The elements of statistical
learning: Data mining, inference and prediction,” The Mathematical Intelligencer, vol. 27, no.
2, pp. 83–85, 2005.
[13] H. Kim and W.-Y. Loh, “Classification trees with unbiased multiway splits,” Journal of
the American Statistical Association, pp. 589–604, 2011.
[14] F. Berzal, J.-C. Cubero, N. Marın, and D. Sanchez, “Building multi- ´ way decision trees
with numerical attributes,” Information Sciences, vol. 165, no. 1, pp. 73–90, 2004.
[15] Y. Yuan and M. J. Shaw, “Induction of fuzzy decision trees,” Fuzzy Sets and systems,
vol. 69, no. 2, pp. 125–139, 1995.
[16] R. Weber, “Fuzzy-ID3: A class of methods for automatic knowledge acquisition,” in Int.
Conf. on Fuzzy Logic & Neural Networks, 1992, pp. 265–268.
[17] M. Zeinalkhani and M. Eftekhari, “Fuzzy partitioning of continuous attributes through
discretization methods to construct fuzzy decision tree classifiers,” Information Sciences, vol.
278, pp. 715–735, 2014.
[18] S. Garcia, J. Luengo, J. A. Saez, V. Lopez, and F. Herrera, “A survey of ´ discretization
techniques: Taxonomy and empirical analysis in supervised learning,” IEEE Trans. on
Knowledge and Data Engineering, vol. 25, no. 4, pp. 734–750, 2013.

On distributed fuzzy decision trees for big data

Recommended

Recommended

More Related Content

What's hot

What's hot (17)

Similar to On distributed fuzzy decision trees for big data

Similar to On distributed fuzzy decision trees for big data (20)

Recently uploaded

Recently uploaded (20)

On distributed fuzzy decision trees for big data