This document discusses a systematic literature review of distributed data mining (DDM) research studies conducted between 2000 and 2015. The review aimed to map previous DDM research and identify gaps to motivate future work. It analyzed 486 studies to develop statistics on DDM research trends over time. Key findings included the most influential journals, the most active researchers, popular research topics, and commonly used datasets and methods. The review provided a taxonomy of the DDM field and conclusions to help researchers gain a comprehensive overview of the current state of DDM research.
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...PhD Assistance
Digital interventions are intended to produce and sustain positive change in health-related outcomes, including psychological, educational, behavioral, environmental, and social ones. They may be delivered on any digital device, such as a phone or computer, which can make them advantageous for the provider. Testing a digital intervention can yield complex, large-scale datasets of usage data. If analyzed properly, these data provide invaluable detail about how users interact with the intervention and inform the understanding of engagement. This paper recommends a framework for analyzing the usage associated with a digital intervention.
An Empirical Study of the Applications of Classification Techniques in Studen...IJERA Editor
University servers and databases store a huge amount of data for students and lecturers alike, including personal details, registration details, evaluation assessments, performance profiles, and more. The main problem facing system administrators and users is that this data grows every second and is stored on the servers in different types and formats, which makes it hard to learn about students from it. Predicting graduation and academic information, and maintaining the structure and content of courses according to previous results, therefore become important. The paper's objectives are to extract knowledge from incomplete data structures and to identify which data mining method or technique is suitable for extracting knowledge from a huge amount of student data, so that the administration can use technology to make quick decisions. Data mining aims to discover useful information or knowledge; this paper uses classification to discover knowledge from the student server database, where all students' information is registered and stored. The C4.5 decision tree classifier is used to predict students' final academic results (grades). The data cover a four-year period (2006-2009). Experimental results show that classification succeeded on the training set: the predicted instances are similar to the training set, which supports the suggested classification model. The C4.5 algorithm is also reported to be efficient and effective in predicting academic results. The model can also improve the efficiency of retrieving academic results and improve retrieval precision.
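As a rough illustration of how a C4.5-style tree chooses which attribute to split on, the sketch below computes the gain ratio, the split criterion C4.5 is known for. It is not the paper's implementation: the student records, the attribute names (`attendance`, `coursework`, `grade`), and the helper functions are all hypothetical and invented for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Gain ratio of splitting `rows` (a list of dicts) on `attribute`."""
    labels = [row[target] for row in rows]
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical student records, invented for illustration only.
students = [
    {"attendance": "high", "coursework": "good", "grade": "pass"},
    {"attendance": "high", "coursework": "poor", "grade": "pass"},
    {"attendance": "low", "coursework": "good", "grade": "fail"},
    {"attendance": "low", "coursework": "poor", "grade": "fail"},
]

# The attribute with the highest gain ratio becomes the root split.
best = max(["attendance", "coursework"], key=lambda a: gain_ratio(students, a, "grade"))
```

In this toy data `attendance` separates the grades perfectly while `coursework` carries no information, so `attendance` is selected. A full C4.5 implementation would also handle continuous attributes, missing values, and pruning, none of which are shown here.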
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...Editor IJCATR
In this paper we focus on some techniques for solving data mining tasks, namely statistics, decision trees, and neural networks. The new approach has succeeded in defining some new criteria for the evaluation process, and it has obtained valuable results based on what each technique is, the environment in which each technique is used, the advantages and disadvantages of each technique, the consequences of choosing any of these techniques to extract hidden predictive information from large databases, and the methods of implementing each technique. Finally, the paper presents some valuable recommendations in this field.
DATA MINING IN EDUCATION: A REVIEW ON THE KNOWLEDGE DISCOVERY PERSPECTIVE IJDKP
Knowledge Discovery in Databases (KDD) is the process of finding knowledge in massive amounts of data, with data mining at the core of this process. Data mining can be used to mine understandable, meaningful patterns from large databases, and these patterns may then be converted into knowledge. Data mining is the step of the KDD process that extracts information and patterns, which helps in crucial decision-making. Data mining works with a data warehouse, and the whole process is divided into an action plan to be performed on the data: selection, transformation, mining, and results interpretation. In this paper, we have reviewed the knowledge discovery perspective in data mining and consolidated different areas of data mining, along with its techniques and methods.
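The four KDD stages named in this abstract (selection, transformation, mining, and results interpretation) can be sketched as a small pipeline. This is only a toy outline under stated assumptions: the records, thresholds, field names, and function names are hypothetical, and the "mining" step is a simple co-occurrence count rather than any specific algorithm from the paper.

```python
from collections import Counter

# Hypothetical raw records, invented for illustration only.
records = [
    {"id": 1, "hours": 12, "score": 81},
    {"id": 2, "hours": 3, "score": 42},
    {"id": 3, "hours": 10, "score": 77},
    {"id": 4, "hours": 2, "score": 65},
]

def select(rows, fields):
    """Selection: keep only the fields relevant to the question."""
    return [{f: row[f] for f in fields} for row in rows]

def transform(rows):
    """Transformation: discretise numeric values into categories."""
    return [
        ("many_hours" if r["hours"] >= 8 else "few_hours",
         "high_score" if r["score"] >= 60 else "low_score")
        for r in rows
    ]

def mine(pairs):
    """Mining: count how often each pair of categories occurs together."""
    return Counter(pairs)

def interpret(counts):
    """Interpretation: report the most frequent co-occurrence."""
    (hours, score), support = counts.most_common(1)[0]
    return f"{hours} most often goes with {score} ({support} records)"

# Chain the four stages in the order the abstract lists them.
summary = interpret(mine(transform(select(records, ["hours", "score"]))))
```

Keeping each stage as a separate function mirrors the idea that the output of one KDD step is the input of the next, so a stage can be swapped out without touching the others.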
The Survey of Data Mining Applications And Feature Scope IJCSEIT Journal
In this paper we focus on a variety of techniques, approaches, and research areas that are helpful and regarded as important in the field of data mining technologies. Many multinational companies and large organizations operate in different places in different countries, and each place of operation may generate large volumes of data. Corporate decision makers require access to all such sources to take strategic decisions. The data warehouse delivers significant business value by improving the effectiveness of managerial decision-making. In an uncertain and highly competitive business environment, the value of strategic information systems such as these is easily recognized; however, in today's business environment, efficiency or speed is not the only key to competitiveness. Huge amounts of data, on the order of terabytes to petabytes, are now available, which has drastically changed the areas of science and engineering. To analyze, manage, and make decisions from such huge amounts of data, we need data mining techniques, which are transforming many fields. This paper presents a number of applications of data mining and also focuses on the scope of data mining, which will be helpful in further research.
A huge repository of terabytes of data is generated every day by modern information systems and digital technologies such as the Internet of Things and cloud computing. Analysis of these massive data requires a lot of effort at multiple levels to extract knowledge for decision making. Hence, big data analysis is a current area of research and development. The basic objective of this paper is to explore the potential impact of big data challenges and the various tools associated with them. Accordingly, this article provides a platform for exploring big data at various stages. Moreover, it opens a new horizon for researchers to develop solutions based on the challenges and open research issues.
Big data is a prominent term that characterizes the growth and availability of data in all three formats: structured, unstructured, and semi-structured. Structured data is located in a fixed field of a record or file and is found in relational databases and spreadsheets, whereas unstructured data includes text and multimedia content. The primary objective of the big data concept is to describe extreme volumes of data sets, both structured and unstructured. It is further defined by three “V” dimensions, namely Volume, Velocity, and Variety, to which two more have been added: Value and Veracity. Volume denotes the size of the data, Velocity the speed of data processing, Variety the types of data, Value the business value derived, and Veracity the quality and understandability of the data. Big data has become a distinctive and preferred research area in computer science. Many open research problems exist in big data, and good solutions have been proposed by researchers, yet many new techniques and algorithms for big data analysis still need to be developed to reach optimal solutions. This paper presents a detailed study of big data: its basic concepts, history, applications, techniques, research issues, and tools.
6 ijaems sept-2015-6-a review of data security primitives in data mining INFOGAIN PUBLICATION
This paper discusses various issues and security primitives in the area of data mining, such as Spatial Data Handling, Privacy Protection of Data, Data Load Balancing, and Resource Mining. A 5-stage review process was conducted for 30 research papers published between 1996 and 2013. After an exhaustive review, nine key issues were found, including Spatial Data Handling, Data Load Balancing, Resource Mining, Visual Data Mining, Data Clusters Mining, Privacy Preservation, Mining of gaps between business tools and patterns, and Mining of hidden complex patterns, which have been resolved and explained with proper methodologies. Several solution approaches are discussed in the 30 papers. This paper presents the outcome of the review as a set of findings grouped under the key issues. The findings include the algorithms and methodologies used by researchers, along with their strengths and weaknesses and the scope for future work in the area.
Ontology Based PMSE with Manifold Preference IJCERT
International journal from http://www.ijcert.org
IJCERT Standard on-line Journal
ISSN(Online):2349-7084,(An ISO 9001:2008 Certified Journal)
IJCERT (ISSN 2349–7084 (Online)) is approved by National Science Library (NSL), National Institute of Science Communication And Information Resources (NISCAIR), Council of Scientific and Industrial Research, New Delhi, India.
Data Mining System and Applications: A Review ijdpsjournal
In the information technology era, information plays a vital role in every sphere of human life. It is very important to gather data from different data sources, store and maintain the data, generate information and knowledge, and disseminate data, information, and knowledge to every stakeholder. Owing to the widespread use of computers and electronic devices and the tremendous growth in computing power and storage capacity, data collection has grown explosively. Storing the data in a data warehouse enables the entire enterprise to access a reliable, current database. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining systems and some of their applications.
There are numerous ways to analyse web information. Web content is generally housed in large data sets, and basic queries are used to parse them. As demands increased over time, web data mining evolved to meet the challenging tasks of web analysis. Machine learning methodologies are the most recent to enter these analysis processes. Approaches such as decision trees, association rules, metaheuristics, and basic learning methods are adopted for assessing web data and mining data from various web instances. This study highlights these approaches from the perspective of web analytics. One of the prime goals of this research is to explore more data mining approaches alongside machine learning systems, and to describe the emerging collaboration of web analytics with artificial intelligence.
of the Economy and Finance | In charge of financial and moneta...aminellaoui
of the Economy and Finance | In charge of financial and monetary matters
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A gigantic archive of terabytes of information is created every day from current data frameworks and computerized advances, for example, Internet of Things and distributed computing. Examination of these gigantic information requires a ton of endeavors at various levels to extricate information for dynamic. Hence, huge information examination is an ebb and flow region of innovative work. The essential goal of this paper is to investigate the likely effect of huge information challenges, and different instruments related with it. Accordingly, this article gives a stage to investigate enormous information at various stages. Moreover, it opens another skyline for analysts to build up the arrangement, in light of the difficulties and open exploration issues.
Big data is a prominent term which characterizes the improvement and availability of data in all three
formats like structure, unstructured and semi formats. Structure data is located in a fixed field of a record
or file and it is present in the relational data bases and spreadsheets whereas an unstructured data file
includes text and multimedia contents. The primary objective of this big data concept is to describe the
extreme volume of data sets i.e. both structured and unstructured. It is further defined with three “V”
dimensions namely Volume, Velocity and Variety, and two more “V” also added i.e. Value and Veracity.
Volume denotes the size of data, Velocity depends upon the speed of the data processing, Variety is
described with the types of the data, Value which derives the business value and Veracity describes about
the quality of the data and data understandability. Nowadays, big data has become unique and preferred
research areas in the field of computer science. Many open research problems are available in big data
and good solutions also been proposed by the researchers even though there is a need for development of
many new techniques and algorithms for big data analysis in order to get optimal solutions. In this paper,
a detailed study about big data, its basic concepts, history, applications, technique, research issues and
tools are discussed.
Big data is a prominent term which characterizes the improvement and availability of data in all three
formats like structure, unstructured and semi formats. Structure data is located in a fixed field of a record
or file and it is present in the relational data bases and spreadsheets whereas an unstructured data file
includes text and multimedia contents. The primary objective of this big data concept is to describe the
extreme volume of data sets i.e. both structured and unstructured. It is further defined with three “V”
dimensions namely Volume, Velocity and Variety, and two more “V” also added i.e. Value and Veracity.
Volume denotes the size of data, Velocity depends upon the speed of the data processing, Variety is
described with the types of the data, Value which derives the business value and Veracity describes about
the quality of the data and data understandability. Nowadays, big data has become unique and preferred
research areas in the field of computer science. Many open research problems are available in big data
and good solutions also been proposed by the researchers even though there is a need for development of
many new techniques and algorithms for big data analysis in order to get optimal solutions. In this paper,
a detailed study about big data, its basic concepts, history, applications, technique, research issues and
tools are discussed.
Big data is a prominent term which characterizes the improvement and availability of data in all three
formats like structure, unstructured and semi formats. Structure data is located in a fixed field of a record
or file and it is present in the relational data bases and spreadsheets whereas an unstructured data file
includes text and multimedia contents. The primary objective of this big data concept is to describe the
extreme volume of data sets i.e. both structured and unstructured. It is further defined with three “V”
dimensions namely Volume, Velocity and Variety, and two more “V” also added i.e. Value and Veracity.
Volume denotes the size of data, Velocity depends upon the speed of the data processing, Variety is
described with the types of the data, Value which derives the business value and Veracity describes about
the quality of the data and data understandability. Nowadays, big data has become unique and preferred
research areas in the field of computer science. Many open research problems are available in big data
and good solutions also been proposed by the researchers even though there is a need for development of
many new techniques and algorithms for big data analysis in order to get optimal solutions. In this paper,
a detailed study about big data, its basic concepts, history, applications, technique, research issues and
tools are discussed.
Big data is a prominent term which characterizes the improvement and availability of data in all three formats like structure, unstructured and semi formats. Structure data is located in a fixed field of a record or file and it is present in the relational data bases and spreadsheets whereas an unstructured data file includes text and multimedia contents. The primary objective of this big data concept is to describe the extreme volume of data sets i.e. both structured and unstructured. It is further defined with three “V” dimensions namely Volume, Velocity and Variety, and two more “V” also added i.e. Value and Veracity. Volume denotes the size of data, Velocity depends upon the speed of the data processing, Variety is described with the types of the data, Value which derives the business value and Veracity describes about the quality of the data and data understandability. Nowadays, big data has become unique and preferred research areas in the field of computer science. Many open research problems are available in big data and good solutions also been proposed by the researchers even though there is a need for development of many new techniques and algorithms for big data analysis in order to get optimal solutions. In this paper, a detailed study about big data, its basic concepts, history, applications, technique, research issues and tools are discussed.
6 ijaems sept-2015-6-a review of data security primitives in data miningINFOGAIN PUBLICATION
This paper has discussed various issues and security primitives like Spatial Data Handing, Privacy Protection of data, Data Load Balancing, Resource Mining etc. in the area of Data Mining.A 5-stage review process has been conductedfor 30 research papers which were published in the period of year ranging from 1996 to year 2013. After an exhaustive review process, nine key issues were found “Spatial Data Handing, Data Load Balancing, Resource Mining ,Visual Data Mining, Data Clusters Mining, Privacy Preservation, Mining of gaps between business tools & patterns, Mining of hidden complex patterns.” which have been resolved and explained with proper methodologies. Several solution approaches have been discussed in the 30 papers. This paper provides an outcome of the review which is in the form of various findings, found under various key issues. The findings included algorithms and methodologies used by researchers along with their strengths and weaknesses and the scope for the future work in the area.
Data Mining System and Applications: A Review (ijdpsjournal)
In the information technology era, information plays a vital role in every sphere of human life. It is very important to gather data from different data sources, store and maintain the data, generate information and knowledge, and disseminate data, information and knowledge to every stakeholder. Owing to the widespread use of computers and electronic devices and the tremendous growth in computing power and storage capacity, data collection has grown explosively. Storing the data in a data warehouse enables the entire enterprise to access a reliable, current database. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining systems and some of their applications.
There are numerous ways to analyse web information. Generally, web content is housed in large data sets, and basic queries are used to parse them. As demands increased over time, web data mining evolved to meet the challenges of web analysis. Machine learning methodologies are the most recent to enter these analysis processes. Different approaches such as decision trees, association rules, metaheuristics and basic learning methods are adopted for assessing web data and mining data from various web instances. This study highlights these approaches from the perspective of web analysis. One of the prime goals of this work is to investigate more data mining approaches alongside machine learning systems, and to describe the emerging collaboration of web analytics with artificial intelligence.
A Survey And Taxonomy Of Distributed Data Mining Research Studies A Systematic Literature Review
Abstract— Context: Data mining (DM) methods have evolved year by year, and today there is an enhancement of the DM technique, called Distributed Data Mining (DDM), that can run several times faster than the traditional approach. It is not actually a new field in data processing, but in recent years many researchers have been paying more attention to this area. Problem: The number of publications on DDM in high-reputation journals and conferences has increased significantly, which makes it difficult for researchers to gain a comprehensive view of the DDM areas that require further research. Solution: We conducted a systematic literature review to map the previous research in the DDM field. Our objective is to motivate new research by identifying the gaps in the DDM field as well as its hot areas. Result: Our analysis reached several conclusions by answering the 7 research questions proposed in this literature review. In addition, a taxonomy of the DDM research area is presented in this paper. Finally, this systematic literature review provides statistics on the development of DDM from 2000 to 2015, which will help future researchers gain a comprehensive overview of the current state of DDM.
Index Terms— association rules, classification, clustering, data mining, distributed data mining, parallel data mining.
I. INTRODUCTION
Recently, centralized data mining techniques have been commonly used to analyze large corporate or scientific data stored in databases [1]. The main challenge in data mining is to find the relationships among data quickly and correctly [2]. The emergence of large and big data places a heavy processing load on a single computer that must complete the calculation task. The significant day-by-day growth in data volume therefore forces researchers to provide more advanced methods or strategies to solve this problem.
Over the last few years, parallel and distributed computing has become more popular, mainly for data processing and information extraction. Distributed computing, which emerged several years ago, can deal with this problem, in which the mined data no longer range only from Megabytes to Gigabytes but reach Terabytes and Petabytes or more. Social media and web services produce an enormous amount of data, reaching the scale of Petabytes daily. The existence of large datasets and the need to process that information quickly make the use of distributed or parallel computing really important today [3].
Fauzi Adi Rafrastara is with the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China, 510006, on leave from Dian Nuswantoro University, Semarang, Indonesia. E-mail: fauzi_adi@yahoo.co.id.
Qi Deyu is a Professor in the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China, 510006. E-mail: qideyu@scut.edu.cn.
Commodity hardware can now easily be connected into clusters to run complex tasks in a distributed environment. Combining data mining with distributed computing can improve the performance of data mining algorithms, especially on large and distributed datasets. The emergence of DDM has recently become extremely important. It focuses on data analysis in a distributed environment while paying attention to several issues related to computation, storage, data communication and human-computer interaction [4].
This paper discusses the current research on distributed data mining (DDM). We downloaded and reviewed 486 high-quality research studies to provide statistics, a mind map, and a taxonomy of the current state of DDM research. This paper consists of 4 sections. The first section is the introduction. The second section discusses the methodology that we used. The statistical results are shown and explained in Section 3. The last section provides the conclusion of this research.
II. RESEARCH METHOD
To review the literature in the DDM field, a systematic methodology was applied in this work. The Systematic Literature Review (SLR) was initially a well-known systematic review approach in the software engineering area [5][6][7][8][9] and is now becoming more popular in other computer science fields as well, such as cloud computing [10], distributed computing [11], and internet technology [12].
The main objective of an SLR is to present a correct assessment, identification and interpretation of all available research evidence on the research topic being studied, using a reliable, rigorous and auditable methodology. Finally, an SLR can answer the specific research questions based on the data collected after completing the review process [5][11][9].
The review method applied in this work follows the guidelines proposed by Kitchenham and Charters [11], and was also inspired by some other researchers [9][5][8][12][10].
A Survey and Taxonomy of Distributed Data Mining Research Studies: A Systematic Literature Review
Fauzi Adi Rafrastara, Graduate Student Member, IEEE, Qi Deyu
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 5, May 2016
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
A. Review Method
In this work, the SLR is divided into 3 stages: planning, conducting and reporting the literature review. The first stage involves 3 steps. In step 1, we identify the requirements for a systematic review. In step 2, we develop the review protocol, which serves as the foundation for obtaining sharp results and reducing the possibility of researcher bias. In this step, the research questions are constructed and the search strategy, inclusion and exclusion criteria, quality assessment, and finally the data extraction and synthesis process are defined. All of these parts of the review protocol are discussed in Sections 2.1, 2.2, 2.3, 2.4, 2.5 and 2.6. In step 3, we evaluate the developed review protocol. This evaluation is done in the planning stage and improved iteratively during the conducting and reporting stages.
B. Research Question
The Research Questions (RQs) are formulated mainly to keep the review focused. Following the PICOC criteria introduced by Kitchenham and Charters [11], we present the summary of the Population, Intervention, Comparison, Outcomes, and Context of our research in Table 1.
Based on the PICOC summary in Table 1, we develop the research questions and motivations of this literature review, as shown in Table 2.
RQ1 to RQ3 are constructed to help researchers evaluate the context of the primary studies. They provide a summary and synopsis of particular publications, authors and research areas in the DDM field. RQ4 to RQ7, on the other hand, are the main research questions of this literature review. They concern the datasets, popular methods and newly proposed methods in this field.
To give a simpler illustration of our research questions, the basic mind map of this systematic literature review is provided in Figure 2.
Fig. 1. The steps of SLR Process.
TABLE 1
SUMMARY OF PICOC
Population | Distributed system, parallel system
Intervention | Data mining, methods, algorithms, techniques, datasets
Comparison | -
Outcomes | Successful DDM methods
Context | Studies in industry and academia
TABLE 2
RESEARCH QUESTIONS AND MOTIVATIONS OF LITERATURE REVIEW
ID | Research Question | Motivation
RQ1 | Which journal is the most significant DDM journal? | Identifying the most significant journal in DDM
RQ2 | Who are the most active and influential researchers in the DDM field? | Identifying the most active and influential researchers who have contributed the most to the DDM field
RQ3 | What kinds of research topics are selected by researchers in the DDM field? | Identifying the research topics and trends in DDM
RQ4 | What kinds of datasets are commonly used for DDM? | Identifying the datasets that are commonly used in DDM
RQ5 | What kinds of methods are used for DDM? | Identifying opportunities and trends for DDM methods
RQ6 | What kinds of methods are the most used for DDM? | Identifying the most used methods in DDM
RQ7 | What kinds of method improvements are proposed for DDM? | Identifying the proposed method improvements for DDM
Fig. 2. Mind Map of Research Questions.
C. Search Process
In the conducting stage, the first step is searching for primary studies. It consists of several activities, such as selecting digital libraries, defining the search string and retrieving the high-quality papers related to the research topic being discussed. To find high-quality papers, the digital libraries must be specified first. Three well-known literature databases in the field of computer science are selected and listed as follows:
1) ACM Digital Library (dl.acm.org)
2) IEEE Xplore (ieeexplore.ieee.org)
3) ScienceDirect (sciencedirect.com)
A specific search string is used to collect the articles from those three digital libraries. The search string is developed through the following steps:
1) Defining the search terms based on the PICOC criteria, especially Population and Intervention.
2) Defining the search terms from the research questions.
3) Defining the search terms in relevant titles, abstracts and keywords.
4) Defining the synonyms, antonyms and alternative spellings of the search terms.
5) Constructing the advanced search string from the identified search terms using Boolean AND and OR.
The resulting search string is as follows:
(Distributed OR parallel) AND ((“data mining”) AND (method* OR algorithm OR datasets))
The search string is adjusted depending on the specific requirements of each digital library. However, the original search string is kept to avoid a significant increase in irrelevant studies. During the search process, the search string is applied to the title, abstract and keywords of the documents. The studies selected in this literature review come from high-reputation journals obtained from 3 popular online digital libraries: ScienceDirect, IEEE Xplore, and ACM. The search process was conducted at the end of May 2015, covering the papers published from January 2000 to May 2015.
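The way the search string is assembled from groups of terms can be sketched in a few lines of Python. This is only an illustration: the function names `or_group` and `build_search_string` are hypothetical, straight quotes replace the typographic ones, and each digital library would still need its own syntax adjustments, as noted above.

```python
def or_group(terms):
    """Join alternative terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = ['"%s"' % term if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_search_string(population, topic, intervention):
    """Combine the term groups with AND, mirroring the structure of the review's query."""
    return "%s AND (%s AND %s)" % (
        or_group(population),
        or_group(topic),
        or_group(intervention),
    )


# Term groups taken from the search string reported in the text.
query = build_search_string(
    population=["Distributed", "parallel"],
    topic=["data mining"],
    intervention=["method*", "algorithm", "datasets"],
)
print(query)
```

Keeping the term groups as lists makes it easier to add synonyms or alternative spellings (step 4) without rewriting the Boolean structure by hand.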
D. Paper Selection
Primary studies are selected by applying the inclusion and exclusion criteria to the retrieved articles. Table 3 shows the criteria for accepting and rejecting the documents under review.
The search results are stored and managed using a software package called Mendeley (http://mendeley.com). Paper selection has 5 stages, described as follows:
1) Applying the search query to all digital libraries.
2) Excluding invalid and duplicate documents.
3) Applying the inclusion and exclusion criteria to the papers' titles, abstracts and keywords.
4) Applying the inclusion and exclusion criteria to the introduction and conclusion of the papers.
5) Reviewing the selected documents and applying the inclusion and exclusion criteria to the full text.
Table 4 shows the stages along with the digital libraries and the number of studies identified. All of the included studies are listed in Table A-4 in the Appendix.
III. RESEARCH RESULT
A. Significant Journal Publications
Among the 85 final studies downloaded from the 3 digital libraries, 34 journal names and 4 publishers were identified. According to the Scimago Journal Ranking (SJR) (http://scimagojr.com), the journals of those papers vary in SJR indicator and quartile category. 2 journals, Future Generation Computer Systems and Journal of Parallel and Distributed Computing, each published more than 10 articles on DDM; both are published by Elsevier. 17 journals each published only 1 paper related to this topic. The detailed SJR statistics of the selected studies are shown in Table A-5 in the Appendix.
B. Most Active and Influential Researchers
According to the collected data, 251 researchers were involved in publishing the 85 high-quality papers on DDM. However, only 17 researchers published 2 or more papers in ACM, IEEE or ScienceDirect. Fig. 4 shows the names of the researchers who published more than 1 paper in the DDM area. Only 4 of these researchers appear as first author, and only one of them published as many as 3 papers as first author. Jaideep Vaidya and Domenico Talia are the most active researchers in this area with 3 publications each, and Jaideep Vaidya published all 3 of his papers as first author.
TABLE 3
INCLUSION AND EXCLUSION CRITERIA
Inclusion Criteria | Studies that discuss data mining techniques in a distributed environment.
Inclusion Criteria | Studies that discuss the improvement or implementation of DDM and conduct an experiment using at least 1 DDM algorithm.
Inclusion Criteria | Studies published in a journal, transaction or high-quality conference.
Inclusion Criteria | For studies that have duplicate publications, only the most complete and newest one is included.
Inclusion Criteria | Studies published from January 2000 to May 2015.
Exclusion Criteria | Studies that discuss data mining but do not use a distributed/parallel system.
Exclusion Criteria | Studies that discuss a distributed/parallel system but do not use data mining techniques.
Exclusion Criteria | Studies without an experimental process and results using at least 1 DDM algorithm. A demonstration product or software is not considered an experimental process.
Exclusion Criteria | Studies whose data are not text or numbers. Graph and image data are excluded.
Exclusion Criteria | Studies not written in English.
TABLE 4
STUDIES SELECTION
No. | Publisher | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5
1. | SD | 336 | 328 | 82 | 68 | 68
2. | ACM | 35 | 20 | 9 | 6 | 6
3. | IEEE | 97 | 33 | 14 | 11 | 11
Total | | 468 | 381 | 105 | 85 | 85
Fig. 3. The growth of studies about DDM year by year
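The per-stage counts reported in Table 4 can be cross-checked with a short Python sketch. The numbers are copied from the table; the helper names `column_totals` and `retention` are hypothetical and only serve this illustration.

```python
# Stage-by-stage counts copied from Table 4 (one list per digital library).
STAGE_COUNTS = {
    "SD": [336, 328, 82, 68, 68],
    "ACM": [35, 20, 9, 6, 6],
    "IEEE": [97, 33, 14, 11, 11],
}
# The "Total" row as printed in Table 4.
REPORTED_TOTALS = [468, 381, 105, 85, 85]


def column_totals(counts):
    """Sum the libraries' counts stage by stage."""
    return [sum(stage) for stage in zip(*counts.values())]


def retention(totals):
    """Percentage of the stage-1 studies still present at each stage."""
    first = totals[0]
    return [round(100.0 * total / first, 1) for total in totals]


totals = column_totals(STAGE_COUNTS)
print(totals == REPORTED_TOTALS)  # the rows add up to the printed totals
print(retention(totals))          # share of studies surviving each stage
```

Under these figures, roughly 18% of the studies returned by the initial query remain after the final stage.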
C. Research Topics in DDM
In this section, we group the papers' contents into 3 categories: Improvement, Implementation and Parallelism. Improvement means that the researchers proposed a novelty or an improvement in the DDM area, mostly an improvement of an existing algorithm or an improvement in data security.
Implementation means that the researchers applied current DDM technology to satisfy their needs. This can be the implementation of an existing DDM algorithm in a different area, such as computer science, medicine, or even transportation.
Parallelism, in turn, is the researchers' effort to convert conventional data mining into a distributed form. They modified and improved a DM algorithm into a DDM algorithm, and then compared the two to show that the DDM algorithm performs better than the conventional one.
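To make the idea of parallelism concrete, the following Python sketch shows one common way a conventional algorithm is turned into a distributed one, using a single k-means iteration as the example. It is not the method of any reviewed study: the function names are hypothetical, the data partitions stand in for data held on separate nodes, and a thread pool merely simulates the workers.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def partial_stats(partition, centroids):
    """Local step: per-cluster coordinate sums and counts for one data partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def merge_and_update(stats, centroids):
    """Global step: combine the partial sums and counts into new centroids."""
    dim = len(centroids[0])
    updated = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in stats)
        if total == 0:
            updated.append(tuple(old))  # keep an empty cluster's centroid unchanged
            continue
        updated.append(
            tuple(sum(sums[k][d] for sums, _ in stats) / total for d in range(dim))
        )
    return updated


def distributed_kmeans_step(partitions, centroids, workers=2):
    """One k-means iteration in which every partition is processed by its own worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(lambda part: partial_stats(part, centroids), partitions))
    return merge_and_update(stats, centroids)


# Toy data split across two hypothetical nodes.
partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
print(distributed_kmeans_step(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the small per-cluster sums and counts travel between workers and coordinator, never the raw points, which is the property that lets this kind of conversion scale with the data size.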
Regarding the improvement category, we conclude that researchers target three areas: efficiency, effectiveness and security. Most researchers focus on improving efficiency in DDM, followed by improving security and effectiveness.
Improvement of efficiency covers speedup, resource & cost, and scalability. Speedup focuses on improving the speed of processing data and collecting results. Resource & cost concerns reducing the resources involved in data processing as well as minimizing the cost of the project. Improvement of scalability concerns technology that can be scaled up easily to meet the need for fast processing of much bigger data later on. In total, 25 papers discuss improvement in efficiency, and 17 of them focus on speedup. The distribution of papers on improvement of efficiency is illustrated in Fig. 6 and Fig. 7.
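Speedup and scalability are usually quantified with two simple ratios. The sketch below uses the conventional textbook definitions, which the review itself does not spell out, so both the formulas and the function names are assumptions for illustration; the timings are hypothetical.

```python
def speedup(serial_time, parallel_time):
    """Speedup S = T_serial / T_parallel, with both times in the same unit."""
    if serial_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return serial_time / parallel_time


def parallel_efficiency(serial_time, parallel_time, workers):
    """Efficiency E = S / p: the fraction of ideal linear speedup achieved on p workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(serial_time, parallel_time) / workers


# Hypothetical timings in seconds, not taken from any reviewed study.
print(speedup(100.0, 25.0))
print(parallel_efficiency(100.0, 25.0, 8))
```

An efficiency that stays close to 1 as workers are added is one common way to express that a method scales well.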
Improvement of effectiveness in the DDM research field, on the other hand, mainly focuses on the level of accuracy of the mining process and its results. The information produced by DDM technology should become steadily more accurate so that it can support better decisions. Few researchers pay attention to this area: only 4 studies proposed an enhancement in terms of accuracy.
Regarding the enhancement of security in DDM, 9 studies were identified as proposing a new level of security. All of them discuss privacy preservation.
In 2015, Loh & Yu [13] introduced CudaSCAN, which improves the performance of the DBSCAN algorithm by adding the power of a GPU accelerator. They did not simply add GPU technology; they enhanced the use of the GPU so that it performs better in terms of efficiency (speedup) than CUDA-DClust, the existing GPU-based DBSCAN [14]. Cuzzocrea et al. also addressed the hottest topic here (speedup) by proposing Tree-based Distributed Uncertain Frequent Itemset Mining [15]. This algorithm is mainly used to mine constrained frequent itemsets from distributed uncertain data.
An enhancement in accuracy was made by Di Fatta et al. [16], who improved the k-means algorithm into the so-called Epidemic K-Means algorithm. It is a fully distributed K-Means method that does not require global communication and is intrinsically fault tolerant. The authors of [17] also proposed a novel idea in the security area, using harmony search and a pruning ensemble for malware detection. According to their experiment, this algorithm outperforms the existing ensemble algorithm in terms of detection accuracy.
In the security field, algorithms have been improved by adding a privacy-preserving feature to several DDM algorithms, such as Naive Bayes [18], ID3 [19], Random Forest [20], Apriori [21], Back Propagation [22], etc.
Fig. 4. Most active and influential researchers in DDM field.
Fig. 5. DDM Research paper categories
Fig. 6. Statistic of studies of improvement.
Fig. 7. Statistic of studies of improvement in efficiency.
Since the term data mining was first introduced in the computer science field in 1989 [23], it is natural that DDM is much better known in computer science than in other fields of science. Fig. 8 shows that the gap between the number of studies discussing DDM in Computer Science (CS) and in other disciplines is very wide. 82 studies are purely about CS, and only 2 studies involve a collaboration between CS and medicine. Only 1 paper involving the transportation field was recorded in this survey.
In 2014, Zheng & Wang parallelized the Pruning Eclat algorithm using MapReduce to study a method for road transport management information [23]. Through parallelism, the algorithm achieved better performance, reducing the time spent by more than 40% compared with the conventional version. Parallel implementations are also used beyond the computer science field, for example in medicine, where Rausch et al. in 2008 parallelized a Genetic Algorithm to analyze large datasets. They proposed this technique to discover patterns in genetic markers that indicate a tendency toward multifactorial disease [24]. Olejnik et al. also carried out cross-field research when they implemented a parallel algorithm in the medical science field. Their paper applied the Clustering Distributed Progressive algorithm to the DiabCare medical database [25].
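As background for the Eclat-based work mentioned above, the following Python sketch shows plain Eclat with a vertical (item to transaction-id set) layout. It is not the Pruning Eclat variant of Zheng & Wang and does not use a MapReduce framework; the function names are hypothetical. It is meant only to show why the algorithm parallelizes naturally: each top-level prefix can be mined as an independent task.

```python
def vertical_layout(transactions):
    """Map every item to the set of transaction ids (tidset) that contain it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def _extend(prefix, candidates, min_support, found):
    """Depth-first growth of a prefix by intersecting tidsets."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < min_support:
            continue
        itemset = prefix + (item,)
        found[itemset] = len(tids)
        narrower = [(other, tids & other_tids) for other, other_tids in candidates[i + 1:]]
        _extend(itemset, narrower, min_support, found)


def mine_branch(index, items, min_support):
    """Mine all frequent itemsets whose first item is items[index]: one independent task."""
    item, tids = items[index]
    found = {}
    if len(tids) < min_support:
        return found
    found[(item,)] = len(tids)
    extensions = [(other, tids & other_tids) for other, other_tids in items[index + 1:]]
    _extend((item,), extensions, min_support, found)
    return found


def mine(transactions, min_support):
    """Run every branch and merge the results; the branches could run on separate workers."""
    items = sorted(vertical_layout(transactions).items())
    result = {}
    for index in range(len(items)):
        result.update(mine_branch(index, items, min_support))
    return result


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(mine(baskets, min_support=2))
```

Because the branches never share state, a distributed version only has to ship the relevant tidsets to each worker and merge the returned dictionaries.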
In the computer science area in particular, the number of identified studies is large, so we break it down into 8 categories, including Network & Internet, Software Engineering, Security & Privacy, Hardware Acceleration, Bioinformatics, E-Government, and General Data Mining Area. The topic distribution of the DDM studies can be seen in Figure 9. Network & Internet, Security & Privacy, and Hardware Acceleration can be considered hot topics besides the general DDM field. 35 studies were identified as focusing on the general DDM area. This topic covers DDM method improvement, implementation, or parallelism without reference to any other computer science field.
D. Datasets Used in DDM
This section discusses in more depth the datasets used by researchers in DDM. According to how a dataset is accessed, we classify datasets into two categories: public and private. These classifications are derived from the datasets used and mentioned by the researchers in their studies. 42% of the studies use private datasets, whereas 38% use public ones. Surprisingly, 16% of the studies involve both public and private datasets. The remaining 4% are considered to use unknown datasets, since they do not mention which datasets they use. The detailed statistics on datasets and research papers can be seen in Fig. 10 (Left).
Most of the public datasets recorded in this survey come from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html).
Regarding the origin of the datasets, we divide them into two groups: real-world and synthetic. A real-world dataset means the researchers captured real data directly from its source, whereas a synthetic dataset means the data were obtained from a simulation or created by the researchers themselves. 47% of the studies use real-world datasets, whereas 34% use synthetic ones. Interestingly, 16% of the studies use both real-world and synthetic data in their experiments.
37 studies use the UCI repository, and 120 different UCI datasets were downloaded for their experiments. Iris and Mushroom are the most popular UCI datasets, since they are used in 9 different DDM papers. In addition, 9 papers with private datasets generated their data using the IBM Generator Tool.
E. Methods Used in DDM
According to Luo et al., the algorithm library layer of data
mining is composed of three main components:
association, classification, and clustering [1]. This paper
follows their idea and breaks the DDM algorithms down
into the same 3 main problems.
Fig. 8. Statistics of DDM studies in CS and non-CS fields.
Fig. 9. Statistics of DDM studies in the CS field.
Fig. 10. Dataset statistics based on access (Left) and origin (Right).
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 5, May 2016
16 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
Based on this review, the 85 DDM studies involve 192
algorithm occurrences in total, because some studies use the
same popular algorithms, such as Apriori, k-means, kNN, and
decision trees. Counting each method once, we found 137 different
methods used in DDM research, covering association,
classification, and clustering algorithms.
The detailed statistics of the methods involved are listed in Table A-2
(Appendix). Fig. 12 illustrates the distribution of methods
according to data mining category.
We found that 53 different clustering
algorithms, 42 classification algorithms, and 42 association
algorithms are discussed in the 85 reviewed research
papers. Some studies therefore address more
than one DDM problem in a single paper. In 2013,
Villar et al. ran an experiment involving a genetic algorithm
(classification) and a hill climbing algorithm (clustering) to obtain
the optimal internal configuration of all the switches in the
network of large supercomputers running parallel
applications [26]. In another article, Ericson and
Pallickara studied 4 clustering algorithms and 2 classification
algorithms by benchmarking them with Mahout back-end
code [27].
As mentioned at the beginning of this section, 192
algorithm occurrences are found in our 85 main research papers.
Most of them are clustering algorithms, with 74 occurrences,
followed by classification and association with 63 and 55
occurrences respectively. The detailed statistics of the DDM
methods reviewed in this paper are listed in Table A-2 (Appendix).
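The counts reported in this section can be cross-checked with a few lines of arithmetic. The following minimal sketch uses only the numbers stated above; the variable names and the derived percentage shares are our own illustration and do not appear in the reviewed studies.

```python
# Counts as reported in this section; no new data is introduced.
distinct_methods = {"clustering": 53, "classification": 42, "association": 42}
algorithm_uses = {"clustering": 74, "classification": 63, "association": 55}

# Number of different methods and number of algorithm occurrences overall.
total_distinct = sum(distinct_methods.values())
total_uses = sum(algorithm_uses.values())

# Share of each category among all algorithm occurrences, in percent
# (a derived figure, shown for illustration only).
shares = {name: round(100 * n / total_uses, 1) for name, n in algorithm_uses.items()}

print(total_distinct, total_uses)
print(shares)
```

The two totals agree with the 137 different methods and the 192 algorithm occurrences stated in the text.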
F. Most Used Methods in DDM
Using the statistics from the previous section, we can now
report the most used methods in DDM
research (see Table 4). In the Association category, 5 algorithms
are used more than once. The most popular is the Apriori
algorithm, which appears in 9 papers.
The Classification category has 42 different
algorithms, of which 5 can be regarded as the most
popular since each is involved in more than 2
research studies: ID3, C4.5, Neural Network, Naive Bayes,
and K-Nearest Neighbor.
Finally, in the clustering area, 7 out of 53 algorithms can be
noted as the most popular. The K-Means algorithm leads
comfortably with 12 studies, whereas Expectation
Maximization (EM) and P2P K-Means appear in only 5 and 3
research papers respectively.
G. Proposed Method Improvement for DDM
Around 30% of the reviewed studies
discuss DDM improvement. Most of the improvements
are made in the Association area, with 15 papers.
Classification and clustering have the same number of
improvement papers, with 8 studies each.
The list of new algorithms can be seen in Table A-3
(Appendix). In the Association category, 14 algorithms belong to
Frequent Pattern Mining, whereas Sequence Pattern Mining
has only 1 proposed algorithm, called Prioritized Sensitive
Patterns with Dynamic Blocking [28]. In 2010, Yu & Zhou
[29] attempted to improve the capability of Parallel FP-Tree
by proposing two novel approaches called Tidset-based Parallel
FP-Tree (TPFP) and Balanced Tidset-based Parallel FP-Tree
(BTPFP). In 2013, both of their proposed algorithms were
used in the research of Lin & Luo [36] to construct 4 novel
algorithms: Equal Working Set (EWS), Request on
Demand (ROD), Small Size Working Set (SSWS), and
Progressive Size Working Set (PSWS). Their
improvement can provide a fast and scalable mining service
for frequent pattern mining in many-task computing
environments.
Fig. 11. Statistic of methods involved
Fig. 12. Statistic of frequent methods used
TABLE 4
MOST USED DDM METHODS
Category        Algo Name                       Studies                                             Totals
Association     1. Apriori                      [14][15][29][30][31][32][33][20][34]                9
                2. Eclat                        [28][35]                                            2
                3. FP-Tree                      [36][37]                                            2
                4. TPFP                         [36][38]                                            2
                5. BTPFP                        [36][38]                                            2
Classification  1. ID3                          [18][39][40][41]                                    4
                2. C4.5                         [33][41][42][43][44][45][46][47][48]                9
                3. Neural Network               [49][50][51]                                        3
                4. Naive Bayes                  [17][26][52]                                        3
                5. K-Nearest Neighbor           [37][53][54]                                        3
Clustering      1. K-Means                      [37][42][26][53][54][55][56][57][58][15][59][60]    12
                2. P2P K-Means                  [56][61][15]                                        3
                3. Expectation Maximization (EM) [34][53][62][63][64]                               5
                4. DBSCAN                       [33][65]                                            2
                5. SOM                          [66][67]                                            2
                6. AutoClass                    [42][68]                                            2
                7. DPC                          [55][24]                                            2
In the Classification field, it is notable that 5 out
of the 8 proposed methods focus on security. The goal
of such methods is to avoid disclosing data beyond its source.
The researchers proposed hybrid algorithms that
combine a DDM classification method with a privacy
preserving technique. Vaidya et al. improved the security
level of the ID3 Decision Tree and Naive Bayes by adding
privacy preserving capability to both algorithms
[19][18]. In [18] they proposed secure classification
algorithms for vertically and horizontally partitioned data,
whereas [19] covers only vertically partitioned data. In 2013,
Sheen et al. [17] proposed a music-inspired algorithm called
Harmony Search Ensemble (HS_ENSEM). This method is
used for malware detection. An ensemble is constructed
from multiple heterogeneous classifiers running in parallel. To
obtain the pruned set, harmony search is used to choose the
best set of classifiers from the ensemble.
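The difference between horizontally and vertically partitioned data mentioned above can be illustrated with a small sketch. The toy records, the attribute names, and the two helper functions are hypothetical and serve only to show how the same table is split across sites in each case; the sketch contains no privacy preserving logic and is not the method of [18] or [19].

```python
# Toy table: each dict is one record; all values are hypothetical.
records = [
    {"id": 1, "age": 34, "income": 52000, "label": "yes"},
    {"id": 2, "age": 51, "income": 61000, "label": "no"},
    {"id": 3, "age": 27, "income": 39000, "label": "no"},
    {"id": 4, "age": 45, "income": 75000, "label": "yes"},
]


def horizontal_partition(rows, n_sites):
    """Each site holds complete records for a disjoint subset of the rows."""
    return [rows[i::n_sites] for i in range(n_sites)]


def vertical_partition(rows, attribute_groups, key="id"):
    """Each site holds every row, but only its own attributes plus a shared join key."""
    return [
        [{k: row[k] for k in (key, *attrs)} for row in rows]
        for attrs in attribute_groups
    ]


h_sites = horizontal_partition(records, n_sites=2)
v_sites = vertical_partition(records, [("age",), ("income", "label")])

print([[r["id"] for r in site] for site in h_sites])
print([sorted(site[0].keys()) for site in v_sites])
```

In the horizontal case every site sees all attributes of some records, whereas in the vertical case every site sees some attributes of all records, so any joint computation must be coordinated through the shared key.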
Among the 8 novel clustering algorithms, K-Means and Expectation
Maximization (EM) are the most popular
algorithms to improve. The original K-Means was improved
twice, yielding Sequential Sampling Spectral K-Means [30] and
Epidemic K-Means [16]. In 2012, Di Fatta et al. [16] proposed
a fully distributed K-Means algorithm (Epidemic K-Means),
claiming that it requires no global communication
and is intrinsically
fault tolerant. EM is also prominent here, as it was used and improved twice by
different researchers [31][32]. In 2005, basic EM was
combined with a Privacy Preserving (PP) technique by Merugu
and Ghosh [31] to enhance the security level of the EM method.
The researchers claimed that, based on their
experiments, PPEM can achieve high quality global clusters
with little loss of privacy. This technique is based on
building probabilistic models of the data at each local site and
then transmitting the model parameters to a central location.
In 2015, Loh & Yu
[13] proposed an improvement to another clustering algorithm. They
improved the efficiency of the DBSCAN algorithm
by proposing a novel technique called CudaSCAN. The
researchers explained that CudaSCAN has three phases:
(1) partitioning the entire dataset; (2) local
clustering within sub-regions in parallel; and (3) merging the local
clustering results. In their experiments, they claimed that
CudaSCAN outperformed CUDA-DClust (a previous
DBSCAN extension) by up to 163.6 times.
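The partition, local clustering, and merge pattern described for CudaSCAN can be sketched on the CPU as follows. This is not the algorithm of [13]: it is a simplified illustration under our own assumptions, namely one-dimensional points, no GPU, and a density rule in which two points are linked whenever they lie within eps of each other (so every point counts as a core point). The function names and the overlap (halo) scheme are hypothetical.

```python
from itertools import combinations


def partition(points, width, eps):
    """Phase 1: assign point indices to strips of the given width.

    Each strip also receives points within eps of its borders (a halo),
    so that links crossing a border are visible to at least one strip.
    """
    parts = {}
    for idx, x in enumerate(points):
        lo = int((x - eps) // width)
        hi = int((x + eps) // width)
        for strip in range(lo, hi + 1):
            parts.setdefault(strip, []).append(idx)
    return list(parts.values())


def local_clusters(points, indices, eps):
    """Phase 2: connected components of the eps-neighbourhood graph in one partition."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in combinations(indices, 2):
        if abs(points[a] - points[b]) <= eps:
            parent[find(a)] = find(b)

    groups = {}
    for i in indices:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def merge(cluster_lists):
    """Phase 3: merge local clusters that share at least one point index."""
    merged = []
    for cluster in (set(c) for clusters in cluster_lists for c in clusters):
        for other in [m for m in merged if m & cluster]:
            cluster |= other
            merged.remove(other)
        merged.append(cluster)
    return merged


def cluster(points, width, eps):
    parts = partition(points, width, eps)
    # The partitions are independent, so this loop could be mapped to parallel workers.
    locals_ = [local_clusters(points, part, eps) for part in parts]
    return merge(locals_)


points = [0.0, 0.4, 0.8, 1.2, 5.0, 5.3]
clusters = cluster(points, width=1.0, eps=0.5)
print(sorted(sorted(c) for c in clusters))
```

Because halo points appear in two neighbouring partitions, the merge phase only needs to join local clusters that share a point, which is what keeps the second phase free of communication between partitions.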
H. DDM Taxonomy
Based on the review outcome from the 85 high reputation
studies, a DDM taxonomy is proposed to map the current
state of DDM research. This taxonomy aims to give
future researchers a simple way to understand the
general DDM area, including its hot areas and the gaps
between topics in the DDM field. The taxonomy is developed
from the data collected during the review process. The
references supporting this taxonomy are provided in Table A-4
(Appendix).
Fig. 13. Numbers of paper discussing improvement of DDM algorithm
Fig. 14. DDM Taxonomy
First, DDM is divided into three categories:
DDM Problem, DDM Research, and Dataset. Under DDM
Problem, we classify the reviewed papers into 3 sub
categories: Association, Classification, and
Clustering. A DDM study can address one of these
or a combination of several of them.
DDM Research also has 3 sub categories:
Improvement, Implementation, and Parallelism. Only
Improvement has further sub categories: Efficiency,
Effectiveness (accuracy), and Security. Efficiency itself can be
explored further by dividing it into 3 groups: Speed up,
Resource & Cost, and Scalability. Effectiveness
emphasizes the accuracy of the algorithm, whereas in the
Security part, previous researchers focused only on the
implementation or improvement of privacy preserving in
the DDM field.
The Dataset category can be assessed by availability and
origin. Availability means whether the dataset is publicly
available or not. Origin means whether the researchers captured
the dataset from real data or generated it synthetically. The
complete DDM taxonomy is illustrated in Fig. 14.
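The taxonomy described in this section can also be written down as a nested data structure, which makes it easy to enumerate its branches. The sketch below is our own encoding of the categories named in the text; the dictionary layout and the helper function leaf_paths are illustrative assumptions, not part of the original taxonomy figure.

```python
# Nested encoding of the taxonomy; an empty dict marks a leaf category.
taxonomy = {
    "DDM Problem": {
        "Association": {},
        "Classification": {},
        "Clustering": {},
    },
    "DDM Research": {
        "Improvement": {
            "Efficiency": {
                "Speed up": {},
                "Resource & Cost": {},
                "Scalability": {},
            },
            "Effectiveness (accuracy)": {},
            "Security": {},
        },
        "Implementation": {},
        "Parallelism": {},
    },
    "Dataset": {
        "Availability": {"Public": {}, "Private": {}},
        "Origin": {"Real world": {}, "Synthetic": {}},
    },
}


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of the taxonomy as a tuple of labels."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


paths = list(leaf_paths(taxonomy))
for path in paths:
    print(" > ".join(path))
```

A reviewed paper can then be tagged with one or more of these paths, for example a clustering paper that proposes a faster algorithm and evaluates it on a public, real world dataset.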
IV. CONCLUSION
This survey was conducted using the systematic literature review (SLR)
methodology. The aim of an SLR is to answer
specific research questions by identifying, assessing,
analyzing, and interpreting all available research evidence
related to the topic being studied, using a reliable,
rigorous, and auditable methodology [5][9][11].
The contribution of the SLR in this study is mainly to identify
and analyze the trends, datasets, and methods used in DDM
research between 2000 and 2015. Following the SLR
methodology, we collected more than 400 high quality journal
articles from three major digital libraries (IEEE,
ScienceDirect, and ACM), and finally selected 85 papers that
discuss DDM directly. Seven research
questions were constructed, explored, and answered in this
study.
Based on the analysis of the selected studies, the findings
can be grouped into 3 major focuses: DDM Research
Opportunities, DDM Datasets, and DDM Methods. Regarding
DDM Research Opportunities, the DDM area is still
wide open to researchers around the world, since the
number of improvement studies is not high:
only 26 out of 85 DDM journal papers, or about 30.6% of
the selected studies. In addition, the gap between
improvements in efficiency and effectiveness is large.
There are 20 papers proposing a new method for efficiency,
but only 2 papers emphasizing the enhancement of
effectiveness. The effectiveness of DDM
techniques, especially their accuracy, can therefore be an
interesting topic for future DDM research, so that the quality
of a new method is judged not merely by speed or
efficiency but by accuracy as well. The improvement of
security should not be underestimated either, since
privacy preservation will remain a hot issue to be discussed
and improved.
The majority of DDM implementations are in the computer
science field. According to this survey, only a few papers discuss
DDM in areas beyond computer science:
2 papers in the medical field and 1 paper in the
transportation field. This means that the area for DDM research
is wide and open. Many research fields have not yet
been influenced by DDM technology, or
they already use DDM technology but their
research and innovation have not been published yet. This is
a good opportunity for researchers to carry out such
research and publish it.
On the other hand, we found that the largest share of
papers in the DDM field use private datasets. The
proportions of private and public datasets in our
survey are almost the same, but the number of papers using private
datasets is slightly
higher than the number using public ones: 42% of DDM papers
use private datasets and 39% use public
data, while 15% of the studies use
both private and public datasets. The
dominance of private datasets is regrettable.
It can be a critical problem for the continuity of DDM research,
since a proposed method cannot be compared with
existing methods if the dataset is private. It is then impossible to
verify whether the proposed method really performs
better than the existing one.
Regarding the methods used in the DDM field, this paper
divides them into 2 topic areas: the methods
most used in DDM research and the new methods proposed
by DDM researchers. DDM methods basically come from
the data mining paradigm. There are three major groups of
DDM methods: Association, Classification, and
Clustering. In the Association area, the Apriori algorithm is the most
used method, with 9 papers. The C4.5 Decision Tree
is the most used method in the Classification area, with 9
papers. For clustering, K-Means takes
first place, with 12 studies conducting experiments with this
algorithm.
Finally, improvements of algorithms in the DDM field from
January 2000 to May 2015 are discussed in 26 studies
(downloaded from ACM, IEEE, and ScienceDirect).
Association algorithms are the top choice for improvement,
with 15 studies discussing improvements in
Association Rule Mining. 8 improvements concern
classification algorithms, where security is dominant:
5 out of the 8 studies address the security problem in
classification algorithms. 8 novel approaches are also proposed
for clustering algorithms. In this area, enhancing k-means
and EM appears to be the researchers' favorite.
Last but not least, a DDM taxonomy has been proposed
to help future researchers gain a simpler and
better understanding of the DDM field. The taxonomy
maps the DDM area so that researchers
can easily find the specific area they want to focus on.
REFERENCES
[1] J. Luo, M. Wang, J. Hu, and Z. Shi, “Distributed data mining on Agent
Grid: Issues, platform and development toolkit,” Futur. Gener. Comput.
Syst., vol. 23, no. 1, pp. 61–68, 2007.
[2] F. Min, H. He, Y. Qian, and W. Zhu, “Test-cost-sensitive attribute
reduction,” Inf. Sci. (Ny)., vol. 181, no. 22, pp. 4928–4942, 2011.
[3] R. Jin and G. Agrawal, “A methodology for detailed performance
modeling of reduction computations on SMP machines,” Perform. Eval.,
vol. 60, no. 1–4, pp. 73–105, 2005.
[4] S. Mukherjee and H. Kargupta, “Distributed probabilistic inferencing in
sensor networks using variational approximation,” J. Parallel Distrib.
Comput., vol. 68, no. 1, pp. 78–92, 2008.
[5] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, and M. Khalil,
“Lessons from applying the systematic literature review process within
the software engineering domain,” J. Syst. Softw., vol. 80, no. 4, pp.
571–583, 2007.
[6] S. Beecham, N. Baddoo, T. Hall, H. Robinson, and H. Sharp,
“Motivation in Software Engineering: A systematic literature review,”
Inf. Softw. Technol., vol. 50, no. 9–10, pp. 860–878, 2008.
[7] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, and
S. Linkman, “Systematic literature reviews in software engineering - A
systematic literature review,” Inf. Softw. Technol., vol. 51, no. 1,
pp. 7–15, 2009.
[8] R. S. Wahono, “A Systematic Literature Review of Software Defect
Prediction: Research Trends, Datasets, Methods and Frameworks,” J.
Softw. Eng., vol. 1, no. 1, pp. 1–16, 2015.
[9] R. Latif, H. Abbas, S. Assar, and Q. Ali, “Cloud Computing Risk
Assessment: A Systematic Literature Review,” in Future Information
Technology, 2013, pp. 285–295.
[10] I. Polato, R. Ré, A. Goldman, and F. Kon, “A comprehensive view of
Hadoop research—A systematic literature review,” J. Netw. Comput.
Appl., vol. 46, pp. 1–25, 2014.
[11] B. Kitchenham and S. Charters, “Guidelines for performing Systematic
Literature Reviews in Software Engineering,” Engineering, vol. 2, p.
1051, 2007.
[12] I. Hydara, A. B. Sultan, H. Zulzalil, and N. Admodisastro, “Current state
of research on cross-site scripting (XSS) – A systematic literature
review,” Inf. Softw. Technol., vol. 58, pp. 170–186, 2015.
[13] W.-K. Loh and H. Yu, “Fast density-based clustering through dataset
partition using graphics processing units,” Inf. Sci. (Ny)., vol. 308, no.
2010, pp. 94–112, 2015.
[14] C. Bohm, R. Noll, C. Plant, and B. Wackersreuther, “Density-based
Clustering using Graphics Processors,” in Proceeding of the 18th ACM
conference on Information and knowledge management - CIKM ’09,
2009, pp. 661–670.
[15] A. Cuzzocrea, C. K. S. Leung, and R. K. Mackinnon, “Mining
constrained frequent itemsets from distributed uncertain data,” Futur.
Gener. Comput. Syst., vol. 37, pp. 117–126, 2014.
[16] G. Di Fatta, F. Blasa, S. Cafiero, and G. Fortino, “Fault tolerant
decentralised K-Means clustering for asynchronous large-scale
networks,” J. Parallel Distrib. Comput., vol. 73, no. 3, pp. 317–329,
2013.
[17] S. Sheen, R. Anitha, and P. Sirisha, “Malware detection by pruning of
parallel ensembles using harmony search,” Pattern Recognit. Lett., vol.
34, no. 14, pp. 1679–1686, 2013.
[18] J. Vaidya, M. Kantarcioglu, and C. Clifton, “Privacy-preserving Naïve
Bayes classification,” VLDB J., vol. 17, no. 4, pp. 879–898, 2008.
[19] J. Vaidya, C. Clifton, M. Kantarcioglu, and A. S. Patterson, “Privacy-
preserving decision trees over vertically partitioned data,” ACM Trans.
Knowl. Discov. Data, vol. 2, no. 3, pp. 1–27, 2008.
[20] E. Magkos, M. Maragoudakis, V. Chrissikopoulos, and S. Gritzalis,
“Accurate and large-scale privacy-preserving data mining using the
election paradigm,” Data Knowl. Eng., vol. 68, no. 11, pp. 1224–1236,
2009.
[21] X. Yi and Y. Zhang, “Privacy-preserving distributed association rule
mining via semi-trusted mixer,” Data Knowl. Eng., vol. 63, no. 2, pp.
550–567, 2007.
[22] S. Samet and A. Miri, “Privacy-preserving back-propagation and
extreme learning machine algorithms,” Data Knowl. Eng., vol. 79–80,
pp. 40–61, 2012.
[23] X. Zheng and S. Wang, “Study on the Method of Road Transport
Management Information Data Mining based on Pruning Eclat
Algorithm and MapReduce,” Procedia - Soc. Behav. Sci., vol. 138, pp.
757–766, 2014.
[24] T. Rausch, A. Thomas, N. J. Camp, L. A. Cannon-Albright, and J. C.
Facelli, “A parallel genetic algorithm to discover patterns in genetic
markers that indicate predisposition to multifactorial disease,” Comput.
Biol. Med., vol. 38, no. 7, pp. 826–836, 2008.
[25] R. Olejnik, T. F. Fortis, and B. Toursel, “Webservices oriented data
mining in knowledge architecture,” Futur. Gener. Comput. Syst., vol. 25,
no. 4, pp. 436–443, 2009.
[26] J. A. Villar, F. J. Andújar, J. L. Sánchez, F. J. Alfaro, J. A. Gámez, and J.
Duato, “Obtaining the optimal configuration of high-radix Combined
switches,” J. Parallel Distrib. Comput., vol. 73, no. 9, pp. 1239–1250,
2013.
[27] K. Ericson and S. Pallickara, “On the performance of high dimensional
data clustering and classification algorithms,” Futur. Gener. Comput.
Syst., vol. 29, no. 4, pp. 1024–1034, 2013.
[28] B. N. Keshavamurthy, D. Toshniwal, and B. K. Eshwar, “Hiding co-
occurring prioritized sensitive patterns over distributed progressive
sequential data streams,” J. Netw. Comput. Appl., vol. 35, no. 3, pp.
1116–1129, 2012.
[29] K.-M. Yu and J. Zhou, “Parallel TID-based frequent pattern mining
algorithm on a PC Cluster and grid computing system,” Expert Syst.
Appl., vol. 37, no. 3, pp. 2486–2494, 2010.
[30] D. Mavroeidis and P. Magdalinos, “A Sequential Sampling Framework
for Spectral k-Means Based on Efficient Bootstrap Accuracy
Estimations: Application to Distributed Clustering,” ACM Trans.
Knowl. Discov. Data, vol. 6, no. 2, pp. 1–37, 2012.
[31] S. Merugu and J. Ghosh, “A privacy-sensitive approach to distributed
clustering,” Pattern Recognit. Lett., vol. 26, no. 4, pp. 399–410, 2005.
[32] X. Zhang, W. K. Cheung, and C. H. Li, “Learning latent variable models
from distributed and abstracted data,” Inf. Sci. (Ny)., vol. 181, no. 14,
pp. 2964–2988, 2011.
[33] E.-H. (Sam) Han, G. Karypis, and V. Kumar, “Scalable parallel data
mining for association rules,” IEEE Trans. Knowl. Data Eng., vol. 12,
no. 3, pp. 337–352, 2000.
[34] V. Fiolet and B. Toursel, “A clustering method to distribute a database
on a grid,” Futur. Gener. Comput. Syst., vol. 23, no. 8, pp. 997–1002,
2007.
[35] K. W. Lin and S. Chung, “A fast and resource efficient mining algorithm
for discovering frequent patterns in distributed computing
environments,” Futur. Gener. Comput. Syst., vol. 52, pp. 49–58, 2015.
[36] K. W. Lin and Y. C. Luo, “Efficient Algorithms for Frequent Pattern
Mining in Many-Task Computing Environments,” Knowledge-Based
Syst., vol. 49, pp. 620–623, 2013.
[37] D. Nguyen, B. Vo, and B. Le, “Efficient strategies for parallel mining
class association rules,” Expert Syst. Appl., vol. 41, no. 10,
pp. 4716–4729, 2014.
[38] M. Rodríguez, D. M. Escalante, and A. Peregrín, “Efficient Distributed
Genetic Algorithm for Rule extraction,” Appl. Soft Comput., vol. 11, no.
1, pp. 733–743, 2011.
[39] L. Vu and G. Alaghband, “Novel parallel method for association rule
mining on multi-core shared memory systems,” Parallel Comput., vol.
40, no. 10, pp. 768–785, 2014.
[40] Z. Farzanyar, M. Kangavari, and N. Cercone, “P2P-FISM: Mining
(recently) frequent item sets from distributed data streams over P2P
network,” Inf. Process. Lett., vol. 113, no. 19–21, pp. 793–798, 2013.
[41] M. M. Rashid, I. Gondal, and J. Kamruzzaman, “A Technique for
Parallel Share-frequent Sensor Pattern Mining from Wireless Sensor
Networks,” Procedia Comput. Sci., vol. 29, pp. 124–133, 2014.
[42] A. Faro, D. Giordano, and F. Maiorana, “Mining massive datasets by an
unsupervised parallel clustering on a GRID: Novel algorithms and case
study,” Futur. Gener. Comput. Syst., vol. 27, no. 6, pp. 711–724, 2011.
[43] T. Tassa and E. Gudes, “Secure distributed computation of anonymized
views of shared databases,” ACM Trans. Database Syst., vol. 37, no. 2,
pp. 1–43, 2012.
[44] J. C. da Silva and M. Klusch, “Inference in distributed data clustering,”
Eng. Appl. Artif. Intell., vol. 19, no. 4, pp. 363–369, 2006.
[45] Y. Zhang, F. Zhang, Z. Jin, and J. D. Bakos, “An FPGA-Based
Accelerator for Frequent Itemset Mining,” ACM Trans. Reconfigurable
Technol. Syst., vol. 6, no. 1, pp. 1–17, 2013.
[46] M. Coppola and M. Vanneschi, “High-performance data mining with
skeleton-based structured parallel programming,” Parallel Comput., vol.
28, no. 5, pp. 793–813, 2002.
[47] M. S. Pérez, A. Sánchez, V. Robles, P. Herrero, and J. M. Peña, “Design
and implementation of a data mining grid-aware architecture,” Futur.
Gener. Comput. Syst., vol. 23, no. 1, pp. 42–47, 2007.
[48] K.-M. Yu, J. Zhou, T. P. Hong, and J. L. Zhou, “A load-balanced
distributed parallel mining algorithm,” Expert Syst. Appl., vol. 37, no. 3,
pp. 2459–2464, 2010.
[49] C. Ghedini Ralha and C. V. Sarmento Silva, “A multi-agent data mining
system for cartel detection in Brazilian government procurement,”
Expert Syst. Appl., vol. 39, no. 14, pp. 11642–11656, 2012.
[50] G. Wu, H. Zhang, M. Qiu, Z. Ming, J. Li, and X. Qin, “A decentralized
approach for mining event correlations in distributed system
monitoring,” J. Parallel Distrib. Comput., vol. 73, no. 3, pp. 330–340,
2013.
[51] D. Jin and S. G. Ziavras, “A super-programming approach for mining
association rules in parallel on PC clusters,” IEEE Trans. Parallel
Distrib. Syst., vol. 15, no. 9, pp. 783–794, 2004.
[52] R. Jin, G. Yang, and G. Agrawal, “Shared Memory Parallelization of Data
Mining Algorithms,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 1, pp.
71–89, 2005.
[53] G. Lee, W. Yang, and J. M. Lee, “A parallel algorithm for mining
multiple partial periodic patterns,” Inf. Sci. (Ny)., vol. 176, no. 24, pp.
3591–3609, 2006.
[54] D. Souliou, A. Pagourtzis, N. Drosinos, and P. Tsanakas, “Computing
frequent itemsets in parallel using partial support trees,” J. Syst. Softw.,
vol. 79, no. 12, pp. 1735–1743, 2006.
[55] M. K. Sohrabi and A. A. Barforoush, “Parallel frequent itemset mining
using systolic arrays,” Knowledge-Based Syst., vol. 37, pp. 462–471,
2013.
[56] T. P. Hong, Y. C. Lee, and M. T. Wu, “An effective parallel approach
for genetic-fuzzy data mining,” Expert Syst. Appl., vol. 41, no. 2, pp.
655–662, 2014.
[57] M. Zaki, “Parallel Sequence Mining on Shared-Memory Machines,” J.
Parallel Distrib. Comput., vol. 61, no. 3, pp. 401–426, 2001.
[58] V. Guralnik and G. Karypis, “Parallel tree-projection-based sequence
mining algorithms,” Parallel Comput., vol. 30, no. 4, pp. 443–472, 2004.
[59] C.-H. Wu, C.-C. Lai, and Y.-C. Lo, “An empirical study on mining
sequential patterns in a grid computing environment,” Expert Syst.
Appl., vol. 39, no. 5, pp. 5748–5757, 2012.
[60] J. Secretan, M. Georgiopoulos, A. Koufakou, and K. Cardona, “APHID:
An architecture for private, high-performance integrated data mining,”
Futur. Gener. Comput. Syst., vol. 26, no. 7, pp. 891–904, 2010.
[61] F. Emekci, O. D. Sahin, D. Agrawal, and A. El Abbadi, “Privacy
preserving decision tree learning over multiple parties,” Data Knowl.
Eng., vol. 63, no. 2, pp. 348–361, 2007.
[62] F. Saqib, A. Dutta, J. Plusquellic, P. Ortiz, and M. S. Pattichis,
“Pipelined Decision Tree Classification Accelerator Implementation in
FPGA (DT-CAIF),” IEEE Trans. Comput., vol. 64, no. 1, pp. 1–1, 2013.
[63] J. Vaidya, B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi, “A Random
Decision Tree Framework for Privacy-Preserving Data Mining,” IEEE
Trans. Dependable Secur. Comput., vol. 11, no. 5, pp. 399–411, 2014.
[64] J. P. Bradford and J. A. B. Fortes, “Characterization and Parallelization
of Decision-Tree Induction,” J. Parallel Distrib. Comput., vol. 61, no. 3,
pp. 322–349, 2001.
[65] V. Furtado, F. Flavio de Souza, and W. Cirne, “Promoting performance
and separation of concerns for data mining applications on the grid,”
Futur. Gener. Comput. Syst., vol. 23, no. 1, pp. 100–106, 2007.
[66] O. T. Yildiz and O. Dikmen, “Parallel univariate decision trees,” Pattern
Recognit. Lett., vol. 28, no. 7, pp. 825–832, 2007.
[67] I. Czarnowski and P. Jędrzejowicz, “An agent-based framework for
distributed learning,” Eng. Appl. Artif. Intell., vol. 24, no. 1, pp. 93–102,
2011.
[68] M. Cannataro, A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio,
“Distributed data mining on grids: Services, tools, and applications,”
IEEE Trans. Syst. Man, Cybern. Part B Cybern., vol. 34, no. 6, pp.
2451–2465, 2004.
[69] N. Garcia-Pedrajas, J. Perez-Rodriguez, and A. de Haro-Garcia,
“OligoIS: Scalable instance selection for class-imbalanced data sets,”
IEEE Trans. Cybern., vol. 43, no. 1, pp. 332–346, 2013.
[70] N. Mohammed, D. Alhadidi, B. C. M. Fung, and M. Debbabi, “Secure
two-party differentially private data release for vertically partitioned
data,” IEEE Trans. Dependable Secur. Comput., vol. 11, no. 1,
pp. 59–71, 2014.
[71] H. Senger, E. R. Hruschka, F. A. B. Silva, L. M. Sato, C. P. Bianchini,
and B. F. Jerosch, “Exploiting idle cycles to execute data mining
applications on clusters of PCs,” J. Syst. Softw., vol. 80, no. 5,
pp. 778–790, 2007.
[72] D. B. Skillicorn and S. M. McConnell, “Distributed prediction from
vertically partitioned data,” J. Parallel Distrib. Comput., vol. 68, no. 1,
pp. 16–36, 2008.
[73] J. Ouyang, N. Patel, and I. Sethi, “Induction of multiclass multifeature
split decision trees from distributed data,” Pattern Recognit., vol. 42, no.
9, pp. 1786–1794, 2009.
[74] I. Triguero, D. Peralta, J. Bacardit, S. Garcia, and F. Herrera, “MRPR:
A MapReduce solution for prototype reduction in big data
classification,” Neurocomputing, vol. 150, pp. 331–345, 2014.
[75] Y. Kokkinos and K. G. Margaritis, “Confidence ratio affinity
propagation in ensemble selection of neural network classifiers for
distributed privacy-preserving data mining,” Neurocomputing, vol. 150,
pp. 513–528, 2015.
[76] L. Glimcher, R. Jin, and G. Agrawal, “Middleware for data mining
applications on clusters and grids,” J. Parallel Distrib. Comput., vol. 68,
no. 1, pp. 37–53, 2008.
[77] S. Mukherjee, Z. Chen, and A. Gangopadhyay, “A privacy-preserving
technique for Euclidean distance-based mining algorithms using Fourier-
related transforms,” VLDB J., vol. 15, no. 4, pp. 293–315, 2006.
[78] V. Stankovski, M. Swain, V. Kravtsov, T. Niessen, D. Wegener, J.
Kindermann, and W. Dubitzky, “Grid-enabling data mining applications
with DataMiningGrid: An architectural perspective,” Futur. Gener.
Comput. Syst., vol. 24, no. 4, pp. 259–279, 2008.
[79] Y. Lu, V. Roychowdhury, and L. Vandenberghe, “Distributed Parallel
Support Vector Machines in Strongly Connected Networks,” IEEE
Trans. Neural Networks, vol. 19, no. 7, pp. 1–12, 2008.
[80] C. Anglano and M. Botta, “NOW G-net: Learning classification
programs on networks of workstations,” IEEE Trans. Evol. Comput.,
vol. 6, no. 5, pp. 463–480, 2002.
[81] F. Stahl and M. Bramer, “Computationally efficient induction of
classification rules with the PMCRI and J-PMCRI frameworks,”
Knowledge-Based Syst., vol. 35, pp. 49–63, 2012.
[82] A. Cano, J. L. Olmo, and S. Ventura, “Parallel multi-objective Ant
Programming for classification using GPUs,” J. Parallel Distrib.
Comput., vol. 73, no. 6, pp. 713–728, 2013.
[83] S. M. Dima, C. Panagiotou, D. Tsitsipis, C. Antonopoulos, J. Gialelis,
and S. Koubias, “Performance evaluation of a WSN system for
distributed event detection using fuzzy logic,” Ad Hoc Networks, vol.
23, pp. 87–108, 2014.
[84] T. A. Engel, A. S. Charao, M. Kirsch-Pinheiro, and L.-A. Steffenel,
“Performance improvement of data mining in weka through GPU
acceleration,” Procedia Comput. Sci., vol. 32, pp. 93–100, 2014.
[85] Z. Qi, V. Alexandrov, Y. Shi, and Y. Tian, “Parallel regularized
multiple-criteria linear programming,” Procedia Comput. Sci., vol. 31,
no. 0, pp. 58–65, 2014.
[86] C. L. Huang and J. F. Dun, “A distributed PSO-SVM hybrid system with
feature selection and parameter optimization,” Appl. Soft Comput., vol.
8, no. 4, pp. 1381–1391, 2008.
[87] Y. Zhang, F. Mueller, X. Cui, and T. Potok, “Data-intensive document
clustering on graphics processing unit (GPU) clusters,” J. Parallel
Distrib. Comput., vol. 71, no. 2, pp. 211–224, 2011.
[88] T. Goodall, D. Pettinger, and G. Di Fatta, “Non-uniform data
distribution for communication-efficient parallel clustering,” J. Comput.
Sci., vol. 4, no. 6, pp. 489–495, 2013.
[89] T. Gunarathne, B. Zhang, T. L. Wu, and J. Qiu, “Scalable parallel
computing on clouds using Twister4Azure iterative MapReduce,” Futur.
Gener. Comput. Syst., vol. 29, no. 4, pp. 1035–1048, 2013.
[90] W. Cerroni, G. Moro, R. Pasolini, and M. Ramilli, “Decentralized
detection of network attacks through P2P data clustering of SNMP
data,” Comput. Secur., vol. 52, pp. 1–16, 2015.
[91] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and
S. Datta, “Clustering distributed data streams in peer-to-peer
environments,” Inf. Sci. (Ny)., vol. 176, no. 14, pp. 1952–1985, 2006.
[92] C. Pizzuti and D. Talia, “P-AutoClass: Scalable parallel clustering for
mining large data sets,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 3,
pp. 629–641, 2003.
[93] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, “DBCURE-MR: An
efficient density-based clustering algorithm for large data using
MapReduce,” Inf. Syst., vol. 42, pp. 15–35, 2014.
[94] S. T. Li, “A web-aware interoperable data mining system,” Expert Syst.
Appl., vol. 22, no. 2, pp. 135–146, 2002.
[95] A. Congiusta, D. Talia, and P. Trunfio, “Service-oriented middleware
for distributed data mining on the grid,” J. Parallel Distrib. Comput., vol.
68, no. 1, pp. 3–15, 2008.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 5, May 2016
21 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
[96] Z. Feng, B. Zhou, and J. Shen, “A parallel hierarchical clustering algorithm for PCs cluster system,” Neurocomputing, vol. 70, no. 4–6, pp. 809–818, 2007.
[97] M. C. Naldi and R. J. G. B. Campello, “Evolutionary k-means for distributed data sets,” Neurocomputing, vol. 127, pp. 30–42, 2014.
Fauzi Adi Rafrastara received his Bachelor's and Master's degrees in Computer Science from Dian Nuswantoro University (2009) and Technical University of Malaysia Malacca (2011), respectively. He is currently pursuing a Ph.D. degree at the School of Computer Science and Engineering, South China University of Technology. He has published 8 books in Indonesia and Malaysia, and several papers in international conferences and journals. His research interests include data processing, multimedia, and information security. He is a member of TheIRED, the IEEE, and the IEEE Computer Society.
Qi Deyu was born in Helin County, Inner Mongolia, in October 1959. He holds a bachelor's degree in science, a master's degree in engineering, and a doctoral degree in engineering. He currently serves at South China University of Technology (SCUT) as a professor, tutor, leader of the academic team “Advanced Computing Architecture” of the Xinghua engineering project, and director of the Computer Systems Research Institute of SCUT. His research areas include computer architecture, distributed systems, and computer security, among others.
APPENDIX
TABLE A-1
STATISTIC OF PAPERS WITH RESEARCH IMPROVEMENT
Improvement Focus IEEE ACM SD
Efficiency
Speedup [33] [15][34][35][36][13][20][37][38][39][29][32]
Resource & Cost [30] [31][40][35][41][39]
Scalability [40]
Effectiveness Accuracy [16][42][17][32]
Security Privacy Preserving [43][19][18] [44][28][20][31][22][21]
TABLE A-2
THE LIST OF METHODS USED IN DDM
Category Methods Studies Methods Studies
Association
Eclat [39] [45] Ye & Chiang Algo [48]
Apriori [46][47][21][48][49][50][39][33][51] EDMA [48]
Apriori + PP [21] P2P FISM [40]
Apriori with IDD + HD [33] Majority Rule [40]
U-Apriori [15] EWS [36]
UF-Growth [15] ROD [36]
FP-Growth [39] SSWS [36]
FP-Tree [29] [52] PSWS [36][35]
TPFP [29][36] SABMA [55]
BTPFP [29][36] PP-Tree [55]
TDM [53] LFP-Tree [55]
DPM [53] DFP [55]
Count Distribution [54] Genetic-Fuzzy [56]
PPS [54] PMCAR [37]
Parallel SPADE [57] PShrFSP-Tree [41]
PISA [28] ShrIO-Tree [41]
TTPC [28] ShaFEM [39]
STPF [58] FP-Array [39]
DTPF [58] Parallel NEclat [23]
CAR-Miner [37] FLR-Mining [35]
GSP [59] DAN-Mining [35]
TABLE A-2 (CONTINUED)
THE LIST OF METHODS USED IN DDM
Category Methods Studies Methods Studies
Classification
Naive Bayes [60][27] [18] Back Propagation [22]
Naive Bayes + PP [18] PPBP [22]
Complementary Bayes [27] ELM [22]
ID3 [61] [19] [62][63] PPELM [22]
ID3 + PP [19] PRISM [81]
C4.5 Decision Tree [64][46][65][66][67] [68][69][70][62] MOGBAP [82]
C4.5 with Heuristic [65] Bagging [17]
C5.0 Decision Tree [68] Adaboost [17]
J4.8 Decision Tree [71][72] Random Subspace [17]
Decision Tree Construction [52] Stacking [17]
Random Decision Tree [63] Hill Climbing [26]
DMDT [73] Fuzzy Inference System [83]
Neural Network [72][74][75] M5P [84]
k-nearest neighbor [76][77][52] PRMCLP [85]
SVM [78][79] LVQ3 [74]
NOW G-Net [80] FCNN [74]
PART [71] DROP3 [74]
LDT [66] RSP3 [74]
Genetic Algorithm [24][26] PSO + SVM [86]
Random Forest (RF) [20] [17] EDGAR [38]
PPRF (Random Forest) [20] REGAL [38]
Clustering
k-means [34][76][87][16][27][88][89][90][30][77][68][52] Sequential Sampling Spectral k-means [30]
Sequential Algorithm [43] CDC –sl [97]
Seq. Algorithm+PP [43] CDC –sl (U) [97]
Mondrian [43] CDC –sl SS [98]
Hilbert-curve [43] CDC –sl VRC [98]
HCA [26] CDC –al [97]
P2P k-means [91][16] [30] CDC –al (U) [97]
Spectral k-means [30] CDC –al SS [98]
DBCURE-MR [93] CDC –al VRC [98]
Epidemic k-means [16] CDC –FEAC [97]
Random P2P k-means [16] CDC –FEAC (U) [97]
Fuzzy k-means [27] CDC –FEAC SS [98]
SSDSC [30] CDC –FEAC VRC [98]
Spectral Clustering [30] CDC –FEAC (10g) [97]
Intelligent Miner [68] CDC –FEAC (10g) (U) [97]
AutoClass [68][92] CDC –FEAC (10g) SS [98]
DBSCAN [46][93] CDC –FEAC (10g) VRC [98]
SOM [94][42] DF –EAC [97]
WR PSOM [42] DF –EAC SS [98]
EM [31][95][76][32][49] DF –EAC VRC [98]
KDEC-S [44] DF –EAC (P) SS [98]
PARC [96] DF –EAC VRC (P) [98]
Agglomerative Methods [34] DBSCAN+CudaSCAN [13]
DPC [34][25] DBSCAN+Cuda_DClust [13]
GMM [32] DBSCAN-GRID-MR [93]
Dirichlet [27] DBCURE-GRID-MR [93]
LDA [27]
TABLE A-3
THE NEW OR IMPROVED ALGORITHM PROPOSED IN DDM STUDIES
Categories New Methods Papers New Methods Papers
Association
Apriori with IDD+HD [27] Privacy Preserving Distributed Association Rule Mining [20]
Equal Working Set (EWS) [38] ShaFEM [28]
Request On Demand (ROD) [38] Tidset-based Parallel FP-Tree (TPFP) [36]
Small Size Working Set (SSWS) [38] Balanced Tidset-based Parallel FP-Tree (BTPFP) [36]
Progressive Size Working Set (PSWS) [38] Tree-based DUFIM [14]
PMCAR [71] PShrFSP [74]
P2P-FISM [73] FLR-Mining [70]
Prioritized Sensitive Patterns with Dynamic Blocking [69]
Classification
PPID3 [18] PPELM [21]
PPNB [17] PPRF [19]
EDGAR [72] WR PSOM [67]
PPBP [21] Harmony Search Ensemble (HS_ENSEM) [16]
Clustering
Sequential sampling spectral k-means [56] Advanced EM [64]
PPSA [75] Epidemic K-Means [15]
KDEC-S (Security) [76] Distributed Progressive Clustering [55]
Advanced EM with Privacy Preserving [62] CudaSCAN [13]
TABLE A-5
SCIMAGO JOURNAL RANK OF SELECTED STUDIES
No. Journal’s Name Studies SJR (2014) Q Category
1. ACM Transactions on Database Systems 1 1,729 Q1 (Information Systems)
2. ACM Transactions on Knowledge Discovery from Data 2 2,112 Q1 (Computer Science (Miscellaneous))
3. ACM Transactions on Reconfigurable Technology and Systems 1 0,401 Q2 (Computer Science (Miscellaneous))
4. IEEE Transactions on Computers 1 1,293 Q1 (Hardware and Architecture; Software; Theoretical Computer Science); Q2 (Computational Theory and Mathematics)
5. IEEE Transactions on Cybernetics 1 1,560 Q1 (Software; Computer Science Applications; Human-Computer Interaction; Information Systems; Control and Systems Engineering; Electrical and Electronic Engineering)
6. IEEE Transactions on Dependable and Secure Computing 2 1,874 Q1 (Electrical and Electronic Engineering)
7. IEEE Transactions on Evolutionary Computation 1 4,407 Q1 (Computational Theory and Mathematics; Software; Theoretical Computer Science)
8. IEEE Transactions on Knowledge and Data Engineering 3 3,023 Q1 (Computational Theory and Mathematics; Computer Science Applications; Information Systems)
9. IEEE Transactions on Neural Networks 1 -
10. IEEE Transactions on Parallel and Distributed Systems 1 2,017 Q1 (Computational Theory and Mathematics; Hardware and Architecture; Signal Processing)
11. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 1 3,280 Q1 (Human-Computer Interaction; Information Systems; Software; Computer Science Applications; Electrical and Electronic Engineering; Medicine (Miscellaneous); Control and Systems Engineering)
12. Ad Hoc Networks 1 1,197 Q1 (Computer Networks and Communications; Hardware and Architecture); Q2 (Software)
13. Applied Soft Computing 2 2,220 Q1 (Software)
14. Computers and Security 1 1,051 Q1 (Computer Science (Miscellaneous); Law)
15. Computers in Biology and Medicine 1 0,474 Q2 (Computer Science Applications; Health Informatics)
16. Data and Knowledge Engineering 4 1,181 Q1 (Information Systems and Management)
17. Engineering Applications of Artificial Intelligence 2 1,525 Q1 (Artificial Intelligence; Control and Systems Engineering; Electrical and Electronic Engineering)
18. Expert Systems with Applications 7 1,996 Q1 (Artificial Intelligence; Computer Science Applications; Engineering (Miscellaneous))
19. Future Generation Computer Systems 11 2,164 Q1 (Computer Networks and Communications; Hardware and Architecture; Software)
20. Information Processing Letters 1 0,904 Q2 (Computer Science Applications; Information Systems; Signal Processing; Theoretical Computer Science)
21. Information Sciences 4 3,286 Q1 (Theoretical Computer Science; Computer Science Applications; Artificial Intelligence; Software; Information Systems and Management; Control and Systems Engineering)
22. Information Systems 1 1,867 Q1 (Hardware and Architecture; Information Systems; Software)
23. Journal of Computational Science 1 0,848 Q1 (Computer Science (Miscellaneous); Modeling and Simulation; Theoretical Computer Science)
24. Journal of Network and Computer Applications 1 1,537 Q1 (Computer Networks and Communications; Computer Science Applications; Hardware and Architecture)
25. Journal of Parallel and Distributed Computing 10 1,093 Q1 (Hardware and Architecture; Computer Networks and Communications); Q2 (Software; Artificial Intelligence; Theoretical Computer Science)
TABLE A-5 (CONTINUED)
SCIMAGO JOURNAL RANK OF SELECTED STUDIES
No. Journal’s Name Studies SJR (2014) Q Category
26. Journal of Systems and Software 2 1,381 Q1 (Hardware and Architecture; Information Systems; Software)
27. Knowledge-Based Systems 3 2,190 Q1 (Artificial Intelligence; Information Systems and Management; Management Information Systems; Software)
28. Neurocomputing 5 1,211 Q1 (Computer Science Applications); Q2 (Artificial Intelligence; Cognitive Neuroscience)
29. Parallel Computing 3 1,232 Q1 (Software; Theoretical Computer Science; Computer Networks and Communications; Hardware and Architecture; Computer Graphics and Computer-Aided Design); Q2 (Artificial Intelligence)
30. Pattern Recognition 1 2,477 Q1 (Artificial Intelligence; Computer Vision and Pattern Recognition; Signal Processing; Software)
31. Pattern Recognition Letters 3 1,294 Q1 (Computer Vision and Pattern Recognition; Signal Processing; Software); Q2 (Artificial Intelligence)
32. Procedia - Social and Behavioral Sciences 1 0,156 -
33. Procedia Computer Science 3 - -
34. VLDB Journal 2 2,558 Q1 (Hardware and Architecture; Information Systems)
Total 85