Big data is a term for massive data sets whose large size and varied, complex structure make them difficult to store, analyze, and visualize for further processing or results. Big data includes data from email, documents, pictures, audio, video files, and other sources that do not fit into a relational database; this unstructured data poses enormous challenges. The process of examining massive amounts of data to reveal hidden patterns and unknown correlations is called big data analytics, so big data implementations need to be analyzed and executed as accurately as possible. The proposed model converts unstructured social media data into a structured form so that it can be queried efficiently using the Hadoop MapReduce framework. Big data mining is essential in order to extract value from massive amounts of data, and MapReduce handles big data more efficiently than traditional techniques. The proposed combination of the Knuth-Morris-Pratt linguistic string matching algorithm and the K-Means clustering algorithm provides a suitable platform for extracting value from massive amounts of data and producing recommendations for the user. Linguistic matching techniques such as the Knuth-Morris-Pratt algorithm are very useful for returning matches appropriate to a user's query, while K-Means clusters the data using a vector space model, making it an appropriate method for producing recommendations for the user.
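The abstract names Knuth-Morris-Pratt as its string-matching step. The paper's Hadoop/MapReduce integration is not shown here; as a minimal sketch, the plain KMP algorithm applied to matching a query term against text looks like this:

```python
# Plain Knuth-Morris-Pratt string matching: a sketch of the matching step
# the abstract describes, independent of any MapReduce wrapping.

def build_failure(pattern):
    """Prefix-function table: fail[i] is the length of the longest proper
    prefix of pattern[:i+1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):          # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]            # continue, allowing overlaps
    return hits

print(kmp_search("big data and bigger data", "data"))  # [4, 20]
```

Because the failure table never re-examines matched characters, the search runs in O(n + m) time, which is what makes KMP attractive for matching queries against large text collections.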
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata is the information about the data to be stored in a Data Warehouse, and it is a mandatory
element in building an efficient Data Warehouse. Metadata supports data integration, lineage, data
quality, and populating transformed data into the warehouse. Spatial data warehouses are based on
spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional
systems specific to an application or enterprise. Metadata design and deployment is the most critical
phase in building a data warehouse, where spatial information and data modeling must be brought
together. In this paper, we present a holistic metadata framework that drives metadata creation for a
spatial data warehouse. Theoretically, the proposed framework improves the efficiency of data access
in response to frequent queries on SDWs; in other words, it decreases query response time while
accurate information, including the spatial information, is fetched from the Data Warehouse.
With the development of databases, the volume of stored data increases rapidly, and much important
information is hidden in these large amounts of data. If that information can be extracted, it can
create a lot of value for the organization; the question organizations are asking is how to extract
it. The answer is data mining. Many technologies are available to data mining practitioners,
including Artificial Neural Networks, Genetic Algorithms, Fuzzy Logic, and Decision Trees. Many
practitioners are wary of Neural Networks because of their black-box nature, even though they have
proven themselves in many situations. This paper gives an overview of artificial neural networks and
examines their position as a preferred tool among data mining practitioners.
A Survey of Agent Based Pre-Processing and Knowledge Retrieval (IOSR Journals)
Abstract: Information retrieval is a major task in the present scenario, as the quantum of data is
increasing at tremendous speed. Managing and mining knowledge for different users according to their
interests is therefore the goal of every organization, whether it deals with grid computing, business
intelligence, distributed databases, or any other field. To achieve this goal of extracting quality
information from large databases, software agents have proved to be a strong pillar. Over the decades,
researchers have applied the concept of multi-agent systems to the data mining process by focusing on
its various steps, among which data pre-processing is the most sensitive and crucial, since the
quality of the retrieved knowledge depends entirely on the quality of the raw data. Many methods and
tools can pre-process data automatically using intelligent (self-learning) mobile agents, in both
distributed and centralized databases, but various quality factors still need attention to improve the
quality of the retrieved knowledge. This article reviews the integration of the two emerging fields of
software agents and knowledge retrieval, with a focus on the data pre-processing step.
Keywords: Data Mining, Multi Agents, Mobile Agents, Preprocessing, Software Agents
A new hybrid algorithm for business intelligence recommender system (IJNSA Journal)
Business Intelligence is a set of methods, processes, and technologies that transform raw data into
meaningful and useful information. A recommender system is a business intelligence system used to
deliver knowledge to the active user for better decision making. Recommender systems apply data
mining techniques to the problem of making personalized recommendations for information, and the
growth in the amount of information and the number of users in recent years poses challenges for
them. Collaborative, content-based, demographic, and knowledge-based are the four main types of
recommender systems. In this paper, a new hybrid algorithm is proposed for a recommender system that
combines knowledge-based recommendation, user profiles, and a most-frequent-item mining technique to obtain intelligence.
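The hybrid algorithm itself is not reproduced in this abstract. As one small, hedged piece of it, the "most frequent item mining" ingredient can be sketched as a simple support count over transactions (the item names and the support threshold below are invented for illustration):

```python
# Sketch of frequent-item mining by support counting. This illustrates only
# the "most frequent item" ingredient of the hybrid recommender; the
# knowledge-based and user-profile parts are not shown.
from collections import Counter

def most_frequent_items(transactions, min_support):
    """Return items appearing in at least min_support transactions,
    most frequent first."""
    # set(t) so an item counts once per transaction, as in support counting
    counts = Counter(item for t in transactions for item in set(t))
    return [item for item, c in counts.most_common() if c >= min_support]

baskets = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "milk"],
    ["eggs"],
]
print(most_frequent_items(baskets, 2))  # milk first (support 3)
```

A recommender could then suggest the frequent items a given user's profile does not already contain; that join with user profiles is where the paper's knowledge-based component would come in.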
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, several
attempts have been made by various researchers to deal with the problem of scheduling the Extract-
Transform-Load (ETL) process. In this paper we therefore present several approaches for enhancing the
Extract, Transform, and Load stages of data warehousing. We focus on enhancing the performance of the
extract and transform phases by proposing two algorithms that reduce the time needed in each phase by
exploiting the hidden semantic information in the data; using this information, a large volume of
useless data can be pruned at an early design stage. We also address the problem of scheduling the
execution of ETL activities, with the goal of minimizing ETL execution time. We investigate this area
by choosing three scheduling techniques for ETL and experimentally showing their behavior in terms of
execution time in the sales domain, to understand the impact of implementing each of them and to
choose the one leading to the maximum performance enhancement.
We live in a world that produces a vast amount of digital data, known as big data, and the world is becoming ever more connected via the Internet of Things (IoT), which has been a major influence on the big data landscape. The analysis of such big data takes business competition to the next level of innovation and productivity.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Abstract: Learning Analytics by nature relies on computational information-processing activities intended to extract, from raw data, interesting aspects that can be used to gain insights into the behaviors of learners, the design of learning experiences, and so on. A large variety of computational techniques can be employed, all with interesting properties, but it is the interpretation of their results that really forms the core of the analytics process. As rising subjects, data mining and business intelligence play an increasingly important role in decision-support activity in every walk of life. The Variance Rover System (VRS) focuses on the large data sets obtained from online web visits, categorizes them into clusters according to similarity, and supports the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company, so as to make optimized and beneficial business-expansion decisions. Keywords: Analytics, Business intelligence, Clustering, Data Mining, Standard K-means, Optimized K-means
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...) (iosrjce)
In distributed peer-to-peer systems, huge amounts of data are dispersed. Grouping those data from
multiple sources is a tedious task. By applying effective data mining techniques, clustering
distributed data becomes easier, lowering the hurdles of processing, storage, and transmission costs.
To perform dynamic distributed clustering, a fully decentralized clustering method has been proposed.
HGD Cluster can cluster a data set dispersed among a large number of nodes in a distributed
environment using hierarchical and grid-based clustering techniques, and it applies even when nodes
are fully asynchronous, decentralized, and subject to churn. The general design principles employed in
the proposed algorithm also allow customization for other classes of clustering, and it is fully
capable of clustering dynamic and distributed data sets. Using the algorithm, every node can maintain
summarized views of the data set. The main aim of the proposed system is to customize HGD Cluster to
execute the hierarchical and grid-based clustering methods on these summarized views; coping with
dynamic data is made possible by gradually adapting the clustering model.
New proximity estimate for incremental update of non uniformly distributed cl... (IJDKP)
Conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters, but many real-life databases keep growing incrementally. For such dynamic databases, the
patterns extracted from the original database become obsolete; conventional clustering algorithms are
unsuitable for incremental databases because they cannot modify the clustering results in accordance
with recent updates. In this paper, the author proposes a new incremental clustering algorithm called
CFICA (Cluster Feature-based Incremental Clustering Approach for numerical data) to handle numerical
data, and suggests a new proximity metric called the Inverse Proximity Estimate (IPE), which considers
the proximity of a data point to a cluster representative as well as its proximity to the farthest
point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data
point in a cluster.
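The abstract does not give IPE's exact formula, only the two quantities it combines. The sketch below is therefore a hypothetical stand-in, summing the distance to the cluster representative and the distance to the cluster's farthest member, purely to illustrate the idea of a proximity estimate that looks beyond the centroid alone:

```python
# Hypothetical illustration of an IPE-style proximity estimate. The actual
# CFICA/IPE formula is not given in the abstract; the sum used here is an
# invented stand-in, not the published metric.
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximity_score(point, representative, members):
    """Smaller score = 'closer' to the cluster. Combines distance to the
    representative with distance to the farthest cluster member (invented
    combination, for illustration only)."""
    farthest = max(members, key=lambda m: euclid(point, m))
    return euclid(point, representative) + euclid(point, farthest)

def assign(point, clusters):
    """Assign point to the cluster with the lowest score.
    `clusters` maps a label to (representative, member_list)."""
    return min(clusters, key=lambda c: proximity_score(point, *clusters[c]))
```

Whatever the true formula, the design intent stated in the abstract is the same: a compact cluster whose farthest member is still near the candidate point should win over a sprawling one whose centroid merely happens to be close.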
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and
summarizing it into useful information - information that can be used to increase revenue, cut costs,
or both. Data mining software is one of a number of analytical tools for analyzing data: it allows
users to analyze data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases. The goal of clustering is to determine the
intrinsic grouping in a set of unlabeled data, but how do we decide what constitutes a good
clustering? It can be shown that there is no absolute “best” criterion independent of the final aim of
the clustering; consequently, it is the user who must supply this criterion, in such a way that the
clustering result suits their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), finding “natural” clusters and describing their unknown properties (“natural” data types),
finding useful and suitable groupings (“useful” data classes), or finding unusual data objects
(outlier detection). Of late, clustering techniques have been applied in areas that involve browsing
gathered data or categorizing the results returned by search engines in reply to user queries. In this
paper, we provide a comprehensive survey of document clustering.
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma... (IJECEIAES)
Leakage and misuse of sensitive data is a challenging problem for enterprises, and it has become more serious with the advent of cloud computing and big data. The rationale behind this is the increased outsourcing of data to the public cloud and the publishing of data for wider visibility. Therefore, Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM), and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era: PPDP and PPDM can protect privacy at the data and process levels respectively. With big data, preserving privacy has become indispensable because data is stored and processed in a semi-trusted environment. In this paper we propose a comprehensive methodology for effective sanitization of data based on a misusability measure, preserving privacy to get rid of data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming, and we propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy-preserving MapReduce programming.
With the rapid development of Geographic Information Systems (GIS) and their applications, more and
more geographical databases have been developed by different vendors. However, data integration and
access remain a big problem for the development of GIS applications, as no interoperability exists
among different spatial databases. In this paper we propose a unified approach for spatial data
queries. The paper describes a framework for integrating information from repositories containing
different vector data formats and repositories containing raster data sets. The presented approach
converts the different vector data formats into a single unified format (File Geodatabase, “GDB”). In
addition, we employ metadata to support a wide range of user queries that retrieve relevant geographic
information from heterogeneous and distributed repositories. This enhances both query processing and performance.
One of the most important problems in modern finance is finding efficient ways to summarize and
visualize stock market data, giving individuals or institutions useful information about market
behavior for investment decisions. Investment can be considered one of the fundamental pillars of the
national economy, so at present many investors look for criteria by which to compare stocks and select
the best, and they choose strategies that maximize the earning value of the investment process. The
enormous amount of valuable data generated by the stock market has attracted researchers to explore
this problem domain using different methodologies, and research in data mining has gained high
attention due to the importance of its applications and the ever-increasing generation of information.
Data mining tools such as association rules, rule induction methods, and the Apriori algorithm are
used to find associations between different scripts of the stock market, and much research and
development has addressed the reasons behind fluctuations of the Indian stock exchange. Nowadays, two
factors, gold prices and US dollar prices, strongly influence the Indian stock market; statistical
correlation is used to find the relationship between gold prices, dollar prices, and the BSE index,
which helps the activities of stock operators, brokers, investors, and jobbers. These activities are
based on forecasting the fluctuation of index share prices, gold prices, dollar prices, and customer
transactions. Hence the researcher has taken these problems as a topic for research.
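The correlation step the abstract mentions is plain Pearson correlation between price series. A minimal sketch follows; the gold and BSE values below are invented placeholders, not market data:

```python
# Pearson correlation between two price series, as a sketch of the
# statistical-correlation step described above. The series are toy values.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r in [-1, 1] for equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [100, 102, 101, 105, 107]   # invented gold-price series
bse  = [50, 49, 50, 47, 46]        # invented index series

print(round(pearson(gold, bse), 3))  # about -0.991: strong negative relation
```

A value near -1 or +1 signals a strong linear relationship; a value near 0 means the two series carry little linear information about each other, which is the distinction the abstract's gold/dollar/BSE analysis relies on.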
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer (IJERA Editor)
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his own definition of toughness and easiness, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. A live experiment was conducted on students: an exam was given to computer science students using MOODLE (an LMS), the generated data was analyzed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps identify the students who need special advising or counselling by the teacher, in order to deliver a high quality of education.
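The study above ran K-Means inside RapidMiner rather than in code. As a hedged, minimal sketch of what that clustering does, here is Lloyd's K-Means algorithm in pure Python grouping one-dimensional exam scores; the score values and k=2 are invented for illustration:

```python
# Minimal K-Means (Lloyd's algorithm) on 1-D exam scores. A sketch of the
# clustering the case study performed in RapidMiner; scores are invented.

def kmeans(points, k, iters=100):
    """Return (centroids, labels) for 1-D numeric data."""
    # Naive init: pick k roughly evenly spaced values from the sorted data.
    centroids = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(p - centroids[j]))
                  for p in points]
        # Recompute each centroid as the mean of its assigned points.
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:   # converged: assignments stopped changing
            break
        centroids = new
    return centroids, labels

scores = [35, 40, 42, 78, 85, 90]          # invented exam scores
centroids, labels = kmeans(scores, 2)
print(centroids, labels)                   # low-scoring vs high-scoring group
```

With two clusters, the low-scoring group falls out directly as the cluster with the smaller centroid, which is exactly the "students who need special advising" list the case study aims to produce.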
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Identification of Reserved Energy Resource Potentials for Nigeria Power Gener... (IJERA Editor)
Electrical power is the most widely used form of power in the industrialized countries. In Nigeria, the erratic pattern of electricity supply has affected every aspect of the economy and therefore requires strong political will and commitment on the part of Government to tackle. The solution to this problem lies in identifying and harnessing the abundant reserve energy resources available in various locations all over the country. This paper thus dwells on identifying the various sources of reserve energy potential that abound in Nigeria which, when harnessed and deployed appropriately, will be sufficient to provide for both the immediate and future electric power needs of the country. The approach deployed in the study included reviewing the available statistical data on Nigeria's reserve energy resources and identifying the scale of their availability, their location, and the realizable amount of electric power from each reserve. The results show that the proven and estimated reserve energy resources of coal, natural gas, and new hydro potential could contribute a total of 96,079.40 MW to the grid system, which, when added to the existing installed electric power generation capacity of 12,066 MW, gives a total of 108,145.40 MW.
Understanding Construction Workers’ Risk Decisions Using Cognitive Continuum ... (IJERA Editor)
During the course of performing daily tasks, construction workers encounter numerous hazards, such as ladders that are too short to reach the work area, energized electrical lines, or inadequate fall protection. When a hazard is encountered, the worker must make a rapid decision about how to respond and whether to take or avoid the risk. The goal of this research was to construct a theory about the influence of decision cues on intuitive and deliberative decision-making in high-hazard construction environments. Drawing from Cognitive Continuum Theory, the study specifies a framework for understanding why and how construction workers make decisions that lead to taking or avoiding physical risks when they encounter daily hazards. A secondary aim of the research was to construct a set of hypotheses about how specific decision cues influence whether a worker is more likely to engage their intuitive impulses or to use careful deliberation when responding to a hazard. These hypotheses are described and the efficacy of the hypotheses was evaluated using cross-tabulations and nonparametric measures of association. While most of the associations between decision cues and decision mode (i.e., intuition or deliberation) identified in this data set were generally modest, none of the associations were statistically zero, thus indicating that further research is warranted based on theoretical grounds. A rigorous program of theory testing is the next logical step to the research.
Influence of Ruthenium doping on Structural and Morphological Properties of M... (IJERA Editor)
The present work examines the effect of Ru doping on MoO3 thin films deposited on steel substrates by the sol-gel spin-coating method. The annealing temperature was 600 °C for pure MoO3 and 800 °C for the Ru-doped thin films. The Ru doping concentration was varied from 10 to 50 wt%. The influence of Ru doping on the structural and morphological properties of the MoO3 thin films was studied. XRD revealed that all films are highly crystalline, with a monoclinic phase for the molybdenum peaks. In the XRD pattern of the doped films some new peaks were observed that match the ruthenium orthorhombic phase, indicating incorporation of the dopant into the pure molybdenum oxide; the same is confirmed by compositional analysis with EDAX. The SEM images of the MoO3 show a rod-like surface with porous morphology. Incorporation of Ru ions into the molybdenum oxide decreases the length of the rods, which vanish above 40 wt%. The tetragonal grain size increases from 20 wt% Ru and becomes maximal in the 50 wt% Ru-doped thin films.
We are living in a world, where a vast amount of digital data which is called big data. Plus as the world becomes more and more connected via the Internet of Things (IoT). The IoT has been a major influence on the Big Data landscape. The analysis of such big data brings ahead business competition to the next level of innovation and productivity.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Abstract Learning Analytics by nature relies on computational information processing activities intended to extract from raw data some interesting aspects that can be used to obtain insights into the behaviors of learners, the design of learning experiences, etc. There is a large variety of computational techniques that can be employed, all with interesting properties, but it is the interpretation of their results that really forms the core of the analytics process. As a rising subject, data mining and business intelligence are playing an increasingly important role in the decision support activity of every walk of life. The Variance Rover System (VRS) mainly focused on the large data sets obtained from online web visiting and categorizing this into clusters according some similarity and the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company, so as to take optimized and beneficial decisions of business expansion. Keywords: Analytics, Business intelligence, Clustering, Data Mining, Standard K-means, Optimized K-means
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...iosrjce
In distributed peer-to-peer systems, huge amount of data are dispersed. Grouping of those data from
multiple sources is a tedious task. By applying effective data mining techniques the clustering of distributed data
is become ease and this decreases the hurdles of clustering due to processing, storage, and transmission costs.
To perform a dynamic distributed clustering, a fully decentralized clustering method has been proposed. HGD
Cluster can cluster a data set which is dispersed among a large number of nodes in a distributed environment
using hierarchal and grid based clustering techniques. When nodes are fully asynchronous and decentralized
and also adaptable to stir, then HGD cluster will apply. The general design principles employed in the proposed
algorithm also allow customization for other classes of clustering. It is fully capable of clustering dynamic and
distributed data sets. Using the algorithm, every node can maintain summarized views of the dataset.
Customizing HGD Cluster for execution of the hierarchal-based and grid-based clustering methods on the
summarized views is the main aim of the proposed system. Coping with dynamic data is made possible by
gradually adapting the clustering model.
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
Conventional clustering algorithms mine static databases and generate a set of patterns in the form of clusters. Many real-life databases keep growing incrementally, and for such dynamic databases the patterns extracted from the original database become obsolete. Conventional clustering algorithms are therefore unsuitable for incremental databases, as they lack the capability to modify clustering results in accordance with recent updates. In this paper, the author proposes a new incremental clustering algorithm called CFICA (Cluster Feature-based Incremental Clustering Approach for numerical data) to handle numerical data, and suggests a new proximity metric called the Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data point in a cluster.
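The metric as described (distance to the cluster representative combined with distance to the farthest point in the vicinity) can be sketched as follows; the exact combination CFICA uses may differ, so this simple sum is an assumption for illustration.

```python
import math

def inverse_proximity_estimate(point, centroid, cluster_points):
    """Hedged sketch of an IPE-style measure: combine the distance from
    `point` to the cluster representative (centroid) with its distance
    to the farthest member of that cluster. The weighting is assumed."""
    d_rep = math.dist(point, centroid)
    d_far = max(math.dist(point, p) for p in cluster_points)
    return d_rep + d_far

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
centroid = (1 / 3, 1 / 3)
score = inverse_proximity_estimate((0.5, 0.5), centroid, cluster)
```

A point would then be assigned to the cluster minimizing this score, which penalizes clusters whose far members lie away from the candidate point.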
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied to browsing gathered data and to categorizing the results that search engines return in reply to user queries. In this paper, we provide a comprehensive survey of document clustering.
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...IJECEIAES
Leakage and misuse of sensitive data is a challenging problem for enterprises, and it has become more serious with the advent of cloud and big data. The rationale behind this is the increase in outsourcing of data to the public cloud and in publishing data for wider visibility. Therefore Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM) and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels respectively. With big data, privacy of data has become indispensable, since data is stored and processed in a semi-trusted environment. In this paper we propose a comprehensive methodology for effective sanitization of data based on a misusability measure, preserving privacy so as to get rid of data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming, and propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability prior to choosing and applying appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy-preserving MapReduce programming.
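The idea of choosing a sanitization operation according to a misusability score can be sketched as follows. This is a hypothetical illustration in the spirit of MMPP, not the paper's algorithm: the thresholds, the operations, and the field scores are all invented for the example.

```python
def sanitize(record, misusability):
    """Misusability-gated sanitization sketch: the higher a field's
    misusability score (assumed to be in [0, 1]), the stronger the
    operation applied. Thresholds and operations are assumptions."""
    out = {}
    for field, value in record.items():
        score = misusability.get(field, 0.0)
        if score >= 0.8:
            out[field] = "***"                                   # suppress outright
        elif score >= 0.4:
            out[field] = str(value)[0] + "*" * (len(str(value)) - 1)  # partial mask
        else:
            out[field] = value                                   # publish as-is
    return out

rec = {"name": "Alice", "zip": "50210", "age": 34}
safe = sanitize(rec, {"name": 0.9, "zip": 0.5, "age": 0.1})
# safe == {"name": "***", "zip": "5****", "age": 34}
```

In a MapReduce setting this per-record function would naturally run in the map phase, which is what makes the approach parallelizable over big data.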
With the rapid development in Geographic Information Systems (GISs) and their applications, more and
more geographical databases have been developed by different vendors. However, data integration and
accessing is still a big problem for the development of GIS applications as no interoperability exists among
different spatial databases. In this paper we propose a unified approach for spatial data query. The paper
describes a framework for integrating information from repositories containing different vector data sets
formats and repositories containing raster datasets. The presented approach converts different vector data
formats into a single unified format (File Geo-Database “GDB”). In addition, we employ “metadata” to
support a wide range of users’ queries to retrieve relevant geographic information from heterogeneous and
distributed repositories. Such an employment enhances both query processing and performance.
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data so as to give individuals or institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so at present many investors look for criteria to compare stocks, select the best ones, and choose strategies that maximize the earning value of the investment process. The enormous amount of valuable data generated by the stock market has attracted researchers to explore this problem domain using different methodologies, and research in data mining has gained high attention due to the importance of its applications and the increasing generation of information. Data mining tools such as association rules, rule induction methods and the Apriori algorithm are used to find associations between different scripts of the stock market, and much research and development has taken place regarding the reasons for fluctuations of the Indian stock exchange. Nowadays two factors, gold prices and US dollar prices, dominate the Indian stock market. Statistical correlation is used to find the correlation between gold prices, dollar prices and the BSE index, which helps the activities of stock operators, brokers, investors and jobbers, who rely on forecasting the fluctuation of index share prices, gold prices, dollar prices and customer transactions. Hence the researcher has taken up these problems as a topic for research.
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his own definition of toughness and easiness, and while there is no absolute scale for measuring knowledge, examination scores indicate the performance of a student. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. A live experiment was conducted on students of the computer science major: an exam was conducted using MOODLE (LMS), the data generated was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling by the teacher, so as to deliver a high quality of education.
Identification of Reserved Energy Resource Potentials for Nigeria Power Gener...IJERA Editor
Electrical power is the most widely used form of power in industrialized countries. In Nigeria, the epileptic pattern of electricity supply has affected every aspect of the economy and therefore requires strong political will and commitment on the part of Government to tackle. The solution to this problem lies in identifying and harnessing the abundant reserve energy resources available in various locations all over the country. This paper therefore dwelt on identifying the various sources of reserve energy potential that abound in Nigeria, which, when harnessed and deployed appropriately, will be sufficient to provide for both the immediate and future electric power needs of the country. The approach deployed in the study includes reviewing available statistical data on Nigeria's reserve energy resources and identifying the scale of their availability, their location, and the realizable amount of electric power from each reserve. The results show that the proven and estimated reserve energy resources of coal, natural gas and new hydro potentials could contribute a total of 96,079.40 MW to the grid system, which, when added to the existing installed electric power generation capacity of 12,066 MW, gives a total of 108,145.40 MW.
Understanding Construction Workers’ Risk Decisions Using Cognitive Continuum ...IJERA Editor
During the course of performing daily tasks, construction workers encounter numerous hazards, such as ladders that are too short to reach the work area, energized electrical lines, or inadequate fall protection. When a hazard is encountered, the worker must make a rapid decision about how to respond and whether to take or avoid the risk. The goal of this research was to construct a theory about the influence of decision cues on intuitive and deliberative decision-making in high-hazard construction environments. Drawing from Cognitive Continuum Theory, the study specifies a framework for understanding why and how construction workers make decisions that lead to taking or avoiding physical risks when they encounter daily hazards. A secondary aim of the research was to construct a set of hypotheses about how specific decision cues influence whether a worker is more likely to engage their intuitive impulses or to use careful deliberation when responding to a hazard. These hypotheses are described and the efficacy of the hypotheses was evaluated using cross-tabulations and nonparametric measures of association. While most of the associations between decision cues and decision mode (i.e., intuition or deliberation) identified in this data set were generally modest, none of the associations were statistically zero, thus indicating that further research is warranted based on theoretical grounds. A rigorous program of theory testing is the next logical step to the research.
Influence of Ruthenium doping on Structural and Morphological Properties of M...IJERA Editor
The present work examines the effect of Ru doping on MoO3 thin films deposited on steel substrates by the sol-gel spin coat method. The annealing temperature was 600 °C for pure MoO3 and 800 °C for the Ru-doped thin films. The doping concentration of Ru was varied from 10 to 50 wt%, and the influence of Ru doping on the structural and morphological properties of the MoO3 thin films was studied. XRD revealed that all films are highly crystalline in nature, with a monoclinic phase for the molybdenum peaks. In the XRD pattern of the doped films some new peaks were observed that match the ruthenium orthorhombic phase, indicating incorporation of the dopant into pure molybdenum oxide; the same is confirmed by compositional analysis with EDAX. The SEM images of MoO3 show a rod-like surface with porous morphology. Incorporation of Ru ions in molybdenum oxide decreases the length of the rods, which vanish above 40 wt%. Tetragonal grain size increases from 20 wt% of Ru and becomes maximum at 50 wt% in the Ru-doped thin films.
Solar Module Modeling, Simulation And Validation Under Matlab / SimulinkIJERA Editor
Solar modules are systems that convert sunlight into electricity using the physics of semiconductors. Mathematical modeling of these systems uses weather data such as irradiance and temperature as inputs and provides the current, voltage or power as outputs, which allows plotting the characteristic giving the intensity I as a function of the voltage V for photovoltaic cells. In this work, we have developed a one-diode model of a photovoltaic module in the Matlab/Simulink environment. From this model, we have plotted the characteristic I-V and P-V curves of a solar cell for different values of temperature and sunlight. Validation has been done by comparing the experimental power curve of a HORONYA 20W solar panel with that obtained from the model.
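The ideal single-diode cell model behind such I-V curves is I = Iph - I0·(exp(V/(n·Vt)) - 1). A short sketch of the sweep that traces the curve; the parameter values here (photocurrent, saturation current, ideality factor, thermal voltage) are illustrative assumptions, not the HORONYA panel's data.

```python
import math

def diode_model_current(v, i_ph=1.2, i_0=1e-9, n=1.3, v_t=0.02585):
    """Ideal single-diode PV cell (no series/shunt resistance):
    I = Iph - I0*(exp(V/(n*Vt)) - 1). All parameters are assumed values."""
    return i_ph - i_0 * (math.exp(v / (n * v_t)) - 1.0)

# Sweep voltage to trace the I-V characteristic of one cell.
curve = [(round(v, 2), diode_model_current(v)) for v in [0.0, 0.2, 0.4, 0.5]]
```

At V = 0 the model returns the short-circuit current Iph; as V rises toward the open-circuit voltage, the exponential diode term pulls the current down, giving the familiar knee of the I-V curve. Power P = V·I then yields the P-V curve.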
Design Optimization Of Chain Sprocket Using Finite Element AnalysisIJERA Editor
The chain sprocket is one of the important components of a chain drive for transmitting power from one shaft to another. To ensure efficient power transmission, the chain sprocket should be properly designed and manufactured, and there is a possibility of weight reduction in the chain drive sprocket. In this study, a chain sprocket is designed and analysed using Finite Element Analysis for safety and reliability. ANSYS software is used for static and fatigue analysis of the sprocket design, and these results are used to optimize the sprocket for weight reduction. As the sprocket undergoes vibration, modal analysis is also performed.
A study on the Noise Radiation of a Power Pack for Construction EquipmentIJERA Editor
The Power Pack of these machines is composed of an engine, a fuel tank and an oil tank; energy is transferred to the Casing Rotor through the engine inside the Power Pack. In this study, the noise generated by a Power Pack was predicted in order to provide suggestions on how to improve the noise level. First, we constructed a 3D model of the Power Pack and performed a finite element analysis using ANSYS. We then analysed the Power Pack through acoustic analysis of the generated noise. Finally, we suggest a way to reduce the noise during the design stage using the analysis results.
Effects of Zno on electrical properties of Polyaniline CompositesIJERA Editor
In the present investigation, polyaniline/zinc oxide composites with various weight percentages of zinc oxide (10%, 20%, 30%, 40% and 50%) were synthesized by the in-situ polymerization method. The prepared composites were characterized by X-ray diffraction (XRD), Scanning Electron Microscopy (SEM) and Fourier Transform Infrared Spectroscopy (FTIR). The dc conductivity of the samples was measured as a function of temperature in the range 30-180 °C, and it was found that increasing the concentration of ZnO particles increases the conductivity. The ac conductivity of the composites was studied with respect to frequency.
Speed Control of Induction Motor by V/F MethodIJERA Editor
This paper presents the design and implementation of a constant voltage/frequency-ratio controller based on the sinusoidal pulse width modulation technique for a single-phase induction motor using fuzzy logic. The work involves implementation of a closed-loop control scheme for an induction motor. The technique is used extensively in industry as it provides the required accuracy at minimal cost. V/f-controlled motors fall under the category of variable-voltage variable-frequency drives, in which the ratio of voltage to frequency is kept constant.
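The constant-ratio rule itself is simple arithmetic: the commanded stator voltage scales linearly with the commanded frequency, clamped at the rated voltage above base speed. The rated values below (230 V, 50 Hz) are illustrative assumptions, not the paper's machine data.

```python
def vf_voltage(freq_hz, v_rated=230.0, f_rated=50.0):
    """Constant V/f scalar control: keep stator voltage proportional to
    frequency so the air-gap flux stays roughly constant. Rated values
    are assumed for illustration; clamp at rated voltage above base speed."""
    ratio = v_rated / f_rated          # the constant V/f ratio
    return min(ratio * freq_hz, v_rated)

v = vf_voltage(25.0)   # half the rated frequency -> roughly half the voltage
```

Keeping V/f constant keeps the magnetizing flux (proportional to V/f) near its rated value, which is why the motor can deliver near-rated torque across the speed range.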
Strategic Planning of Water System Projects in AlexandriaIJERA Editor
Alexandria is one of the major cities on the Mediterranean Sea. Over the past 40 years, Alexandria's population has doubled, so water requirements are continuously increasing. This paper develops a framework to support decision-makers in the water sector in planning major projects in Alexandria up to 2037. First, data gathering was conducted, population forecasts were calculated by arithmetic and geometric methods, future water demands were computed, and an outline of major projects was proposed. Finally, project priorities were determined by applying two methods of solving Multiple Criteria Decision Making (MCDM) problems. The first is the Weighted Scoring Method (WSM), a powerful and flexible method of comparing similar items against a standard, prioritized list of requirements or criteria. The second is the Analytical Hierarchy Process (AHP), which is based on comparative evaluation. The results were then analyzed, focusing first on the difference in the criteria weights of the alternatives between the two methods and second on comparing the preference orders of the alternatives; there was not much difference in the final results. The results offered some evidence that AHP makes the selection process very transparent.
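The Weighted Scoring Method boils down to a weighted sum of each alternative's ratings against the criteria. A minimal sketch; the criteria, weights and project names below are invented for illustration, not the study's actual data.

```python
def weighted_score(scores, weights):
    """Weighted Scoring Method: an alternative's overall score is the
    weighted sum of its per-criterion ratings. Weights should sum to 1."""
    return sum(s * w for s, w in zip(scores, weights))

weights = [0.5, 0.3, 0.2]               # e.g. cost, urgency, feasibility (assumed)
projects = {
    "new treatment plant": [7, 9, 6],   # illustrative ratings on a 1-10 scale
    "network rehab":       [8, 6, 7],
}
ranked = sorted(projects, key=lambda p: weighted_score(projects[p], weights),
                reverse=True)
```

AHP differs in how the weights themselves are derived: from pairwise comparisons of the criteria rather than direct assignment, which is one source of the weight differences the paper observes.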
A Review of Strategies to Promote Road Safety in Rich Developing Countries: t...IJERA Editor
Road safety policies, strategies and action plans, along with trends in road traffic injuries (RTIs) in the oil-rich Gulf Cooperation Council (GCC) countries, were examined to appraise their road safety work, with the overall objective of identifying key measures and initiatives that would reduce road traffic accidents and their consequences in these countries. Data on RTIs was obtained from police and from vital statistics and was analyzed. Research papers, policy documents, and strategies obtained from relevant stakeholders in the six GCC countries were reviewed and discussed. Traffic safety programs and action plans, which were the most fundamental documents in the development of the GCC countries' road safety policies and strategies, were reviewed; policy documents on road safety and traffic-related issues were searched on the websites of the relevant authorities; and published research on road safety in GCC countries was searched using available databases. Analysis of accident data shows that fatality rates in all the GCC countries are much higher than in developed countries with good safety records. The six administrations started fundamental traffic safety programs to combat the increase in RTIs, with some succeeding in reducing RTI rates by implementing vast road safety improvements. However, RTIs increased again, mainly because of increasing traffic volume and high-risk driving behavior. Developing and implementing national road safety strategies in some GCC countries was successful in reducing RTI rates. The road safety situation in the six GCC countries was assessed, showing high crash and fatality rates compared to developed countries. Most GCC countries still suffer from a sustained increase in traffic crashes despite efforts to reduce their magnitude and severity. Some of these countries have developed and implemented national road safety strategies, while countries like Oman still need to develop such a long-term strategy.
Following the review of the current progress in road safety initiatives developed or implemented, it is apparent that there is still considerable room for improvement. Given that the oil-rich GCC countries have similar economic, social, and political backgrounds, a number of specific areas of action common to all were identified to achieve a safer road environment in the studied countries.
Wake-up-word speech recognition using GPS on smart phoneIJERA Editor
Wake-Up-Word (WUW) is a new, not yet widely recognized paradigm of speech recognition. Lately, the use of GPS in everyday life has increased widely, which means that our needs have changed: in the digital era we can use a new paradigm of voice control for maps, which would benefit people while driving a car. In this paper we present a set of voice commands to integrate into map and navigation voice control. Using voice control for a Global Positioning System (GPS) helps to determine and track a precise location using the Google API. The benefit of this application would be avoiding car accidents by using speech commands instead of typing.
A Review on Reversible Data Hiding Scheme by Image Contrast EnhancementIJERA Editor
In today's world of demand for high-quality images, security of information on the internet is one of the most important research issues. Data hiding is a method of hiding useful information by embedding it in another image (the cover image) to provide security, so that only an authorized person is able to extract the original information from the embedded data. This paper is a review describing several different algorithms for Reversible Data Hiding (RDH). Previous literature has shown that histogram modification, histogram equalization (HE) and interpolation are the most common methods for data hiding; to improve security, these methods are used in encrypted images. This paper is a comprehensive study of all the major reversible data hiding approaches found in the literature.
Structural and Morphological Properties of Mn-Doped Co3O4 ThinFilm Deposited ...IJERA Editor
In this study, a series of manganese (Mn)-doped cobalt oxide (Co3O4) thin films were deposited on steel substrates by the sol-gel spin coat method, and the influence of Mn doping concentrations ranging from 0.001% to 1% on the physical, structural and morphological properties of the Co3O4 thin films was investigated. Cobalt acetate [(CH3COO)2Co.4H2O], Mn acetate [C4H6MnO4.4H2O] and isopropyl alcohol were used as the starting material, dopant source and reagent respectively. X-ray diffraction analysis indicated that the pure Co3O4 thin film is crystalline in nature, with a cubic phase and [400] preferential orientation. For the Mn-doped films, three new peaks corresponding to the [310], [320] and [420] planes of the orthorhombic MnO2 phase were observed. SEM micrographs showed that incorporation of Mn at the Co site influences the surface morphology of the films; all the films showed tetragonal-shaped grains. EDAX analysis revealed that the amount of Mn in the sample increased with increasing dopant concentration.
Mechanical Properties of Sustainable Adobe Bricks Stabilized With Recycled Su...IJERA Editor
In the pursuit of cheaper and more sustainable building materials to meet housing demands in developing countries like Cameroon, the mechanical properties of adobe bricks stabilized with recycled sugarcane fiber waste were investigated. Laboratory experiments were conducted using sugarcane fiber waste stabilized adobe brick specimens with fiber proportions of 0%, 0.3%, 0.6%, 1.2%, 2% and 3% by weight. Fiber stabilization increased compressive strength by 58.61% for the 3% bricks, reaching 4.79 MPa. Further, the 3% fiber stabilized bricks shrank by 7.49%, while the non-stabilized bricks shrank by 12.13%. Also, the 3% bricks lasted for one week before deterioration when immersed in water, while the non-stabilized bricks lasted for only a few hours. The findings confirmed that sugarcane fiber waste stabilized adobe bricks have improved strength, durability and stability. The use of abandoned sugarcane fiber waste in adobe bricks will contribute to the development of more durable, sustainable and stronger adobe brick structures, as well as reduce the environmental and economic challenges associated with the disposal of sugarcane waste.
Comparison of Total Actual Cost for Different Types of Lighting Bulbs Used In...IJERA Editor
This paper presents a comparison of the actual cost of three different lighting bulbs typically used in houses: incandescent light, compact fluorescent light (CFL) and light-emitting diode (LED) light. The comparison takes into account how much actual power each lighting bulb consumes and how much it costs under Kuwait's regulations, and thus the cost efficiency of each type is calculated. The study deals with typical houses in Kuwait, taking into account the distribution and installation rules (R6) [1] of the Ministry of Water and Electricity in Kuwait for the year 2014.
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
Data mining works to extract information that is not known in advance from enormous quantities of data, which can lead to knowledge; it provides information that helps in making good decisions. The effectiveness of data mining lies in accessing knowledge so as to achieve its goal: the discovery of the hidden facts contained in databases, through the use of multiple technologies. Clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data based on its characteristics and attributes and performs the clustering by reducing the distances between the data points and the cluster centers. The algorithm is applied using the open source tool WEKA, with an insurance dataset as its input.
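The algorithm WEKA's SimpleKMeans implements is Lloyd's K-means: alternately assign each record to its nearest centroid, then move each centroid to the mean of its cluster. A minimal sketch on 2-D points (the naive "first k points" initialization and the toy data are assumptions for illustration, not WEKA's defaults or the insurance dataset):

```python
import math

def kmeans(points, k, iters=10):
    """Minimal Lloyd's K-means on 2-D points: assignment step, then
    update step, repeated. Initialization is naively the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

records = [(1, 1), (1.5, 1.2), (8, 8), (8.4, 7.6)]   # two obvious groups
centroids, clusters = kmeans(records, k=2)
```

Production implementations use smarter seeding (e.g. k-means++) and multiple restarts, since Lloyd's iteration only finds a local optimum.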
ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MININGcscpconf
Data has become an indispensable part of every economy, industry, organization, business function and individual. Big Data is a term used to identify datasets whose size is beyond the ability of typical database software tools to store, manage and analyze. Big Data introduces unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper presents a literature review of Big Data mining and its issues and challenges, with emphasis on the distinguishing features of Big Data. It also discusses some methods to deal with big data.
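Among the methods for dealing with data at this scale, the MapReduce pattern is the canonical one: map each record to key-value pairs, group by key, reduce each group. A tiny single-machine imitation (word count over lines of text) makes the shape of the pattern concrete; real deployments run the same two functions distributed across a cluster, e.g. under Hadoop.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """In-process sketch of the MapReduce pattern: map each record to
    (key, value) pairs, shuffle (group by key), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

lines = ["big data mining", "big data challenges"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, ones: sum(ones),
)
# counts == {"big": 2, "data": 2, "mining": 1, "challenges": 1}
```

Because the mapper sees one record at a time and the reducer sees one key's group at a time, both steps parallelize trivially, which is what lets the pattern scale past single-machine storage bottlenecks.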
Big data is a prominent term which characterizes the growth and availability of data in all three formats: structured, unstructured and semi-structured. Structured data is located in fixed fields of a record or file and is found in relational databases and spreadsheets, whereas unstructured data includes text and multimedia content. The primary objective of the big data concept is to describe extreme volumes of data sets, both structured and unstructured. Big data is further defined by three "V" dimensions, namely Volume, Velocity and Variety, with two more "V"s also added: Value and Veracity. Volume denotes the size of the data; Velocity refers to the speed of data processing; Variety describes the types of data; Value derives the business value; and Veracity describes the quality of the data and its understandability. Nowadays, big data has become a distinct and preferred research area in the field of computer science. Many open research problems exist in big data, and good solutions have been proposed by researchers, even though there remains a need to develop many new techniques and algorithms for big data analysis in order to obtain optimal solutions. In this paper, a detailed study of big data, its basic concepts, history, applications, techniques, research issues and tools is presented.
We have concentrated on a range of strategies, methodologies, and distinct fields of research in this article, all of which are useful and relevant in the field of data mining technologies. As we all know, numerous multinational and other large corporations operate in various parts of the world, and each business location may create significant amounts of data. Corporate decision-makers need access to all of these data sources in order to make strategic decisions.
The advent of social networks has changed research in computer science. A massive volume of data is now present in the form of Twitter, Facebook, emails, and IoT (Internet of Things), so the storage and analysis of this data has become a great challenge for researchers. Traditional frameworks have failed for the processing of large data. R is an open-source programming framework developed for the analysis of large data that yields better accuracy, and it offers the opportunity of implementation in the R programming language. This paper presents a study on the use of R for the classification of large social network data. The Naïve Bayes algorithm is used for the classification of large Twitter data. The experiment has shown that an enormous amount of data can be sufficiently classified using the R framework with promising results.
A SURVEY ON DATA MINING IN STEEL INDUSTRIES (IJCSES Journal)
In industrial environments, a huge amount of data is being generated, which is collected in databases and data warehouses from all involved areas such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, customer relationship management, and so on. Data mining has become a useful tool for knowledge acquisition in the industrial processes of iron and steel making. Due to the rapid growth of data mining, various industries have started using data mining technology to search for hidden patterns, which might further supply the system with new knowledge that could be used to design new models to enhance production quality, productivity, optimal cost, maintenance, etc. The continuous improvement of all steel production processes regarding the avoidance of quality deficiencies and the related improvement of production yield is an essential task of a steel producer. Therefore, the zero defect strategy is popular today, and to maintain it several quality assurance techniques are used. The present report explains the methods of data mining and describes their application in the industrial environment and especially in the steel industry.
1. Web Mining – Web mining is an application of data mining for di.docx (braycarissa250)
1. Web Mining – Web mining is an application of data mining for discovering data patterns from the web. Web mining is of three categories – content mining, structure mining and usage mining. Content mining detects patterns from data collected by the search engine. Structure mining examines the data which is related to the structure of the website while usage mining examines data from the user’s browser. The data collected through web mining is evaluated and analyzed using techniques like clustering, classification, and association. It is a very good topic for the thesis in data mining.
2. Predictive Analytics – Predictive Analytics is a set of statistical techniques to analyze current and historical data to predict future events. The techniques include predictive modeling, machine learning, and data mining. In large organizations, predictive analytics helps businesses to identify risks and opportunities in their business. Both structured and unstructured data are analyzed to detect patterns. Predictive analysis is a lengthy process and consists of seven stages: project definition, data collection, data analysis, statistics, modeling, deployment, and monitoring. It is an excellent choice for research and thesis.
3. Oracle Data Mining – Oracle Data Mining, also referred as ODM, is a component of Oracle Advanced Analytics Database. It provides powerful data mining algorithms to assist the data analysts to get valuable insights from data to predict the future standards. It helps in predicting the customer behavior which will ultimately help in targeting the best customer and cross-selling. SQL functions are used in the algorithm to mine data tables and views. It is also a good choice for thesis and research in data mining and database.
4. Clustering – Clustering is a process in which data objects are divided into meaningful sub-classes known as clusters. Objects with similar characteristics are aggregated together in a cluster. There are distinct models of clustering, such as centroid-based and distribution-based. In centroid-based clustering, a vector value is assigned to each cluster. There are various applications of clustering in data mining such as market research, image processing, and data analysis. It is also used in credit card fraud detection.
5. Text mining – Text mining or text data mining is a process to extract high-quality information from the text. It is done through patterns and trends devised using statistical pattern learning. Firstly, the input data is structured. After structuring, patterns are derived from this structured data and finally, the output is evaluated and interpreted. The main applications of text mining include competitive intelligence, E-Discovery, National Security, and social media monitoring. It is a trending topic for the thesis in data mining.
6. Fraud Detection – The number of frauds in daily life is increasing in sectors like banking, finance, and government. Accurate detection of fraud is a challenge. Da.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms to generate frequent itemsets. Analysing frequent itemset is a crucial
step in analysing structured data and in finding association relationship between items. This stands as an
elementary foundation to supervised learning, which encompasses classifier and feature extraction
methods. Applying this algorithm is crucial to understand the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
Similar to Frequent Item set Mining of Big Data for Social Media (20)
Immunizing Image Classifiers Against Localized Adversary Attacks (gerogepatton)
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Hierarchical Digital Twin of a Naval Power System (Kerry Sado)
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Welcome to WIPAC Monthly, the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news, to celebrate the 13 years since the group was created we have articles including:
A case study of the use of Advanced Process Control at the wastewater treatment works at Lleida in Spain
A look back at an article on smart wastewater networks to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Final project report on grocery store management system.pdf (Kamal Acharya)
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. Your customers may wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website which retails various grocery products. This project allows viewing the various products available, enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), and also allows placing an order using the Cash on Delivery (Pay Later) option. The project provides easy access for administrators and managers to view orders placed using the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx (R&R Consult)
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co-editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two types of water scarcity: one is physical, and the other is economic water scarcity.
Roshani Pardeshi et al., Int. Journal of Engineering Research and Application, www.ijera.com
ISSN: 2248-9622, Vol. 6, Issue 9 (Part-5), September 2016, pp. 18-25
Frequent Item set Mining of Big Data for Social Media
Roshani Pardeshi, Prof. Madhu Nashipudimath
Pillai Institute of Engineering and Technology, New Panvel, India
ABSTRACT
Big data is a term for massive data sets having a large, more varied and complex structure, with difficulties in
storing, analyzing and visualizing them for further processes or results. Big data includes data from email,
documents, pictures, audio, video files, and other sources that do not fit into a relational database. This
unstructured data brings enormous challenges to big data. The process of research into massive amounts of data
to reveal hidden patterns and secret correlations is named big data analytics. Therefore, big data
implementations need to be analyzed and executed as accurately as possible.
The proposed model structures the unstructured data from social media into a structured form so that the data
can be queried efficiently by using the Hadoop MapReduce framework. Big data mining is essential in order to
extract value from massive amounts of data, and MapReduce is a more efficient method for dealing with big
data than traditional techniques. The proposed linguistic string matching with the Knuth-Morris-Pratt algorithm
and the K-Means clustering algorithm gives a proper platform to extract value from massive amounts of data
and produce recommendations for the user. Linguistic matching techniques such as the Knuth–Morris–Pratt
string matching algorithm are very useful in giving proper matching output for a user query. The K-Means
algorithm works on clustering data using the vector space model and is an appropriate method to produce
recommendations for the user.
Index Terms: Big data, K-Means Clustering, Frequent Itemset Mining, Linguistic Matching, Recommendation
I. INTRODUCTION
Social media is a widely used communication medium. Sites such as Twitter, YouTube, Facebook, and
LinkedIn are frequently used by people [1]. It includes unstructured data from email, documents, pictures,
audio, video files, and other sources. To manage such massive unstructured data and construct efficient,
user-friendly structured data, both the context and the usage pattern of data items are used [2]. There has
been an unprecedented increase in the quantity and variety of data generated worldwide. According to
IDC's Digital Universe study, the world's information is doubling every two years and is predicted to reach
40 ZB by 2020. Such vast datasets are commonly referred to as "Big Data" [3].
Big data is characterized by 3 Vs: the extreme volume of data, the wide variety of types of data, and the
velocity at which the data must be processed. Big data does not refer to any specific quantity; the term is
used when speaking about petabytes and exabytes of data, which cannot be integrated easily. The Hadoop
framework is used to deal with such massive data. Hadoop works on the MapReduce framework.
MapReduce, developed by Google, along with the Hadoop distributed file system, is exploited to find
frequent itemsets from big data on large clusters. The MapReduce framework has two phases, the Map
phase and the Reduce phase. Map and reduce functions are used for large parallel computations specified
by users. The Map function takes a chunk of data from HDFS in (key, value) pair format and generates a
set of intermediate (key′, value′) pairs [1].
Figure 1: 3 Vs of Big Data
Data mining is an essential technique to extract information from large databases. In the competitive
business world it is very necessary to search through this huge data to grasp useful information. To
identify customer behavior and their choices, to know their feedback on products, etc., big data has to be
mined using appropriate mining methods. Frequent itemset mining is a method to identify the itemsets
which appear frequently in a dataset [1].
RESEARCH ARTICLE | OPEN ACCESS
Frequent Itemset Mining Methods
There are different frequent itemset mining methods, such as:
Association, correlation analysis
Sequential, structural pattern
Classification
Clustering
Association, correlation analysis: This is an important data mining model used to find interesting
relationships between the data in a database. It is mainly used for market basket analysis to help improve
business activity. Association rules are obtained using two main criteria, the support and the confidence.
The support indicates how frequently the items appear in the database, and the confidence indicates how
often the rule is found to hold. Association mining discovers relations between items in the same
transaction [2].
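As a quick illustration of support and confidence, the two criteria above can be computed as follows (the transactions and item names here are invented purely for the example):

```python
# Minimal sketch of support and confidence for an association rule A -> B.
# The transaction data below is invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A) for the rule A -> B."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3, about 0.667
```

Here {bread, milk} appears in 2 of the 4 transactions (support 0.5), and of the 3 transactions containing bread, 2 also contain milk (confidence 2/3).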
Sequential, structural pattern: This discovers sets of frequent subsequences in a transactional database.
Sequential pattern mining is a very important concept of data mining. In sequential pattern mining,
events are linked with time, and it discovers correlations between different transactions.
Classification: Classification is the process of predicting the class of objects which do not have a class
label. It finds the data classes and concepts, based on the analysis of sets of training data. Classification
consists of predicting a certain outcome based on a given input. In order to predict the outcome, the
algorithm processes a training set containing a set of attributes and the respective outcome, usually called
the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that
make it possible to predict the outcome.
Clustering: Clustering is a method of grouping similar and related objects together. It is used to classify
unstructured and semi-structured datasets and is an unsupervised learning technique. Clustering
algorithms are broadly classified into hierarchical, partition and density based clustering. The K-Means
clustering algorithm is one of the techniques that provides structure to unstructured data. K-Means is a
traditional and widely used partitional clustering algorithm. It has linear asymptotic running time with
respect to each variable of the problem [6].
II. LITERATURE SURVEY
The traditional data is in structured format so
it can be easily stored and queried. However
nowadays data is not in typical structured format. The
data uploaded on Facebook, Twitter, and YouTube
etc. is completely unstructured. It is in format of
images, video, audio, documents which are
unstructured [3]. Traditional database cannot handle
this unstructured data. The data generated in huge
amount, with high velocity, and variety. Therefore it
termed as Big data. There are different data mining
technique‟s which are used to extract useful
information. Linguistic matching techniques are used
to give structure by identifying similarities between
elements [2]. It can be applied to different data
sources to avoid overlap.
Ashwin Kumar et al. [2], "Efficient structuring of data in Big Data", discusses the structuring of big data
using linguistic matching. The data is then clustered using the Markov clustering algorithm. It uses the
techniques of string matching, tokenization matching, and stemming to find similarity in data items.
Establishing the relationships, or logical mapping, among elements from different data sources is schema
matching. Linguistic matching uses the element name or description to find matches. Graph matching
contains two algorithms: fixed point computation on similarity and a probabilistic constraint satisfaction
algorithm. In constraint-based matching, the properties of elements such as uniqueness, data types, value
ranges, etc. are considered. The paper proposes the Markov clustering algorithm to cluster data from
different data sources; the data is analyzed for similarity between the clusters.
Frequent itemset mining is a very essential technique to extract useful information from big data. There
are many algorithms which describe different methods to produce frequent itemsets, but the existing
algorithms cannot efficiently deal with big data [1]. The Hadoop ecosystem is the solution for big data. It
has two basic components, MapReduce and HDFS. MapReduce has Map, Combine and Reduce functions
which use (key, value) pairs. The distances between sample points and random centers are calculated for
all points using the map function. Intermediate output values from the map function are combined using
the combiner function. All samples are assigned to the closest cluster using the reduce function. The
K-Means clustering algorithm extracts frequent itemsets very fast and reduces execution time.
Prajesh Anchalia, Anjan K Koundinya and Srinath N K [8], "MapReduce Design of K-Means Clustering
Algorithm", describes the K-Means algorithm for a distributed environment using Hadoop. A mapper and
reducer are designed to implement the K-Means algorithm. The proposed method provides a system
which groups data with similar characteristics together, reducing implementation cost. The method can
be improved to handle outliers.
Madjid Khalilian, Norwati Mustapha, et al. [7], "A Novel K-Means Based Clustering Algorithm for High
Dimensional Data Sets", proposed a divide and conquer technique which improves the performance of the
K-Means algorithm for high dimensional datasets. Experimental results demonstrate appropriate accuracy
and speed up.
The objective of the proposed framework is to combine a relational definition of the clustering space
with a divide and conquer method, which has been described to overcome these difficulties. Based on the
proposed method, data stream processing can be improved. Data type is another direction in which to
examine the proposed method. K-Means has been used for the second phase, whereas it can be improved
with other clustering algorithms.
Sisirkumar and Anjan Amehanta [4], "An alternative technique of selecting the initial cluster centers in
the K-Means algorithm for better clustering": the problem of using randomly selected initial centroids is
overcome with the Furthest-First technique combined with K-Means. It produces better quality clustering
by lowering the sum of squared errors.
Mugdha Jain and Chakradhar Verma [6], "Adapting K-Means for clustering in Big data", describe an
approximate algorithm based on K-Means. It is a novel method for big data analysis which is very fast
and scalable and has high accuracy. It overcomes the K-Means drawback of an uncertain number of
iterations by fixing the number of iterations, without losing precision.
R. C. Saritha and Dr. M. Usha Rani [15], "Map Reduce Text Clustering Using Vector Space Model",
describes a map reduce approach for clustering documents using the vector space model. The proposed
approach remains efficient as the text corpus grows.
III. PROPOSED SYSTEM
Proposed model provides approach for
structuring and frequent itemset mining for big data.
There are different methods of frequent itemset
mining.Association, correlation analysis, Sequential,
Structural pattern, Classification, Clustering,
Regression analysis etc. Here we are using clustering
method for frequent itemset mining. Hadoop
Environment is used for structure the data by the help
of MapReduce framework. KMP string matching
algorithm and K-Means clustering algorithm together
can give better recommendation for user query.The
proposed model is given below figure 3.1
Figure 2: Proposed Architecture
Stages in the system
The proposed system consists of the following stages.
A. Data collection
B. Hadoop Environment
C. Data Pre-processing
D. Analyzing context: Linguistic matching (KMP string matching algorithm)
E. Recommendation: Frequent Itemset Mining (K-Means Clustering)
A. Dataset
Data is collected from social media. The dataset is the IMDB movie dataset, in .txt file format. It
contains 1682 movie records with 24 attributes in total, such as Movie_id, Movie_name,
Movie_release_year, Movie_release_date and Movie_URL; the other 19 are genre variables which
represent the presence of a genre as 1 and its absence as 0. Movies in the IMDB dataset are categorized
into genres as follows.
Table 1: List of Genre
Example
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movie_id is 1.
Movie_name is Toy Story.
Movie_release date is 01-Jan-1995.
The genre flags |0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0| provide the genre information: the movie "Toy
Story" is categorized as Animation, Comedy, and Children's.
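The record layout above (five header fields followed by the 19 genre flags) can be parsed with a short sketch like the following; the field order and the genre names are assumed from the example and Table 1:

```python
# Parse one pipe-delimited IMDB-style movie record into named fields.
# Genre names follow Table 1; the field layout is assumed from the example.
GENRES = ["Unknown", "Action", "Adventure", "Animation", "Children",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
          "Film Noir", "Horror", "Musical", "Mystery", "Romance",
          "Sci-Fi", "Thriller", "War", "Western"]

def parse_movie(line):
    fields = line.strip().split("|")
    movie_id, name, release_date, _video_date, url = fields[:5]
    flags = [int(f) for f in fields[5:5 + len(GENRES)]]
    genres = [g for g, flag in zip(GENRES, flags) if flag == 1]
    return {"id": int(movie_id), "name": name,
            "release_date": release_date, "url": url, "genres": genres}

record = ("1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-"
          "exact?Toy%20Story%20(1995)"
          "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")
print(parse_movie(record)["genres"])  # ['Animation', 'Children', 'Comedy']
```

The fourth field (video release date) is empty in this record, which is why two consecutive "|" separators appear before the URL.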
B. Hadoop Environment
Hadoop is an open-source software framework for distributed storage and distributed processing of very
large data sets on computer clusters built from commodity hardware. The core of Apache Hadoop
consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part
called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster
[13].
1. Unknown    6. Comedy       11. Film Noir  16. Sci-Fi
2. Action     7. Crime        12. Horror     17. Thriller
3. Adventure  8. Documentary  13. Musical    18. War
4. Animation  9. Drama        14. Mystery    19. Western
5. Children   10. Fantasy     15. Romance
HDFS is the primary distributed storage used by Hadoop applications. Data is loaded into the HDFS
system. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata,
and DataNodes, which store the actual data.
C. Data Pre-processing
Data is further structured by using the MapReduce framework. The input text file is tokenized into word
tokens. After tokenizing the unstructured data, we create two modules which are interfaced with Hadoop.
The first module processes the data for structuring by using the KMP algorithm. The second module
checks for frequent itemsets in the dataset. The output from these two modules is combined [2].
Read the input string from the text file and tokenize it.
Example
1413 | Street Fighter (1994) | 01-Jan-1994
1408 | Gordy (1995) | 01-Jan-1995
1406 | When Night Is Falling (1995) | 01-Jan-1995
The above lines can be tokenized as follows:
1413, Street Fighter, (1994) ,01-Jan-1994
1408, Gordy, (1995) ,01-Jan-1995
1406, When, Night, Is, Falling,(1995) ,01-Jan-1995
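The tokenization step shown above can be sketched as follows; the split rule (on "|" separators and whitespace) is assumed from the example:

```python
import re

# Tokenize a pipe-delimited movie line into word tokens, as in the
# example above: split on '|' and whitespace, keeping non-empty tokens.
def tokenize(line):
    return [tok for tok in re.split(r"[|\s]+", line.strip()) if tok]

line = "1406 | When Night Is Falling (1995) | 01-Jan-1995"
print(tokenize(line))
# ['1406', 'When', 'Night', 'Is', 'Falling', '(1995)', '01-Jan-1995']
```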
MapReduce Framework
The MapReduce framework, proposed by Google, is a processing and execution model for distributed
environments that runs on large clusters [1]. MapReduce with the Hadoop distributed file system is used
to find frequent itemsets from big data on large clusters. The MapReduce framework has two phases
[1] [8].
Map phase: The Map phase takes data from HDFS and generates (Key, Value) pairs. The map
function collects all intermediate key-value pairs and passes them to the reduce function.
Reduce phase: The Reduce function takes these intermediate values through an iterator and
merges them to produce (Key′, Value′) pairs. Through the MapReduce framework, value lists
too large to fit in memory can still be processed.
Figure 3: MapReduce Framework [1]
Example
Map 1
1408, 1
Gordy, 1
(1995) , 1
01-Jan-1995 1
Map 2
1406, 1
When, 1
Night, 1
Is, 1
Falling , 1
(1995) , 1
01-Jan-1995 1
Reduce :: (key′, list(value′)) → (key′′, value′′)
Reducer
1408, 1
Gordy, 1
(1995) 2
01-Jan-1995 2
1406, 1
When, 1
Night, 1
Is, 1
Falling , 1
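The Map and Reduce phases illustrated above can be sketched as a small in-memory simulation (plain Python standing in for Hadoop, not actual MapReduce API code):

```python
from itertools import groupby
from operator import itemgetter

# In-memory sketch of the token-count Map/Reduce example above.
def map_phase(line):
    # Emit an intermediate (token, 1) pair for every token in the line.
    return [(tok, 1) for tok in line.replace("|", " ").split()]

def reduce_phase(pairs):
    # Group the intermediate pairs by key (the shuffle/sort step)
    # and sum the values for each key.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["1408 | Gordy (1995) | 01-Jan-1995",
         "1406 | When Night Is Falling (1995) | 01-Jan-1995"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(intermediate)
print(counts["(1995)"], counts["01-Jan-1995"])  # 2 2
```

As in the worked example, the tokens shared by both records, "(1995)" and "01-Jan-1995", reduce to a count of 2, while every other token keeps a count of 1.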
D. Analyzing context: Linguistic matching (KMP string matching algorithm)
To give recommendations to the user on the basis of a user query, a string matching algorithm is used.
We use the Knuth–Morris–Pratt algorithm to find the similarity of data items contextually.
The Knuth–Morris–Pratt algorithm searches for a matching pattern in the dataset. It takes the pattern
entered by the user, first checks for sub-pattern matches within the pattern itself, and then matches it
against the dataset. In this way the KMP algorithm avoids unnecessary searching and avoids
backtracking.
The worst-case complexity of the Knuth–Morris–Pratt algorithm is O(n + m), linear in the lengths of the
string and the pattern.
For the proposed model, the Knuth–Morris–Pratt string matching algorithm produces recommendations
by matching movie names. If a user searches for the movie "Star War", then the recommendations should
be movies such as "Star War II".
The Knuth–Morris–Pratt string matching algorithm:
Detects string similarities between elements from different data sources.
Gives better recommendations to the user.
Finds the similarity of data items.
Components of KMP
The prefix function, Π
The prefix function matches the pattern against shifts of itself. It avoids useless shifts of the pattern 'p'
and avoids backtracking on the string 'S'.
5. Roshani Pardeshi.et.al Int. Journal of Engineering Research and Application www.ijera.com
ISSN : 2248-9622, Vol. 6, Issue 9(Part-5), September.2016, pp.-18-25
www.ijera.com 22|P a g e
Compute-Prefix-Function(p)
1  m ← length[p]            // 'p' is the pattern to be matched
2  Π[1] ← 0
3  k ← 0
4  for q ← 2 to m
5      do while k > 0 and p[k+1] ≠ p[q]
6          do k ← Π[k]
7      if p[k+1] = p[q]
8          then k ← k + 1
9      Π[q] ← k
10 return Π
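The prefix-function pseudocode above can be rendered as runnable Python using 0-based indexing (a sketch for illustration; the function name is ours, not the paper's):

```python
def compute_prefix_function(p):
    # pi[q] = length of the longest proper prefix of p[:q+1]
    # that is also a suffix of it (0-based version of Π above).
    m = len(p)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]        # fall back to the next shorter border
        if p[k] == p[q]:
            k += 1               # extend the current matched prefix
        pi[q] = k
    return pi

print(compute_prefix_function("velvet"))  # [0, 0, 0, 1, 2, 0]
```

On the paper's example pattern "velvet" this reproduces the table Π = 0 0 0 1 2 0.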
The KMP Matcher
With the string 'S', the pattern 'p', and the prefix function 'Π' as inputs, the matcher finds the occurrences of 'p' in 'S'.
KMP-Matcher(S, p)
1  n ← length[S]
2  m ← length[p]
3  Π ← Compute-Prefix-Function(p)
4  q ← 0                     // number of characters matched
5  for i ← 1 to n            // scan S from left to right
6      do while q > 0 and p[q+1] ≠ S[i]
7          do q ← Π[q]       // next character does not match
8      if p[q+1] = S[i]
9          then q ← q + 1    // next character matches
10     if q = m              // is all of p matched?
11         then print "Pattern occurs with shift" i − m
12         q ← Π[q]          // look for the next match
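The matcher can likewise be sketched in 0-based Python; the prefix computation is inlined so the snippet is self-contained (illustrative names, the text string is a made-up example):

```python
def kmp_matcher(s, p):
    # Return every 0-based shift at which pattern p occurs in string s.
    m, n = len(p), len(s)
    # Prefix function of p (same logic as Compute-Prefix-Function).
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    shifts = []
    q = 0                          # number of characters matched
    for i in range(n):             # scan s from left to right
        while q > 0 and p[q] != s[i]:
            q = pi[q - 1]          # next character does not match
        if p[q] == s[i]:
            q += 1                 # next character matches
        if q == m:                 # all of p matched
            shifts.append(i - m + 1)
            q = pi[q - 1]          # look for the next match
    return shifts

print(kmp_matcher("blue velvet, velvet goldmine", "velvet"))  # [5, 13]
```

Because mismatches fall back through the prefix function instead of restarting, no character of `s` is ever re-scanned, giving the linear running time stated above.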
Example
Input pattern: velvet
Computing the prefix function:
q     1 2 3 4 5 6
p[q]  v e l v e t
Π[q]  0 0 0 1 2 0
Output: Blue Velvet (1986)
Velvet Goldmine (1998)
Terminal Velocity (1994)
E. Recommendation: Frequent Itemset Mining (K-Means Clustering)
A cluster is a collection of data items having similar characteristics. Unstructured data needs to be analyzed to derive meaningful information, and K-Means clustering is a technique that provides structure to unstructured data [8]. Structuring the data alone is not sufficient for Big Data, as the volume of data is too large; extracting meaningful information from it is therefore very important [2]. Various data mining techniques are available for Big Data [1]; clustering is the technique we use here to mine frequent itemsets.
Frequent itemset mining uses the K-Means algorithm, which finds similarity between data items that share some characteristics [2]. K-Means is one of the most common, simple, and well-known partition clustering algorithms [8][6].
K-Means Clustering
K-Means is a common and well-known clustering algorithm. In our model, K-Means runs as a MapReduce routine. The input is given as <Key, Value> pairs, where the key is the centroid of a cluster and the value is a data object. Basically, K-Means clustering works as follows.
K-Means Clustering Algorithm
1. Take k random values as cluster centers.
2. Let each item belong to the cluster whose center it is closest to.
3. For each of the k clusters, find a new cluster center by taking the mean of its items.
4. Repeat steps 2 and 3 until the cluster centers are stable.
In Hadoop, K-Means runs in a mapper and a reducer. The centroid file and the data items are the input to the mapper. The distance between a centroid and a data object is evaluated using the Euclidean distance measure.
Distance(i, j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + (x_i3 − x_j3)² + …]    ….(1)
The data object with the minimum distance goes into the cluster of that centroid. Each remaining object is likewise assigned to a centroid by its distance similarity score. A new mean is then calculated for each cluster. These steps are iterated, forming new clusters, until no change occurs in the cluster centroids.
For 'n' objects to be assigned to 'k' clusters, the algorithm has to perform a total of 'nk' distance computations per iteration. While the distance calculation between any object and a cluster center can be performed in parallel, the iterations must run serially, as the center changes have to be recomputed each time.
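Equation (1) can be written as a small helper. As an illustrative sketch (not the paper's implementation), the genre vectors below are the Toy Story and Batman Forever vectors used in the worked example later in this section:

```python
import math

def euclidean(x, y):
    # Equation (1): straight-line distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

toy_story      = (0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
batman_forever = (0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# The vectors differ in five positions, so the distance is sqrt(5).
print(round(euclidean(toy_story, batman_forever), 4))  # 2.2361 (the paper truncates to 2.2360)
```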
Algorithm for K-Means Mapper
Input: A set of objects X = {x1, x2, …, xn},
A set of initial centroids C = {c1, c2, …, ck}
Output: An output list containing pairs (Ci, xj), where 1 ≤ i ≤ k and 1 ≤ j ≤ n
Steps:
1. M1 ← {x1, x2, …, xm}
2. current_centroids ← C
3. distance(p, q) = √( Σ_i (p_i − q_i)² )
   (where p_i and q_i are the coordinates of p and q in dimension i)
for all x_i ∈ M1 such that 1 ≤ i ≤ m do
4.     bestCentroid ← null
       minDist ← ∞
5.     for all c ∈ current_centroids do
           dist ← distance(x_i, c)
           if (bestCentroid = null || dist < minDist) then
               minDist ← dist
               bestCentroid ← c
           end if
       end for
6.     emit(bestCentroid, x_i)
       i += 1
end for
7. return OutputList
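The mapper steps can be sketched as a plain in-memory function (illustrative Python, not the Hadoop mapper API; the toy 2-D vectors in the usage line are hypothetical):

```python
import math

def kmeans_mapper(objects, centroids):
    # Emit a (bestCentroid, object) pair for every object, as in the
    # mapper pseudocode: scan all centroids and keep the closest one.
    output = []
    for x in objects:
        best_centroid, min_dist = None, math.inf
        for c in centroids:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
            if best_centroid is None or dist < min_dist:
                min_dist, best_centroid = dist, c
        output.append((best_centroid, x))
    return output

pairs = kmeans_mapper([(0, 1), (4, 5), (0, 2)], [(0, 0), (5, 5)])
print(pairs)  # [((0, 0), (0, 1)), ((5, 5), (4, 5)), ((0, 0), (0, 2))]
```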
Algorithm for Reducer
Input: (Key, Value), where Key = bestCentroid and Value = the objects assigned to that centroid by the mapper
Output: (Key, Value), where Key = oldCentroid and Value = newCentroid, the new centroid value calculated for that bestCentroid
Procedure
1. outputlist ← output list from the mappers
   ϑ ← { }
2. newCentroidList ← null
3. for all β ∈ outputlist do
       centroid ← β.key
       object ← β.value
       ϑ[centroid] ← ϑ[centroid] ∪ {object}
   end for
4. for all centroid ∈ ϑ do
       newCentroid ← null
       sumOfObjects ← 0
       numOfObjects ← 0
       for all object ∈ ϑ[centroid] do
           sumOfObjects += object
           numOfObjects += 1
       end for
5.     newCentroid ← sumOfObjects / numOfObjects
       emit(centroid, newCentroid)
   end for
end
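The reducer, which groups the mapper's pairs by centroid and emits the component-wise mean of each group as the new centroid, can be sketched as follows (illustrative Python; the grouping dictionary stands in for Hadoop's shuffle, and the sample pairs are hypothetical):

```python
from collections import defaultdict

def kmeans_reducer(mapper_output):
    # Group objects by their centroid key (the shuffle step), then emit
    # (oldCentroid, newCentroid) where newCentroid is the mean of the group.
    groups = defaultdict(list)
    for centroid, obj in mapper_output:
        groups[centroid].append(obj)
    new_centroids = {}
    for centroid, objs in groups.items():
        n = len(objs)
        # sumOfObjects / numOfObjects, computed per dimension.
        new_centroids[centroid] = tuple(sum(dim) / n for dim in zip(*objs))
    return new_centroids

out = kmeans_reducer([((0, 0), (0, 1)), ((0, 0), (0, 3)), ((5, 5), (4, 5))])
print(out[(0, 0)])  # (0.0, 2.0): the mean of (0, 1) and (0, 3)
```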
For the K-Means clustering algorithm, the input is given as (Key, Value) pairs. If the number of clusters to be formed is four, then four clusters are created in the first step: the algorithm randomly selects four movies from the input and places each in a separate cluster, and the remaining movies are assigned to clusters on the basis of similarity. The centroid of each cluster is then calculated, and the dataset is redistributed on the basis of similarity to the centroids. This process is performed recursively until no change occurs.
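The full loop just described — random initial centroids, assignment by similarity, centroid recomputation, repeat until stable — can be sketched end-to-end (illustrative Python on hypothetical 2-D points, not the Hadoop implementation):

```python
import math
import random

def kmeans(objects, k, max_iter=100, seed=0):
    # 1. Take k random objects as the initial cluster centers.
    rng = random.Random(seed)
    centroids = rng.sample(objects, k)
    for _ in range(max_iter):
        # 2. Assign each object to its closest center.
        clusters = {c: [] for c in centroids}
        for x in objects:
            best = min(centroids, key=lambda c: math.dist(x, c))
            clusters[best].append(x)
        # 3. Recompute each center as the mean of its cluster.
        new = [tuple(sum(d) / len(objs) for d in zip(*objs)) if objs else c
               for c, objs in clusters.items()]
        # 4. Stop when the centers no longer change.
        if set(new) == set(centroids):
            break
        centroids = new
    return centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(kmeans(points, 2)))  # [(0.0, 0.5), (10.0, 10.5)]
```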
Example
K-Means Mapper
Input: A set of objects X = {x1, x2, …, xn},
A set of initial centroids C = {c1, c2, …, ck}
Let the centroid be Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
The distance between the centroid and each data item is evaluated by the Euclidean distance as follows.
Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
As given in equation (1), the distance is calculated:
Distance = √((0−1)² + (0−1)² + (1−0)² + (1−0)² + (0−1)²) = 2.2360
Home Alone: (0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
Distance = √((0−0)² + … + (1−0)² + (1−1)² + (1−1)² + …) = 1
True Romance: (0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0)
Distance = √((0−1)² + (1−0)² + (1−0)² + (1−0)² + (0−1)² + (0−1)²) = 2.4494
The resulting distances of the movies from the centroid are as follows:
Movie             Distance
Batman Forever    2.2360
Home Alone        1
True Romance      2.4494
Table 2: Distance Measure
The distance of the movie Home Alone from the centroid is the minimum; therefore Home Alone will be placed in the cluster with centroid Toy Story.
bestCentroid for Home Alone is Toy Story, with minDist = 1.
This is iterated for all centroids and all objects, and each movie is placed in the appropriate cluster according to the minimum distance.
The output will be (bestCentroid, object) pairs.
K-Means Reducer
Input: (Key, Value), where Key = bestCentroid and Value = the objects assigned to that centroid by the mapper
Output: (Key, Value), where Key = oldCentroid and Value = newCentroid, the new centroid value calculated for that bestCentroid
Recommendation:
If the user searches for movies with genres such as animation and comedy, the recommendation output should be the top five movies matching those genres.
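A genre-based top-N recommendation of this kind can be sketched as follows (an illustrative sketch; the genre-vector layout and the three-movie catalogue are hypothetical placeholders, not the paper's dataset):

```python
import math

def recommend(query_genres, movies, top_n=5):
    # Rank movies by Euclidean distance between genre vectors, closest first.
    ranked = sorted(movies.items(),
                    key=lambda item: math.dist(query_genres, item[1]))
    return [title for title, _ in ranked[:top_n]]

# Hypothetical genre order: (action, crime, animation, comedy)
movies = {
    "Toy Story":      (0, 0, 1, 1),
    "Batman Forever": (1, 1, 0, 0),
    "Home Alone":     (0, 0, 0, 1),
}
print(recommend((0, 0, 1, 1), movies, top_n=2))  # ['Toy Story', 'Home Alone']
```

A query for animation plus comedy ranks Toy Story first (distance 0), then Home Alone (distance 1), leaving Batman Forever last.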
IV. CONCLUSION AND FUTURE WORK
Conclusion
The proposed model handles data context analysis and frequent itemset mining to generate an efficient structure for unstructured data in Big Data. The proposed method can minimize the difficulties faced by users of Big Data, who would otherwise need to invest a lot of time in studying the unstructured data. By using the Hadoop MapReduce framework, the proposed method reduces the time a user requires to investigate unstructured data. The proposed approach is able to structure the data, which makes it more usable, and the recommendation system gives appropriate recommendations to the user using the K-Means algorithm. The proposed model can find matching movie sets using the KMP string matching algorithm. The K-Means clustering algorithm on the MapReduce framework is used for clustering the data, which helps the user get better recommendations.
Future Work
As future work, our approach can be expanded to meet other demands of Big Data, such as velocity. Various other attributes of movies can be used to produce more accurate recommendations. The frequent itemset mining algorithm and the MapReduce framework can also be applied to streams of data, which would provide real-time insight into Big Data.
REFERENCES
[1]. Sheela Gole and Bharat Tidke, "Frequent itemset mining for Big Data in social media using ClustBigFIM algorithm", IEEE International Conference on Pervasive Computing (ICPC), 2015.
[2]. Ashwin Kumar T K, Hong Liu, and Johnson P Thomas, "Efficient structuring of data in Big Data", IEEE International Conference on Data Science and Engineering (ICDSE), 2014.
[3]. Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial", IEEE Access, Volume 2, 2014.
[4]. Sisir Kumar Rajbongshi and Anjana Kakoti Mahanta, "An Alternative Technique of Selecting the Initial Cluster Centers in the K-Means Algorithm for Better Clustering", International Journal of Computer Applications, April 2013.
[5]. Xingjian Li, "An Algorithm for Mining Frequent Itemsets from Library Big Data", Journal of Software, Vol. 9, No. 9, September 2014.
[6]. Mugdha Jain and Chakradhar Verma, "Adapting K-Means for Clustering in Big Data", International Journal of Computer Applications (0975-8887), Volume 101, No. 1, September 2014.
[7]. Dhamdhere Jyoti L. and Prof. Deshpande Kiran B., "A Novel Methodology of Frequent Itemset Mining on Hadoop", International Journal of Emerging Technology and Advanced Engineering, July 2014.
[8]. Prajesh P Anchalia, Anjan K Koundinya, and Srinath N K, "MapReduce Design of K-Means Clustering Algorithm", IEEE, 2013.
[9]. Rong Hu, Wanchun Dou, and Jianxun Liu, "ClubCF: A Clustering-based Collaborative Filtering Approach for Big Data Application", IEEE, September 2014.
[10]. Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Zakira Inayat, Waleed Kamaleldin Mahmoud Ali, Muhammad Alam, Muhammad Shiraz, and Abdullah Gani, "Big Data: Survey, Technologies, Opportunities, and Challenges".
[11]. Victoria López, Sara del Río, José Manuel Benítez, and Francisco Herrera, "On the use of MapReduce to build Linguistic Fuzzy Rule Based Classification Systems for Big Data", IEEE International Conference on Fuzzy Systems, 2014.
[12]. Ms. Vibhavari Chavan and Prof. Rajesh N. Phursule, "Survey Paper on Big Data", International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014.
[13]. Hadoop tutorial [Online]. Available at: http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm (last accessed 20 August 2016 at 9 PM).
[14]. Hadoop command guide [Online]. Available at: http://hadoop.apache.org/docs/current/hadoopprojectdist/hadoop-common/CommandsManual.html (last accessed 21 August 2016 at 11 AM).
[15]. R.C. Saritha and Dr. M. Usha Rani, "MapReduce Text Clustering Using Vector Space Model", International Journal of Emerging Technology and Advanced Engineering, Volume 4, Issue 9, September 2014.
[16]. Kyung-Rog Kim, Ju-Ho Lee, and Jae-Hee Byeon, "Recommender System Using the Movie Genre Similarity in Mobile Service", IEEE, 2010.