Copyright © 2018 IJECCE, All right reserved
42
International Journal of Electronics Communication and Computer Engineering
Volume 9, Issue 1, ISSN (Online): 2249–071X
Overcoming Big Data Mining Challenges for
Revolutionary Breakthroughs in Commerce and
Industry
Dr. Anthony Otuonye I.¹, Dr. Nwokonkwo Obi C.¹, Dr. Eke Felix M.²
¹Department of Information Management Technology, Federal University of Technology Owerri, Nigeria.
Email: anthony.otuonye@futo.edu.ng, ifeanyiotuonye@yahoo.com, +234(0)8060412674
²The Library, Federal University of Technology Owerri, Nigeria.
Abstract – Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize, and process data for profitable business ventures. Data mining is a major component of big data technology, and its whole essence is to analytically explore data in search of consistent patterns and to further validate the findings by applying the detected patterns to new data sets. Big data concerns large-volume, complex, growing data sets with multiple, autonomous sources. This paper explores the major challenges of big data mining, some of which result from the intrinsically distributed and complex environment, and some from the large, unstructured, and dynamic datasets available for mining. We find that the gradual shift towards distributed, complex problem-solving environments is now prompting a range of new data mining research and development problems. We also proffer solutions to the new challenges of big data mining by proposing the HACE theorem as a way to fully harness the potential benefits of the big data revolution and to trigger a revolutionary breakthrough in commerce and industry. The research also proposes a three-tier data mining structure for big data that provides accurate and relevant social sensing feedback for a better understanding of our society in real time. Based on our observations, we recommend a re-visitation of most of the data mining techniques in use today and a deployment of distributed versions of the various data mining models available in order to meet the new challenges of big data. Developers should take advantage of available big data technologies with affordable, open-source, and easy-to-deploy platforms.
Keywords – Big data, Data mining, Autonomous sources, Distributed database, HACE theorem.
I. INTRODUCTION
1.1 ‘Big Data’ Concept, Meaning and Origin
The first academic research paper to carry the words 'Big Data' in its title was written by Diebold in 2000, but the term 'Big Data' appeared for the first time in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled "Big Data and the Next Wave of InfraStress". However, the first book to mention the term 'Big Data' was a data mining book written by Weiss and Indurkhya in 1998, which goes a long way to demonstrate that data mining was relevant to big data from the very beginning.
From then onwards, many other scholars and academic researchers began to develop an interest in the concept. For instance, at the KDD "BigMine '12" Workshop in 2012, scholars presented striking statistics about internet usage: each day, Google receives more than 1 billion queries, Twitter handles more than 250 million tweets, Facebook records more than 800 million updates, and YouTube serves more than 4 billion views. The origin of the term 'Big Data' stems from the simple fact that the business world and the entire society create huge amounts of data on a daily basis.
A recent study estimated that every minute, Google
receives over 2 million queries, e-mail users send over 200
million messages, YouTube users upload 48 hours of video,
Facebook users share over 680,000 pieces of content, and
Twitter users generate 100,000 tweets. In addition, media sharing sites, stock trading sites, and news sources continually pile up more new data throughout the day.
Big Data simply describes the availability and exponential growth of data, which might be structured, semi-structured, or unstructured in nature. It consists of billions or trillions of records, which might amount to terabytes, petabytes (1,024 terabytes), or exabytes (1,024 petabytes).
The amount of data produced nowadays in our society can
only be estimated in the order of zettabytes, and it is
estimated that this quantity grows at the rate of about 40
percent every year.
More and more data are going to be generated from mobile devices and by big software companies such as Google, Apple, Facebook, and Yahoo, which now consider finding useful patterns in such data essential for the steady improvement of user experience.
Many researchers have defined big data in several ways.
According to [6], big data is defined as the large and ever-
growing and disparate volumes of data which are being
created by people, tools and machines. From a number of
sources including social media, internet-enabled devices
(such as smart phones and tablets), machine data, video and
voice recordings, etc, both structured and unstructured
information are generated on a daily basis. Research has shown that our society today generates more data in 10 minutes than all of humanity had created through to the year 2003.
The Big Data revolution is on, and corporate organizations around the world can leverage its power and potential for breakthroughs in their commercial activities. With proper use of big data, there is bound to be a global commercial revitalization and business breakthroughs for most indigenous corporations. It has been
proved that big data can provide business owners and corporate managers with innovative technologies to collect and analytically process vast data to derive real-time business insights relating to such market forces as consumers, risk, profit, performance, productivity management, and enhanced shareholder value.
Data mining remains a major aspect of big data analysis.
Though data mining is a very powerful tool in computing,
it faces various challenges many times during its
implementation. Some of the challenges relate to performance, data, and the methods and techniques used. Other data mining challenges include the recognition of poor-quality data, dirty data, missing values, inadequate data size, and poor representation in data sampling, all of which can be a great hindrance to the big data revolution.
Mining big data has opened many new challenges and
opportunities. Even though big data bears greater value
(making hidden knowledge available and more valuable
insights), there are tremendous challenges in extracting the
hidden knowledge and insights from big data since the
established process of knowledge discovery and data mining from conventional datasets was not designed for, and will not work well with, big data.
Data mining processes become successful only when
these challenges are identified and successfully overcome
to adequately harness the big data potentials.
1.2 Aim of the Study
In this paper, we shall expose some of the data mining challenges faced by big data processing systems, as well as some technology and application challenges of big data systems that can hinder big data diffusion, especially in developing countries where its potential is not yet being utilized. The challenges of big data are broad in terms of data access, storage, searching, sharing, and transfer. Recommendations shall also be made in this research paper to proffer solutions to the mining, technology, and application software challenges, so as to trigger sustainable breakthroughs in commerce and industry via the big data revolution.
II. RELEVANT CONCEPTS IN BIG DATA
TECHNOLOGY
We begin with a review of some relevant concepts,
including: big data, types of big data and sources, the ‘V’s
of big data phenomenon, data mining, and big data mining.
2.1 Big Data
Big Data is a general term used to describe any large
collection of data sets, so large and from such diverse
sources that it becomes difficult to process them using
conventional data processing applications. The challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy violations. We are currently living
in the era of big data and cloud computing, with diverse new
challenges and opportunities. Some large business organizations are beginning to handle data at petabyte scale. Recent advances in technology have equipped
individuals and corporations with the ability to easily and
quickly generate tremendous streams of data at any time
and from anywhere using their digital devices; remote
sensors have been ubiquitously installed and utilized to
produce continuous streams of digital data. Massive
amounts of heterogeneous, dynamic, semi-structured and
unstructured data are now being generated from different
sources and applications such as mobile-banking
transactions, online user-generated contents (such as
tweets, blog posts, and videos), online-search log records,
emails, sensor networks, satellite images and others [5].
Both government agencies and corporate business organizations are beginning to leverage the power of big data technology to improve their administrative and business operations. For example, governments now mine the contents of social media networks, blogs, online transactions, and other sources of information to identify the need for government facilities, to recognize suspicious organizational groups, and to predict future events such as threats or promises.
Service providers are beginning to track their customers' purchases online, in-store, and by phone, including customers' behaviors through recorded streams of online clicks, as well as product reviews and rankings, in order to improve their marketing efforts, predict new growth points of profit, and increase customer satisfaction.
2.2 Types of Big Data and Sources
Big data is usually generated from social media sites, streaming data from IT systems and connected devices, and government and open data sources. It comes in two basic types: structured and unstructured data. Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations. Unstructured data is essentially the opposite, and its lack of structure usually makes compilation a time- and energy-consuming task.
Structured data: These are data in the form of words and
numbers that are easily categorized in tabular format for
easy storage and retrieval using relational database systems.
Such data are usually generated from sources such as global
positioning system (GPS) devices, smart phones and
network sensors embedded in electronic devices.
Spreadsheets can be considered structured data, which can be quickly scanned for information because everything is properly arranged in rows and columns, much like a relational database table.
Unstructured data: These are data items that are not easily analyzed using traditional systems, because they cannot be maintained in a tabular format. Unstructured data contains more complex information such as photos and other multimedia content, consumer comments on products and services, customer reviews on commercial websites, and so on. According to [9], unstructured data is sometimes not easily readable.
Email is an example of unstructured data: while the busy inbox of a corporate human resources manager might be arranged by date, time, or size, arranging it by exact subject and content is almost impractical, since people do not generally write about only one subject, even in focused emails.
2.3 The ‘Vs’ of Big Data Phenomenon
The concept of big data began to gather momentum from 2003, when the five 'Vs' of big data were proposed to give foundational structure to the phenomenon in its current form and definition. According to the foundational definition in [6], the big data concept has the following five mainstreams: volume, velocity, value, variety, and veracity.
Volume: Organizations gather their information from
different sources including social media, cell phones,
machine-to-machine (M2M) sensors, credit cards, business
transactions, photographs, video recordings, and so on. A vast amount of data is generated each second from these channels, and the volumes have become so large that storing and analyzing them constitutes a real problem, especially with traditional database technology. According to [6], Facebook alone generates about 12 billion messages a day, and over 400 million new pictures are uploaded every day. People's comments on issues of social importance alone run into millions, and the collection and analysis of such information has become an engineering challenge.
Velocity: By velocity, we refer to the speed at which new data is being generated from various sources such as e-mails, Twitter messages, video clips, and social media updates. Such information now comes in torrents from all
over the world on a daily basis. The streaming data need to
be processed and analyzed at the same speed and in a timely
manner for it to be of value to business organizations and
the general society. Results of data analysis should equally
be transmitted instantaneously to various users. Credit card
transactions, for instance, need to be checked in seconds for
fraudulent activities. Trading systems also need to analyze
social media networks in seconds to obtain data for proper
decisions to buy or sell shares, and so on.
Variety: Variety refers to different types of data and the
varied formats in which data are presented. Using the
traditional database systems, information is stored as
structured data mostly in numeric data format. But in
today’s society, we receive information mostly as
unstructured text documents, email, video, audio, financial
transactions, and so on. Society no longer makes use of only structured data arranged in columns of names, phone numbers, and addresses that fit nicely into relational database tables. Recent research efforts have shown that more than 80% of today's data is unstructured.
Value: Data is only useful if it is turned into value. By
value, we refer to the worth of the data being extracted.
Business owners should not only embark on data gathering and analysis, but also understand the costs and benefits of collecting and analyzing such data.
derived from such information should exceed the cost of
data gathering and analysis for it to be taken as valuable.
Veracity: Veracity refers to the trustworthiness of the
data. That is, how accurate is the data that have been
gathered from the various sources? Big data analysis tries to verify the reliability and authenticity of data, for example by handling abbreviations and typos in Twitter posts and other internet content. It can make comparisons that bring out
the correct and qualitative data sets. Big data technology
also adopts new approaches that link, match, cleanse and
transform data sets coming from various systems.
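The cleanse-and-transform step can be illustrated with a small sketch. The abbreviation table and sample text below are hypothetical, not any particular tool's API:

```python
# Minimal sketch of a "cleanse and transform" step for veracity:
# normalize known abbreviations/typos before records are compared.
ABBREVIATIONS = {"st": "street", "rd": "road", "u": "you", "gr8": "great"}

def cleanse(text: str) -> str:
    """Lowercase, strip simple punctuation, and expand known abbreviations."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(cleanse("Gr8 service on Main St."))  # -> great service on main street
```

A real system would back this with much larger dictionaries and fuzzy matching, but the principle of transforming data into a common form before linking and matching is the same.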
Fig. 2.1. Five ‘Vs’ of Big data
2.4 Data Mining
Data mining (also called knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both [5]. Technically speaking, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. It is used for the following specific classes of activities: classification, estimation, prediction, association rules, clustering, and description.
Classification
Classification is a process of generalizing the data
according to different instances. Classification consists of
examining the features of a newly presented object and
assigning it to a predefined class. The classification task is characterized by well-defined classes and a training set of preclassified examples. Some of the major classification algorithms in data mining include decision trees, the k-nearest neighbor classifier, Naive Bayes, and AdaBoost.
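A minimal classification example in this spirit is a hand-rolled 1-nearest-neighbor rule; the training points and labels below are invented for illustration:

```python
import math

# Tiny training set: (feature vector, class label) pairs, invented data.
TRAINING = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

def classify(point):
    """Assign the label of the nearest training example (1-NN)."""
    return min(TRAINING, key=lambda ex: math.dist(ex[0], point))[1]

print(classify((1.1, 0.9)))  # -> low
print(classify((8.5, 8.8)))  # -> high
```

Real classifiers generalize beyond memorized examples, but this captures the essence of the task: well-defined classes plus preclassified training data.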
Estimation
Given some input data, we use estimation to come up
with a value for some unknown continuous variables such
as income, height or credit card balance. Estimation
generally deals with continuously valued outcomes.
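Estimation of a continuous outcome can be sketched with an ordinary least-squares fit; the income figures below are invented purely for illustration:

```python
# Closed-form least-squares fit of income against years of experience.
xs = [1, 2, 3, 4, 5]        # years of experience (illustrative)
ys = [30, 35, 41, 44, 50]   # income in thousands (illustrative)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

estimate = slope * 6 + intercept  # estimated income at 6 years
print(round(slope, 2), round(estimate, 1))
```

The same idea scales up to regression models with many input attributes.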
Prediction
Prediction may come in form of a simple statement that
shows some expected outcome. It is a statement about the
way things will happen in the future, often but not always
based on experience or knowledge.
Association Rules
An association rule is a rule which implies certain
association relationships among a set of objects (such as
“occur together” or “one implies the other”) in a database.
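The "occur together" idea can be sketched by counting pair supports over hypothetical market baskets, which is the first step of an Apriori-style search:

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count how often each unordered pair of items occurs together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
support = {p: c / len(transactions) for p, c in pair_counts.items()}
print(support[("bread", "milk")])  # 2 of 4 baskets -> 0.5
```

A full association-rule miner would prune infrequent itemsets and then derive rules with confidence above a threshold.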
Clustering
Clustering deals with the problem of locating a structure
in a collection of unlabeled data. Generally, Knowledge Discovery in Databases (KDD) is an activity that unveils hidden knowledge and insights from a large volume of data, with data mining as its core and most challenging step. Typically, data mining uncovers interesting patterns and relationships hidden in a large volume of raw data, and the results discovered may be used to make valuable predictions or future observations in the real world. Data mining has been applied in a wide range of domains such as business, medicine, science, and engineering, and has led to numerous beneficial services in many walks of real business [10].
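Clustering, as described above, can be sketched with a minimal k-means loop on one-dimensional toy data: assign each point to its nearest centre, then move each centre to its cluster's mean.

```python
# Minimal k-means sketch on one-dimensional toy data, k = 2.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centres = [1.0, 9.0]              # initial guesses for the 2 centres

for _ in range(10):               # a few fixed iterations suffice here
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters]

print(centres)  # -> [1.5, 8.5]
```

Production clustering must also handle empty clusters, higher dimensions, and convergence tests, but the assign-then-update loop is the core of the technique.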
2.5 Data Mining for Big Data
There is a serious mismatch between the demands of the
big data management and the capabilities of current Data
Base Management Systems (DBMS). Each of the V’s of big
data (volume, variety, velocity, etc) clearly implies one
distinct aspect of critical deficiencies of today’s DBMSs.
Gigantic volume requires great scalability and massive parallelism that are beyond the capability of current systems; the great variety of data types in big data is particularly unfit for current database systems due to the restriction of the DBMS's closed processing architecture; and the speed/velocity requirement of big data processing demands commensurate real-time efficiency, which again is far beyond the capacity of current DBMSs. Another limitation of current DBMSs is that they typically require data to be first loaded into their storage systems, and enforce a uniform format, before any access or processing is allowed. Confronted with the huge volume of big data, this importing/loading stage could take hours, days, or even months, causing a substantial delay in the availability of the DBMS [5].
III. CRITICAL CHALLENGES OF DATA MINING
IN BIG DATA TECHNOLOGY
3.1 Overview
Accessible information is critical for a productive
economy and a robust society. In the past, data mining
techniques have been used to discover unknown patterns
and relationships of interest from structured, homogeneous,
and small datasets. Today, we are living in the big data era
where enormous amounts of heterogeneous, semi-
structured and unstructured data are continually generated
at unprecedented speed and scale. Big data exposes the limitations of existing data mining techniques and has given rise to a number of new challenges. Real-world data is heterogeneous, incomplete, and noisy; data in large quantities will normally be inaccurate or unreliable, owing to errors from the instruments that measure it or to human error. For data to have any value it needs to be
discoverable, accessible, and usable, and the significance of these requirements only increases the need for effective and efficient mining of big data. Big data is an emerging trend, and the need for big data mining is rising in all science and engineering domains. With effective big data mining, it will be possible to provide the most relevant and most accurate social sensing feedback for a better understanding of our society in real time.
In this section, we examine in detail the perceived challenges of big data mining, in order to seek ways of overcoming them and so trigger revolutionary breakthroughs in business and commerce.
3.2 Data Mining Challenges due to Intrinsically
Distributed Complex Environment
The shift towards intrinsically distributed complex
problem solving environments is prompting a range of new
data mining research and development problems, which can
be classified into the following broad challenges:
i. Distributed data
Data for mining is usually stored in distributed computing environments and on heterogeneous platforms. For technical and organizational reasons, it is often impossible to gather all such data into a centralized location for processing. Consequently, there is a need to develop better algorithms, tools, and services that facilitate the mining of distributed data.
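One common pattern for mining distributed data is to run a local step at each site and ship back only the partial results for merging. The sketch below invents the site data and field names purely for illustration:

```python
from collections import Counter

def mine_partition(records):
    """Local mining step run independently at each autonomous site:
    here, simply counting records per category."""
    return Counter(r["category"] for r in records)

# Hypothetical partitions held at two different sites.
site_a = [{"category": "fraud"}, {"category": "normal"}]
site_b = [{"category": "fraud"}, {"category": "fraud"}]

# The coordinator merges the partial counts into a global view
# without ever moving the raw records.
merged = mine_partition(site_a) + mine_partition(site_b)
print(merged["fraud"])  # -> 3
```

Real distributed miners exchange richer summaries (models, candidate patterns) than counts, but the merge-partial-results structure is the same.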
ii. Distributed operations
In order to facilitate seamless integration of various
computing resources into distributed data mining systems
for complex problem solving, new algorithms, tools, and
other IT infrastructure need to be developed.
iii. Huge data
There is a continuous increase in the size of big data. Most machine learning algorithms were created to handle only small training sets, for instance a few hundred examples. Much care must be taken before similar techniques can be applied to databases thousands of times bigger. Having very large amounts of data is advantageous, since the data will probably reveal relations that really exist, but the number of possible descriptions of such a dataset is enormous. Some possible
ways of coping with this problem are to design algorithms with lower complexity and to use heuristics to find the best classification rules. Simply using a faster computer is not always a good solution. There is a need to develop algorithms for mining large, massive, and heterogeneous data sets that grow continuously, with rapidly increasing complexity of data types, sources, and structures. Mining such data will require the development of new methodologies, algorithms, and tools. In practice, sampling and parallelization are effective tools for attacking this scalability problem.
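Sampling, mentioned above as a scalability tool, can be done in a single pass over a stream of unknown length using reservoir sampling (Algorithm R); a small sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (Vitter's Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k/(i+1)
    return sample

random.seed(42)
print(reservoir_sample(range(1_000_000), 5))
```

The mining algorithm then runs on the small sample instead of the full stream, trading a little accuracy for a large saving in memory and time.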
iv. Data privacy, security, and governance
Automated data mining in distributed environments raises serious issues in terms of data privacy, security, and governance. There is a need to develop grid-based data mining technologies to address these issues.
v. User-friendliness
A good software system must ultimately hide its
technological complexity from the user. To facilitate this,
new user-friendly interfaces, software tools, and IT
infrastructure will be needed in the areas of grid-supported
workflow management, resource identification, allocation,
and scheduling.
3.3 Data Mining Challenges due to Large, Unstructured, and Changing Datasets
A number of data mining challenges also exist that result from the large, unstructured, and ever-changing datasets. Some of these challenges include the following:
i. Data integrity
Another challenge with distributed data management is that of data integrity. Data analysis can only be as good as the data being analyzed, and a major implementation challenge is integrating conflicting or redundant data from different sources. If, for instance, a bank maintains credit card accounts on different databases, the addresses (or even the names) of a single cardholder may differ in each database. There is a need for software that translates data from one system to another and selects, for example, the address most recently entered.
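The "address most recently entered" rule can be sketched as follows; the record layout and dates are hypothetical:

```python
# Reconcile conflicting cardholder records from different databases
# by keeping the most recently entered address for each holder.
records = [
    {"holder": "A. Bello", "address": "12 Old Rd", "entered": "2017-03-01"},
    {"holder": "A. Bello", "address": "7 New St",  "entered": "2018-01-15"},
]

latest = {}
for rec in records:
    key = rec["holder"]
    # ISO-format dates compare correctly as plain strings.
    if key not in latest or rec["entered"] > latest[key]["entered"]:
        latest[key] = rec

print(latest["A. Bello"]["address"])  # -> 7 New St
```

Real record-linkage systems also match approximate names and translate field formats between systems, but last-write-wins on a timestamp is a common baseline.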
ii. Interpretation of results
Data mining output may require experts to correctly
interpret the results, which might otherwise be meaningless
to the average database user.
iii. Multimedia data
Most previous data mining algorithms targeted only the
traditional data types (numeric, character, text, etc.). The
use of multimedia data such as is found in GIS databases
complicates or invalidates most of the traditional
algorithms.
iv. High dimensionality
A conventional database schema will usually comprise many different attributes, but not all of them may be needed to solve a given data mining problem. Using the other attributes may only increase the overall complexity and decrease the efficiency of an algorithm. In fact, when many attributes (dimensions) are involved, it is difficult to determine which ones should be used. One solution to this high-dimensionality problem is to reduce the number of attributes, a technique known as dimensionality reduction. However, determining which attributes are not needed is not always easy.
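A very simple form of dimensionality reduction is to drop near-constant attributes, whose variance carries almost no information; a sketch on invented columns:

```python
from statistics import pvariance

# Hypothetical attribute columns from a customer table.
columns = {
    "age":     [23, 45, 31, 52, 40],
    "country": [1, 1, 1, 1, 1],       # constant -> useless for mining
    "income":  [30, 80, 45, 90, 60],
}

# Keep only attributes whose population variance exceeds a threshold.
kept = [name for name, values in columns.items() if pvariance(values) > 0.01]
print(kept)  # -> ['age', 'income']
```

More principled methods (feature selection against the target, or projections such as PCA) exist, but a variance filter is a cheap first pass.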
v. Noisy data
In a large database, many of the attribute values will usually be invalid, inexact, or incorrect, owing to erroneous instruments used in measuring some properties or to human error during data entry. Noise takes two forms in the data: corrupted values and missing attribute values.
Corrupted values: Sometimes some of the values in the training set are altered from what they should have been. This may result in conflicts between one or more tuples in the database and the rules already established. Such extreme values are usually regarded by the system as noise and consequently ignored; the problem then is how to know whether these extreme values are correct or not.
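One simple way to surface suspected corrupted values for review, rather than silently ignoring them, is a deviation test; the measurements below are toy data:

```python
from statistics import mean, pstdev

# Toy sensor readings; 97.0 looks like a corrupted value.
values = [10.2, 9.8, 10.1, 10.0, 97.0, 9.9]

# Flag values more than two population standard deviations from the mean.
mu, sigma = mean(values), pstdev(values)
suspects = [v for v in values if abs(v - mu) > 2 * sigma]
print(suspects)  # -> [97.0]
```

Flagged values can then be checked against the source instrument or a domain expert instead of being discarded blindly.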
Missing attribute values: One or more of the attribute values may be missing completely. When an attribute value is missing for an object during classification, the system may check all matching rules and calculate the most probable classification. It is, however, advisable to correct all incorrect values before running data mining applications.
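A common pre-processing sketch is to impute missing attribute values with the column mean before running an algorithm that cannot handle gaps; the column is toy data and `None` marks a missing value:

```python
from statistics import mean

# A toy "age" column with missing values marked as None.
ages = [25, None, 40, 35, None]

known = [a for a in ages if a is not None]
fill = mean(known)  # mean of the observed values
imputed = [a if a is not None else fill for a in ages]
print(imputed)
```

Mean imputation can bias results, as the surrounding text warns, so more careful pipelines use model-based or nearest-neighbor imputation instead.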
vi. Irrelevant data
Some attributes in the database might not be of interest to
the data mining task being carried out.
vii. Missing data
During the pre-processing phase of knowledge discovery
from databases, missing data may be replaced with
estimates. This approach can however introduce errors and
lead to invalid results from the overall data mining activity.
viii. Dynamic data
Big data is definitely not static. Most data mining algorithms, however, assume a static database, so using them requires a complete re-run of the program any time the database changes. Since most big data change continually, there is a need to develop data mining algorithms whose rules reflect the content of the database at all times, in order to make the best possible classification. Many existing data mining systems require that all training examples be given at once; if something changes at a later time, the whole learning process may have to be conducted again. The challenge is how data mining systems can avoid this and instead revise their current rules according to the updates performed.
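The idea of revising a model incrementally instead of re-running from scratch can be sketched with a running mean that folds in each new record in constant time:

```python
# Incremental (online) update of a running mean: the summary reflects
# the database after every insert, with no full re-run.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def add(self, x):
        """Fold one new record into the summary in O(1) time."""
        self.n += 1
        self.mean += (x - self.mean) / self.n

m = RunningMean()
for x in [4, 8, 6]:
    m.add(x)
print(m.mean)  # -> 6.0
```

Incremental versions of real mining algorithms (online clustering, streaming decision trees) follow the same principle: maintain a compact summary that each update adjusts in place.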
IV. OVERCOMING DATA MINING CHALLENGES
IN BIG DATA
4.1 HACE Theorem: Modeling Big Data
Characteristics
The HACE theorem is a fundamental theorem that uniquely models big data characteristics: big data is heterogeneous and of very large volume, comes from autonomous sources with distributed and decentralized control, and exhibits complex and evolving relationships among the data. These characteristics pose
an enormous challenge in extracting useful knowledge and information from big data.
To explain the characteristics of big data, the well-known fable of a number of blind men trying to size up an elephant usually comes to mind. The big elephant represents Big Data in our context. The purpose of each blind man is to draw a useful conclusion about the elephant, which of course depends on the part of the animal he touches. Since the knowledge extracted by each blind man is limited to the part of the information he collects, it is expected that the blind men will each conclude, independently and differently, that the elephant "feels" like a rope, a stone, a stick, a wall, a hose, and so on.
To make the problem even more complex, assume that:
i. The elephant is growing very quickly in size and its posture is constantly changing.
ii. Each blind man has his own information sources, possibly inaccurate and unreliable, that give him varying knowledge about what the elephant looks like (for example, one blind man may share his own inaccurate view of the elephant with a friend).
This information sharing will definitely change the thinking of each blind man.
Exploring information from Big Data is equivalent to the
scenario illustrated above. It will involve merging or
integrating heterogeneous information from different
sources (just like the blind men) to arrive at the best possible
and accurate knowledge regarding the information domain.
This will certainly not be as easy as asking each blind man about the elephant or drawing one single picture of the elephant from the blind men's joint opinion, because each data source may express itself in a different language and may even have confidentiality concerns about the messages it has measured, based on its own information exchange procedure.
The HACE theorem therefore suggests the following key characteristics of Big Data:
a. Huge with Heterogeneous and Diverse Data
Sources
Big data is heterogeneous because different data collectors use their own protocols and schemas for knowledge recording. Therefore, the nature of the information gathered, even from the same sources, will vary with the application and the procedure of collection. This ends up in diverse knowledge representations.
b. Autonomous Sources and Decentralized Control
With big data, each data source is distinct and autonomous, with distributed and decentralized control; there is no centralized control over the generation and collection of information. This setting is similar to the World Wide Web (WWW), where the function of each web server does not depend on the other servers.
c. Complex Data Relationships and Evolving
Knowledge Associations
Traditionally, analysis in centralized information
systems tries to discover the features that best represent
each observation. That is, each object is treated as an
independent entity, without considering its social
connections to other objects within or outside the domain.
Yet relationships and correlations are among the most
important features of human society. In our dynamic
society, individuals must be represented alongside their
social ties and connections, which themselves evolve with
temporal, spatial, and other factors. For example, the
relationship among Facebook friends is complex because
new friends are added every day, so maintaining the
relationships among these friends poses a huge challenge
for developers. Other examples of complex data types are
time-series data, maps, videos, and images.
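As an illustration of such evolving relationships, the toy sketch below (user names and timestamps are hypothetical, invented for the example) records when each friendship was formed, so that a user's connections can be queried as they stood at any point in time rather than treated as a static set:

```python
from collections import defaultdict

class EvolvingSocialGraph:
    """Toy model of an evolving social network: each edge carries
    the time it was formed, so relationships can be queried as of
    any moment rather than treated as static."""

    def __init__(self):
        self.edges = defaultdict(dict)  # user -> {friend: time_added}

    def add_friendship(self, a, b, t):
        # Friendships are symmetric; record the formation time on both sides.
        self.edges[a][b] = t
        self.edges[b][a] = t

    def friends_as_of(self, user, t):
        # Return only the relationships that already existed at time t.
        return {f for f, added in self.edges[user].items() if added <= t}

g = EvolvingSocialGraph()
g.add_friendship("ada", "bob", t=1)
g.add_friendship("ada", "eve", t=5)
print(g.friends_as_of("ada", 2))  # {'bob'}
```

A real social-graph store would face the same temporal-query problem at vastly larger scale, which is precisely the maintenance challenge noted above.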
4.2 Hadoop: An Effective Application of the HACE
Theorem
As illustrated earlier, the HACE theorem models Big
Data as: huge, with heterogeneous and diverse data sources
that yield diverse knowledge representations; autonomous
sources with decentralized control, similar to the World
Wide Web (WWW), where each web server functions
independently of the others; and complex data relationships
with evolving knowledge associations.
A popular open source implementation of the HACE
theorem is Apache Hadoop. Hadoop has the capacity to link
a number of relevant, disparate datasets for analysis in order
to reveal new patterns, trends and insights.
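As a minimal illustration of this idea of linking disparate datasets on a shared key (the datasets, keys, and values below are invented for the example; Hadoop-based pipelines perform such joins at far larger scale and across many more sources):

```python
# Two "disparate" datasets keyed by the same user identifier.
sales  = {"u1": 250, "u2": 90}                 # e.g. a transactions source
social = {"u1": "high_activity", "u2": "low"}  # e.g. a social-media source

# Link the datasets on their shared keys to reveal combined patterns.
linked = {uid: (sales[uid], social[uid]) for uid in sales.keys() & social.keys()}
print(linked["u1"])  # (250, 'high_activity')
```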
4.3 Three-Tier Structure for Big Data Mining Models
We propose a conceptual framework for big data mining
that follows a three-tier structure as shown in figure 4.1
below:
Fig. 4.1. Three-tier structure in big data mining
4.4 Interpretation
Tier 1:
Considering figure 4.1, Tier 1 concentrates on accessing
big datasets and performing arithmetic operations on them.
Big data cannot practically be stored in a single location,
and storage across diverse locations will continue to grow.
An effective computing platform must therefore be in place
to take up the distributed, large-scale datasets and perform
arithmetic operations on them. To achieve such common
operations in a distributed computing environment, a
parallel computing architecture must be employed. The
major challenge at Tier 1 is that a single personal computer
cannot possibly handle big data mining because of the large
quantity of data involved. To overcome this challenge at
Tier 1, the concept of data distribution has to be used. For
processing big data in a distributed environment, we
propose the adoption of parallel programming models such
as Hadoop's MapReduce technique.
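The MapReduce pattern proposed here can be sketched in a single process as follows. This is a conceptual illustration of the map, shuffle, and reduce phases only, not the actual Hadoop API; in Hadoop, each phase would run in parallel across many nodes:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" turns its chunk of text into (word, 1) pairs.
def map_phase(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Shuffle phase: group the intermediate pairs by key across all mappers.
def shuffle(mapped):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values independently (parallelizable).
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data mining", "mining big data sets"]  # two "distributed" splits
counts = reduce_phase(shuffle(map(map_phase, chunks)))
print(counts["big"], counts["mining"])  # 2 2
```

Because each mapper sees only its own chunk and each reducer handles one key, no single machine ever needs the whole dataset, which is exactly what makes the approach suitable for Tier 1.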
Tier 2:
Tier 2 focuses on the semantics and domain knowledge
of the different Big Data applications. Such information
benefits both the data mining process at Tier 1 and the data
mining algorithms at Tier 3 by adding certain technical
barriers, checks, and balances to the process. These
technical barriers are necessary because information
sharing and data privacy mechanisms between data
producers and data consumers can differ across domain
applications [9].
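One possible form such a technical barrier could take is a policy table that filters each record before it reaches a data consumer. The roles, field names, and values below are purely illustrative assumptions, not part of any specific privacy framework:

```python
# Hypothetical Tier-2 "technical barrier": a policy table decides which
# fields each consumer role may see before data reaches the mining layer.
POLICY = {
    "analyst":   {"age", "region"},             # aggregate-level fields only
    "regulator": {"age", "region", "user_id"},  # full access
}

def release(record, consumer_role):
    # Unknown roles get an empty view rather than the raw record.
    allowed = POLICY.get(consumer_role, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"user_id": "u42", "age": 31, "region": "Owerri"}
print(release(record, "analyst"))  # {'age': 31, 'region': 'Owerri'}
```

The same gate can apply different policies per domain application, which is how differing producer/consumer privacy mechanisms would be accommodated.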
Tier 3:
Algorithm design takes place at Tier 3. Big data mining
algorithms help tackle the difficulties raised by Big Data
volumes, complexity, dynamic data characteristics, and
distributed data. The algorithm at Tier 3 contains three
iterative stages. The first is the pre-processing of uncertain,
sparse, heterogeneous, and multisource data. The second is
the mining of dynamic and complex data after the pre-
processing operation. In the third stage, the global
knowledge obtained by local learning, matched with all
relevant information, is fed back into the pre-processing
stage, and the model and parameters are adjusted according
to this feedback.
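The three iterative stages might be sketched schematically as follows. The `preprocess`, `mine`, and `adjust` functions here are placeholders standing in for real mining components; only the feedback structure of the loop reflects the description above:

```python
def preprocess(raw, params):
    # Stage 1: clean uncertain/sparse multisource data (drop missing values,
    # apply the current scaling parameter).
    return [x * params["scale"] for x in raw if x is not None]

def mine(data):
    # Stage 2: mine the preprocessed data (here a trivial "model": the mean).
    return sum(data) / len(data)

def adjust(params, model):
    # Stage 3: feed global knowledge back; adapt the parameters accordingly.
    params["scale"] *= 0.9 if model > 1 else 1.1
    return params

raw, params = [1.0, None, 2.0, 3.0], {"scale": 1.0}
for _ in range(3):  # the feedback loop: preprocess -> mine -> adjust
    model = mine(preprocess(raw, params))
    params = adjust(params, model)
print(round(params["scale"], 3))
```

Each pass through the loop refines the parameters used by the next pre-processing pass, which is the iterative behaviour the third stage requires.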
V. CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
In this research paper, we have examined some major
challenges of big data mining, some of which arise from
the intrinsically distributed and complex environment,
while others are due to the large, unstructured, and dynamic
datasets involved. We find that the gradual shift towards
distributed, complex problem-solving environments is
prompting a range of new data mining research and
development problems. We have equally proffered a
solution to these new challenges by proposing the HACE
theorem, to fully harness the potential benefits of the big
data revolution and to trigger a revolutionary breakthrough
in commerce and industry. The research also proposed a
three-tier data mining structure for big data that provides
accurate and relevant social sensing feedback for a better
understanding of our society in real time.
5.2 Recommendations
Based on our study and observations, we recommend
that analysts revisit most of the data mining techniques in
use today and develop distributed versions of the various
data mining models available, in order to meet the new
challenges of big data. Developers should also take
advantage of available big data technologies with
affordable, open-source, and easy-to-deploy platforms.
REFERENCES
[1] R. Ahmed and G. Karypis, "Algorithms for Mining the Evolution
of Conserved Relational States in Dynamic Networks," Knowledge
and Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.
[2] S. Berkovich and D. Liao, "On Clusterization of Big Data Streams,"
in Proc. 3rd International Conference on Computing for Geospatial
Research and Applications, article no. 26, ACM Press, New York,
2012.
[3] D. S. Tamhane and S. N. Sayyad, "Big Data Analysis Using HACE
Theorem," International Journal of Advanced Research in
Computer Engineering & Technology (IJARCET), vol. 4, issue 1,
Jan. 2015.
[4] Department of Finance and Deregulation, Australian Government,
"Big Data Strategy - Issue Paper," Mar. 2013.
[5] D. Che, M. Safran, and Z. Peng, "From Big Data to Big Data
Mining: Challenges, Issues, and Opportunities," Department of
Computer Science, Southern Illinois University Carbondale,
Illinois 62901, USA, 2015.
[6] EYGM Limited, "Big Data: Changing the Way Businesses Compete
and Operate," 2014. Retrieved from www.ey.com/GRCinsignts.
[7] L. Cao, "Combined Mining: Analyzing Object and Pattern
Relations for Discovering Actionable Complex Patterns,"
sponsored by Australian Research Council Discovery Grants, 2012.
[8] P. Gadling and M. Bartere, "Implementing HACE Theorem for Big
Data Processing," International Journal of Science and Research
(IJSR), 2013. ISSN (Online): 2319-7064.
[9] R. K. Pandey and U. A. Mande, "Survey on Data Mining and
Information Security in Big Data Using HACE Theorem,"
International Engineering Research Journal (IERJ), vol. 1, issue
11, 2016. ISSN 2395-1621.
[10] S. Khan, G. Subbarao, and G. V. Reddy, "HACE Theorem Based
Data Mining Using Big Data," Research Inventy: International
Journal of Engineering and Science, vol. 6, issue 5, pp. 83-87, May
2016.
[11] S. Sagiroglu and D. Sinanc, "Big Data: A Review," in Proc.
International Conference on Collaboration Technologies and
Systems (CTS), pp. 42-47, 20-24 May 2013.
[12] W. Fan and A. Bifet, "Mining Big Data: Current Status and
Forecast to the Future," vol. 14, issue 2, 2013.
AUTHOR’S PROFILE
Dr. Anthony Otuonye
(anthony.otuonye@futo.edu.ng)
Dr. Anthony Otuonye obtained a Bachelor of Science
Degree in Computer Science from Imo State University
Owerri, Nigeria, and a Master of Science Degree in
Information Technology from the Federal University of
Technology Owerri. He also holds a Ph.D. in Software
Engineering from Nnamdi Azikiwe University Awka, Nigeria. Currently,
Dr. Anthony is a Lecturer and Researcher with the Department of
Information Management Technology, Federal University of Technology
Owerri, Nigeria. He is a member of some professional bodies in his field
and has authored several research articles most of which have been
published in renowned international journals. He is married to Mrs. Edit
Otuonye and, with their children, they live in Owerri, Imo State of Nigeria.
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 

Recently uploaded (20)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 

Overcoming Big Data Mining Challenges for Revolutionary Breakthroughs in Commerce and Industry

Copyright © 2018 IJECCE, All right reserved
International Journal of Electronics Communication and Computer Engineering
Volume 9, Issue 1, ISSN (Online): 2249–071X

Overcoming Big Data Mining Challenges for Revolutionary Breakthroughs in Commerce and Industry

Dr. Anthony Otuonye I.1, Dr. Nwokonkwo Obi C.1, Dr. Eke Felix M.2
1 Department of Information Management Technology, Federal University of Technology Owerri, Nigeria.
Email: anthony.otuonye@futo.edu.ng, ifeanyiotuonye@yahoo.com, +234(0)8060412674
2 The Library, Federal University of Technology Owerri, Nigeria.

Abstract – Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize, and process data for profitable business ventures. Data mining is a major component of big data technology, and its whole essence is to analytically explore data in search of consistent patterns and to further validate the findings by applying the detected patterns to new data sets. Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. This paper explores the major challenges of big data mining, some of which result from the intrinsically distributed and complex environment, and some from the large, unstructured and dynamic datasets available for mining. We discover that the gradual shift towards distributed complex problem-solving environments is now prompting a range of new data mining research and development problems. In this paper, we also proffer solutions to the new challenges of big data mining by proposing the HACE theorem, in order to fully harness the potential benefits of the big data revolution and to trigger a revolutionary breakthrough in commerce and industry. The research also proposes a three-tier data mining structure for big data that provides accurate and relevant social sensing feedback for a better understanding of our society in real time.
Based on our observations, we recommend a re-visitation of most of the data mining techniques in use today and the deployment of distributed versions of the various data mining models available, in order to meet the new challenges of big data. Developers should take advantage of available big data technologies with affordable, open-source, and easy-to-deploy platforms.

Keywords – Big data, Data mining, Autonomous sources, Distributed database, HACE theorem.

I. INTRODUCTION

1.1 ‘Big Data’ Concept, Meaning and Origin
The first academic research paper to carry the words 'Big Data' in its title was written by Diebold in 2000, but the term 'Big Data' appeared for the first time in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled "Big Data and the Next Wave of InfraStress". However, the first book to mention the term 'Big Data' is a data mining book written by Weiss and Indurkhya, also in 1998, which goes a long way to show that data mining was relevant to big data from the very beginning. From then onwards, many other scholars and academic researchers began to develop an interest in the concept. For instance, at the KDD "BigMine '12" Workshop in 2012, diligent scholars brought forth startling revelations about big data and presented amazing facts about internet usage. Some of the statistics presented include the following: Google handles more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The origin of the term 'Big Data' stems from the simple fact that the business world and the entire society create huge amounts of data on a daily basis.
A recent study estimated that every minute, Google receives over 2 million queries, e-mail users send over 200 million messages, YouTube users upload 48 hours of video, Facebook users share over 680,000 pieces of content, and Twitter users generate 100,000 tweets. Besides, media sharing sites, stock trading sites and news sources continually pile up more new data throughout the day. Big Data simply describes the availability and exponential growth of data, which might be structured, semi-structured or unstructured in nature. It consists of billions or trillions of records, which might run into terabytes, petabytes (1,024 terabytes) or exabytes (1,024 petabytes). The amount of data produced nowadays in our society can only be estimated in the order of zettabytes, and it is estimated that this quantity grows at a rate of about 40 percent every year. More and more data are being generated from mobile devices and from big software companies such as Google, Apple, Facebook, Yahoo, etc., which now consider finding useful patterns in such data essential for the steady improvement of user experience. Many researchers have defined big data in several ways. According to [6], big data is defined as the large, ever-growing and disparate volumes of data being created by people, tools and machines. Both structured and unstructured information is generated on a daily basis from a number of sources, including social media, internet-enabled devices (such as smart phones and tablets), machine data, video and voice recordings. Research has shown that our society today generates more data in 10 minutes than all of humanity created up to the year 2003. The Big Data revolution is on, and corporate organizations around the world can leverage its power and potential for breakthroughs in their commercial activities.
With proper use of big data, there is bound to be a global commercial revitalization and business breakthroughs for most indigenous corporations. It has been
proved that big data can provide business owners and corporate managers with innovative technologies to collect and analytically process vast data to derive real-time business insights relating to such market forces as consumers, risk, profit, performance, productivity management and enhanced shareholder value. Data mining remains a major aspect of big data analysis. Though data mining is a very powerful tool in computing, it faces various challenges during implementation. Some of the challenges relate to performance, data, and the methods and techniques used. Other data mining challenges include the recognition of poor-quality data, dirty data, missing values, inadequate data size, and poor representation in data sampling, all of which can be a great hindrance to the big data revolution. Mining big data has opened many new challenges and opportunities. Even though big data bears greater value (making hidden knowledge available and insights more valuable), there are tremendous challenges in extracting that hidden knowledge and those insights, since the established processes of knowledge discovery and data mining from conventional datasets were not designed for, and do not work well with, big data. Data mining processes become successful only when these challenges are identified and successfully overcome to adequately harness big data's potential.

1.2 Aim of the Study
In this paper, we expose some of the data mining challenges faced by big data processing systems, as well as some technology and application challenges of big data systems that can endanger big data diffusion, especially in most developing countries where its potential is not being utilized. The challenges of big data are broad in terms of data access, storage, searching, sharing, and transfer.
Recommendations are also made in this paper to proffer solutions to the mining, technology and application-software challenges, in order to trigger sustainable breakthroughs in commerce and industry via the big data revolution.

II. RELEVANT CONCEPTS IN BIG DATA TECHNOLOGY

We begin with a review of some relevant concepts, including: big data, types of big data and their sources, the ‘V’s of the big data phenomenon, data mining, and big data mining.

2.1 Big Data
Big Data is a general term used to describe any large collection of data sets, so large and from such diverse sources that it becomes difficult to process them using conventional data processing applications. The challenges include analysis, capture, search, sharing, storage, transfer, revelation, and privacy violations. We are currently living in the era of big data and cloud computing, with diverse new challenges and opportunities. Some large business organizations are beginning to handle data at petabyte scale. Recent advances in technology have equipped individuals and corporations with the ability to easily and quickly generate tremendous streams of data at any time and from anywhere using their digital devices; remote sensors have been ubiquitously installed and utilized to produce continuous streams of digital data. Massive amounts of heterogeneous, dynamic, semi-structured and unstructured data are now being generated from different sources and applications, such as mobile-banking transactions, online user-generated content (such as tweets, blog posts, and videos), online-search log records, emails, sensor networks, satellite images and others [5]. Both government agencies and corporate business organizations are beginning to leverage the power of big data technology to improve their administrative and business skills.
For example, governments now mine the contents of social media networks, blogs, online transactions and other sources of information to identify the need for government facilities and to recognize suspicious organizational groups. They also mine these contents to predict future events such as threats or promises. Service providers are beginning to track their customers' purchases online, in-store, and on-phone, including customers' behaviors through recorded streams of online clicks, as well as product reviews and rankings, in order to improve their marketing efforts, predict new growth points of profits, and increase customer satisfaction.

2.2 Types of Big Data and Sources
There are basically two types of big data, which are usually generated from social media sites, streaming data from IT systems and connected devices, and data from government and open data sources. The two types of big data are structured and unstructured data. Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data are readily searchable by simple, straightforward search engine algorithms or other search operations; unstructured data is essentially the opposite, and the lack of structure usually makes compilation a time- and energy-consuming task.

Structured data: These are data in the form of words and numbers that are easily categorized in tabular format for easy storage and retrieval using relational database systems. Such data are usually generated from sources such as global positioning system (GPS) devices, smart phones and network sensors embedded in electronic devices. Spreadsheets can be considered structured data, which can be quickly scanned for information because they are properly arranged in a relational database system.

Unstructured data: These are data items that are not easily analyzed using traditional systems, because of their inability to be maintained in a tabular format.
Unstructured data contain more complex information, such as photos and other multimedia content, consumer comments on products and services, customer reviews on commercial websites, and so on. According to [9], unstructured data is sometimes not easily readable. Email is an example of unstructured data: while the busy inbox of a corporate human resources manager might be arranged by date, time or size, it would also have been possible to arrange it by exact subject and content. But this is almost impractical, since people do not generally write about only one subject matter, even in focused emails.

2.3 The ‘Vs’ of the Big Data Phenomenon
The concept of big data began to gather momentum from 2003, when the five ‘Vs’ of big data were proposed to give foundational structure to the phenomenon in its current form and definition. According to the foundational definition in [6], the big data concept has the following five mainstreams: volume, velocity, value, variety, and veracity.

Volume: Organizations gather their information from different sources, including social media, cell phones, machine-to-machine (M2M) sensors, credit cards, business transactions, photographs, video recordings, and so on. A vast amount of data is generated each second from these channels, and the volumes have become so large that storing and analyzing them constitutes a real problem, especially with traditional database technology. According to [6], Facebook alone generates about 12 billion messages a day, and over 400 million new pictures are uploaded every day. People's comments on issues of social importance alone run into millions, and the collection and analysis of such information have now become an engineering challenge.

Velocity: By velocity, we refer to the speed at which new data is being generated from various sources such as e-mails, Twitter messages, video clips, and social media updates. Such information now comes in torrents from all over the world on a daily basis. The streaming data need to be processed and analyzed at the same speed and in a timely manner for them to be of value to business organizations and the general society. Results of data analysis should equally be transmitted instantaneously to various users. Credit card transactions, for instance, need to be checked in seconds for fraudulent activities.
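The in-stream timeliness this implies can be sketched in a few lines. The following is a hypothetical illustration, not the paper's method: the card IDs, window size and threshold factor are invented, and a real fraud system would use far richer features.

```python
from collections import defaultdict, deque

# Sketch: flag a card transaction as suspicious if it exceeds 3x the
# average of that card's last 10 amounts -- the kind of check that must
# run inside the stream, in seconds, rather than in a nightly batch.

WINDOW = 10          # recent amounts remembered per card (assumption)
FACTOR = 3.0         # "suspicious" multiple of the recent average (assumption)

recent = defaultdict(lambda: deque(maxlen=WINDOW))

def check(card_id, amount):
    """Return True if the amount looks suspicious for this card."""
    history = recent[card_id]
    suspicious = bool(history) and amount > FACTOR * (sum(history) / len(history))
    history.append(amount)   # update the sliding window as the stream flows
    return suspicious

# A toy stream: steady small purchases, then one large outlier.
stream = [("c1", 20), ("c1", 25), ("c1", 22), ("c1", 300), ("c1", 21)]
flags = [check(card, amt) for card, amt in stream]
print(flags)  # [False, False, False, True, False] -- the outlier is caught
```

The point of the sketch is only that the decision is made per event as it arrives, which is what the velocity requirement demands.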
Trading systems also need to analyze social media networks in seconds to obtain data for proper decisions on whether to buy or sell shares, and so on.

Variety: Variety refers to the different types of data and the varied formats in which data are presented. In traditional database systems, information is stored as structured data, mostly in numeric format. But in today's society, we receive information mostly as unstructured text documents, email, video, audio, financial transactions, and so on. Society no longer makes use only of structured data arranged in columns of names, phone numbers, and addresses that fit nicely into relational database tables. Recent research efforts have shown that more than 80% of today's data is unstructured.

Value: Data is only useful if it is turned into value. By value, we refer to the worth of the data being extracted. Business owners should not only embark on data gathering and analysis, but also understand the costs and benefits of collecting and analyzing such data. The benefits derived from the information should exceed the cost of data gathering and analysis for it to be considered valuable.

Veracity: Veracity refers to the trustworthiness of the data; that is, how accurate are the data gathered from the various sources? Big data analysis tries to verify the reliability and authenticity of data, such as abbreviations and typos in Twitter posts and other internet content. It can make comparisons that bring out correct and qualitative data sets. Big data technology also adopts new approaches that link, match, cleanse and transform data sets coming from various systems.

Fig. 2.1. The five ‘Vs’ of big data.

2.4 Data Mining
Data mining (also called knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both [5].
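As a minimal, hypothetical illustration of summarizing raw data into useful information, the following sketch rolls sales records up into revenue per region; the field names and figures are invented for the example.

```python
# Raw records (invented): each row is one sale.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 200.0},
    {"region": "south", "amount": 50.0},
]

# Summarize: total revenue per region.
revenue = {}
for record in sales:
    revenue[record["region"]] = revenue.get(record["region"], 0.0) + record["amount"]

# Rank regions by revenue -- one simple "perspective" on the raw data.
ranked = sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('north', 320.0), ('south', 130.0)]
```

A manager cannot act on four loose rows, but can act on the ranked totals; scaled to billions of rows, this same summarize-and-rank step is what data mining automates.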
Technically speaking, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. It is used for the following specific classes of activities: Classification, Estimation, Prediction, Association rules, Clustering, and Description.

Classification
Classification is the process of generalizing data according to different instances. It consists of examining the features of a newly presented object and
assigning it to a predefined class. The classification task is characterized by well-defined classes and a training set consisting of preclassified examples. Some of the major kinds of classification algorithms in data mining include: decision trees, the k-nearest neighbor classifier, Naive Bayes, Apriori and AdaBoost.

Estimation
Given some input data, estimation is used to come up with a value for some unknown continuous variable such as income, height or credit card balance. Estimation generally deals with continuously valued outcomes.

Prediction
Prediction may come in the form of a simple statement that shows some expected outcome. It is a statement about the way things will happen in the future, often but not always based on experience or knowledge.

Association Rules
An association rule is a rule that implies certain association relationships among a set of objects in a database (such as "occur together" or "one implies the other").

Clustering
Clustering deals with the problem of locating a structure in a collection of unlabeled data.

Generally, Knowledge Discovery in Databases (KDD) is an activity that unveils hidden knowledge and insights from a large volume of data, with data mining as its core and most challenging step. Typically, data mining uncovers interesting patterns and relationships hidden in a large volume of raw data, and the results discovered may be used to make valuable predictions or future observations in the real world. Data mining has been used by a wide range of applications in business, medicine, science and engineering, and has led to numerous beneficial services in many walks of real business [10].
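The association-rule task above can be sketched with a toy market-basket example. The baskets and item names are our own illustration, and the direct support/confidence counting below is a naive sketch, not an implementation of a specific algorithm such as Apriori.

```python
from itertools import combinations
from collections import Counter

# Toy transaction data (invented): each basket is one purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Support counts for single items and for item pairs.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(frozenset(p) for b in baskets
                      for p in combinations(sorted(b), 2))

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    P(consequent in basket | antecedent in basket)."""
    return pair_counts[frozenset((antecedent, consequent))] / item_counts[antecedent]

print(confidence("bread", "butter"))  # 0.75: 3 of 4 bread baskets also hold butter
```

Here the rule "bread implies butter" is exactly the "occur together / one implies the other" relationship described above, read off from co-occurrence counts.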
2.5 Data Mining for Big Data
There is a serious mismatch between the demands of big data management and the capabilities of current Database Management Systems (DBMSs). Each of the Vs of big data (volume, variety, velocity, etc.) implies one distinct aspect of the critical deficiencies of today's DBMSs. Gigantic volume requires great scalability and massive parallelism that are beyond the capability of current systems; the great variety of data types in big data is particularly unfit for current database systems, due to the restriction of the closed processing architecture of the DBMS; and the speed/velocity requirement of big data processing demands commensurate real-time efficiency, which again is far beyond the capacity of current DBMSs. Another limitation of current DBMSs is that they typically require that data first be loaded into their storage systems, which also enforces a uniform format before any access or processing is allowed. Confronted with the huge volume of big data, this importing/loading stage could take hours, days, or even months, causing a substantial delay in data availability [5].

III. CRITICAL CHALLENGES OF DATA MINING IN BIG DATA TECHNOLOGY

3.1 Overview
Accessible information is critical for a productive economy and a robust society. In the past, data mining techniques were used to discover unknown patterns and relationships of interest from structured, homogeneous, and small datasets. Today, we are living in the big data era, where enormous amounts of heterogeneous, semi-structured and unstructured data are continually generated at unprecedented speed and scale. Big data discloses the limitations of existing data mining techniques, and this has resulted in a number of new challenges related to big data mining. Real-world data is heterogeneous, incomplete and noisy, and data in large quantities will normally be inaccurate or unreliable.
These problems could be due to errors in the instruments that measure the data, or to human error. For data to have any value, it needs to be discoverable, accessible and usable, and the significance of these requirements only increases the need for effective and efficient mining of big data. Big data is an emerging trend, and the need for big data mining is rising in all science and engineering domains. With effective big data mining, it will be possible to provide the most relevant and most accurate social sensing feedback for a better understanding of our society in real time. In this section, we examine in detail the perceived challenges of big data mining, in order to seek ways of overcoming them and trigger revolutionary breakthroughs in business and commerce.

3.2 Data Mining Challenges due to the Intrinsically Distributed Complex Environment
The shift towards intrinsically distributed complex problem-solving environments is prompting a range of new data mining research and development problems, which can be classified into the following broad challenges:

i. Distributed data
Data for mining is usually stored in distributed computing environments and on heterogeneous platforms. For technical and organizational reasons, it is often impossible to gather all such data in a centralized location for processing. Consequently, there is a need to develop better algorithms, tools, and services that facilitate the mining of distributed data.

ii. Distributed operations
In order to facilitate seamless integration of various computing resources into distributed data mining systems for complex problem solving, new algorithms, tools, and other IT infrastructure need to be developed.

iii. Huge data
There is a continuous increase in the size of big data. Most machine learning algorithms were created to handle only a small training set, for instance a few hundred examples.
In order to use similar techniques on databases thousands of times bigger, much care must be taken. Having very much data is advantageous, since it will probably reveal relations that really exist, but the number of possible descriptions of such a dataset is enormous. Some possible
ways of coping with this problem are to design algorithms with lower complexity and to use heuristics to find the best classification rules. Simply using a faster computer is not always a good solution. There is a need to develop algorithms for mining large, massive and heterogeneous data sets, which increase continuously, with rapid growth in the complexity of data types, sources, and structures. Mining such data will require the development of new methodologies, algorithms, and tools. Ideally, however, sampling and parallelization are effective tools to attack this scalability problem.

iv. Data privacy, security, and governance
Automated data mining in distributed environments raises serious issues in terms of data privacy, security, and governance. There is a need to develop grid-based data mining technologies to address these issues.

v. User-friendliness
A good software system must ultimately hide its technological complexity from the user. To facilitate this, new user-friendly interfaces, software tools, and IT infrastructure will be needed in the areas of grid-supported workflow management, resource identification, allocation, and scheduling.

3.3 Data Mining Challenges due to Large, Unstructured, and Changing Datasets
A number of data mining challenges also exist that are basically a result of the large, unstructured and ever-changing datasets. Some of these challenges include the following:

i. Data integrity
Another challenge of distributed data management is that of data integrity. Data analysis can only be as good as the data being analyzed. A major implementation challenge is integrating conflicting or redundant data from different sources.
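One common fix for such conflicting duplicates is to keep the most recently entered copy of each field. A minimal sketch follows; the record fields, IDs and dates are invented for illustration.

```python
from datetime import date

# Hypothetical records for the same customers exported from two systems,
# with conflicting addresses and an "updated" timestamp on each copy.
records = [
    {"customer_id": "A17", "address": "12 Old Lane",  "updated": date(2016, 3, 1)},
    {"customer_id": "A17", "address": "4 New Street", "updated": date(2017, 9, 5)},
    {"customer_id": "B02", "address": "77 Hill Road", "updated": date(2017, 1, 12)},
]

merged = {}
for rec in records:
    key = rec["customer_id"]
    # Keep whichever copy carries the latest update timestamp.
    if key not in merged or rec["updated"] > merged[key]["updated"]:
        merged[key] = rec

print(merged["A17"]["address"])  # the newer of the two conflicting addresses
```

Real integration work also needs schema translation between the source systems; this sketch covers only the conflict-resolution step.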
For instance, if a bank maintains credit card accounts in different databases, the addresses (or even the names) of a single cardholder may differ in each database. Software is needed that translates data from one system to another and selects, for example, the address most recently entered.
ii. Interpretation of results
Data mining output may require experts to correctly interpret the results, which might otherwise be meaningless to the average database user.
iii. Multimedia data
Most previous data mining algorithms targeted only traditional data types (numeric, character, text, etc.). The use of multimedia data, such as is found in GIS databases, complicates or invalidates most of the traditional algorithms.
iv. High dimensionality
A conventional database schema usually comprises many different attributes, yet not all of them may be needed to solve a given data mining problem. Using the other attributes may only increase the overall complexity and decrease the efficiency of an algorithm. In fact, when many attributes (dimensions) are involved, it becomes difficult to determine which ones should be used. One solution to this high-dimensionality problem is to reduce the number of attributes, which is known as dimensionality reduction. However, determining which attributes are not needed is not always easy.
v. Noisy data
In a large database, many of the attribute values will be invalid, inexact, or incorrect, whether because of faulty instruments used to measure some property or because of human error during data entry. Noise in the data takes two forms: corrupted values and missing attribute values.
Corrupted values: Sometimes some of the values in the training set are altered from what they should have been. This may result in conflict between one or more tuples in the database and the rules already established.
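A crude but common way to flag such corrupted entries is a k-sigma test against the attribute's mean. The sketch below is illustrative only; the readings and the threshold k are assumptions, and real systems would use more robust statistics:

```python
from statistics import mean, pstdev

def flag_suspect_values(values, k=2.0):
    """Return values more than k standard deviations from the mean,
    treating them as potentially corrupted."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > k * s]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 250.0]  # one corrupted entry
suspects = flag_suspect_values(readings)        # flags 250.0
```

Note that a single gross outlier inflates the mean and standard deviation themselves, which is why robust alternatives (for example, median-based tests) are often preferred in practice.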
Such extreme values are usually regarded by the system as noise and are consequently ignored. The problem then is how to know whether these extreme values are correct or not.
Missing attribute values: One or more of the attribute values may be missing entirely. When an attribute value is missing for an object during classification, the system may check all matching rules and calculate the most probable classification. It is nevertheless advisable to correct all incorrect values before running data mining applications.
vi. Irrelevant data
Some attributes in the database might not be of interest to the data mining task being carried out.
vii. Missing data
During the pre-processing phase of knowledge discovery from databases, missing data may be replaced with estimates. This approach can, however, introduce errors and lead to invalid results from the overall data mining activity.
viii. Dynamic data
Big data is definitely not static, yet most data mining algorithms assume a static database. Using such algorithms therefore requires a complete re-run of the program whenever the database changes, even though most big data change continually. Data mining algorithms are therefore needed whose rules reflect the content of the database at all times, so that the best possible classification can always be made. Many existing data mining systems require that all the training examples be given at once; if something changes later, the whole learning process may have to be conducted again. The challenge is how data mining systems can avoid this and instead update their current rules according to the changes performed.

IV. OVERCOMING DATA MINING CHALLENGES IN BIG DATA
4.1 HACE Theorem: Modeling Big Data Characteristics
The HACE theorem is a fundamental theorem that uniquely models the characteristics of big data.
Big Data is heterogeneous and of very large volume, comes from autonomous sources under distributed and decentralized control, and exhibits complex and evolving relationships among data. These characteristics pose enormous challenges in extracting useful knowledge and information from big data. To explain them, the familiar parable of a number of blind men trying to size up an elephant comes to mind. The elephant represents Big Data in our context. The goal of each blind man is to draw a useful conclusion about the elephant, which of course depends on the part of the animal he touches. Since the knowledge each blind man extracts is limited to the information he collects, it is expected that the blind men will each conclude, independently and differently, that the elephant "feels" like a rope, a stone, a stick, a wall, a hose, and so on. To make the problem even more complex, assume that:
i. The elephant is growing very quickly, and its posture is constantly changing.
ii. Each blind man has his own information sources, possibly inaccurate and unreliable, that give him varying knowledge about what the elephant looks like (for example, one blind man may share his own inaccurate view of the elephant with his friend).
Such information sharing will certainly change the thinking of each blind man. Exploring information from Big Data is equivalent to the scenario illustrated above. It involves merging or integrating heterogeneous information from different sources (just like the blind men) to arrive at the best possible and most accurate knowledge of the information domain. This is certainly not as easy as asking each blind man about the elephant, or as drawing one single picture of the elephant from the joint opinion of the blind men.
This is because each data source may express itself in a different language and may even have confidentiality concerns about the information it holds, based on its own information exchange procedures. The HACE theorem therefore suggests the following key characteristics of Big Data:
a. Huge, with Heterogeneous and Diverse Data Sources
Big data is heterogeneous because different data collectors use their own protocols and schemas for recording knowledge. The nature of the information gathered, even from the same sources, therefore varies with the application and the collection procedure, resulting in diverse knowledge representations.
b. Autonomous Sources and Decentralized Control
With big data, each data source is distinct and autonomous, with distributed and decentralized control; there is no centralized control over the generation and collection of information. This setting is similar to the World Wide Web (WWW), where the function of each web server does not depend on the other servers.
c. Complex Data Relationships and Evolving Knowledge Associations
Traditionally, analysis in centralized information systems tries to discover the features that best represent each observation; each object is treated as an independent entity, without considering its social connections to other objects within or outside the domain. Yet relationships and correlations are among the most important factors of human society. In our dynamic society, individuals must be represented alongside their social ties and connections, which themselves evolve with temporal, spatial, and other factors. For example, the relationship among two or more Facebook friends is a complex relationship because new friends are added every day; maintaining the relationships among these friends therefore poses a huge challenge for developers.
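A toy model makes the evolving-relationship point concrete: every new friendship edge changes the neighborhood structure that a mining algorithm would have analyzed a moment earlier. The names below are hypothetical, and the structure is a deliberately minimal stand-in for a real social graph:

```python
from collections import defaultdict

class FriendGraph:
    """A toy evolving social graph: friendship edges are added over
    time, so any pattern mined from it can quickly go stale."""
    def __init__(self):
        self.friends = defaultdict(set)

    def add_friendship(self, a, b):
        # Friendship is mutual, so record the edge in both directions.
        self.friends[a].add(b)
        self.friends[b].add(a)

    def mutual_friends(self, a, b):
        # A simple relational feature that changes with every new edge.
        return self.friends[a] & self.friends[b]

g = FriendGraph()
g.add_friendship("ada", "ben")
g.add_friendship("ada", "cal")
g.add_friendship("ben", "cal")
```

Any statistic computed from such a graph, like the mutual-friend sets above, must be recomputed or incrementally maintained as edges arrive, which is exactly the maintenance burden described above.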
Other examples of complex data types are time-series information, maps, videos, and images.
4.2 Hadoop: An Effective Application of the HACE Theorem
As illustrated earlier, the HACE theorem models Big Data as: huge, with heterogeneous and diverse data sources representing diverse knowledge representations; autonomous sources with decentralized control, similar to the World Wide Web (WWW), where the function of each web server does not depend on the other servers; and complex data relationships with evolving knowledge associations. A popular open-source implementation of the HACE theorem is Apache Hadoop. Hadoop has the capacity to link a number of relevant, disparate datasets for analysis in order to reveal new patterns, trends, and insights.
4.3 Three-Tier Structure for Big Data Mining Models
We propose a conceptual framework for big data mining that follows a three-tier structure, as shown in figure 4.1 below:
Fig. 4.1. Three-tier structure in big data mining
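Tier 1 of this structure relies on parallel programming models such as Hadoop's MapReduce. Its data flow can be illustrated with a single-process word-count sketch; in a real Hadoop deployment the map and reduce calls would be distributed across a cluster, and the shuffle between them handled by the framework:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each word across all map outputs."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["big data mining", "big data challenges"]
# Each map_phase call could run on a separate node; here their outputs
# are chained together in one process to show the data flow.
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))
# counts == {"big": 2, "data": 2, "mining": 1, "challenges": 1}
```

The same map-then-reduce shape carries over to the arithmetic operations on distributed datasets that Tier 1 calls for.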
4.4 Interpretation
Tier 1: As shown in figure 4.1, Tier 1 concentrates on accessing big datasets and performing arithmetic operations on them. Big data cannot practically be stored in a single location, and storing it in diverse locations further increases the processing burden. An effective computing platform must therefore be in place to take up the distributed large-scale datasets and perform arithmetic operations on them, and achieving such common operations in a distributed computing environment requires a parallel computing architecture. The major challenge at Tier 1 is that a single personal computer cannot possibly handle big data mining because of the sheer quantity of data involved. To overcome this challenge, the concept of data distribution has to be used. For processing big data in a distributed environment, we propose the adoption of parallel programming models such as Hadoop's MapReduce technique.
Tier 2: Tier 2 focuses on the semantics and domain knowledge of the different Big Data applications. Such information benefits the data mining process at Tier 1 and the data mining algorithms at Tier 3 by adding certain technical barriers, checks, and balances to the process. The addition of technical barriers is necessary because information sharing and data privacy mechanisms between data producers and data consumers can differ across domain applications [9].
Tier 3: Algorithm design takes place at Tier 3. Big data mining algorithms help to tackle the difficulties raised by Big Data volumes, complexity, dynamic data characteristics, and distributed data. The algorithm at Tier 3 involves three iterative stages. The first is a pre-processing of all uncertain, sparse, heterogeneous, and multisource data.
The second is the mining of dynamic and complex data after the pre-processing operation. In the third, the global knowledge obtained by local learning, matched with all relevant information, is fed back into the pre-processing stage, while the model and its parameters are adjusted according to the feedback.

V. CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
In this research paper, we have examined some major challenges of big data mining, some of which result from the intrinsically distributed and complex environment, while others are due to the large, unstructured, and dynamic datasets. We observe that the gradual shift towards distributed complex problem-solving environments is prompting a range of new data mining research and development problems. We have also proffered solutions to the new challenges of big data mining by proposing the HACE theorem to fully harness the potential benefits of the big data revolution and to trigger a revolutionary breakthrough in commerce and industry. The research also proposed a three-tier data mining structure for big data that provides accurate and relevant social sensing feedback for a better understanding of our society in real time.
5.2 Recommendations
Based on our study and observations so far, we recommend that analysts revisit most of the data mining techniques in use today and develop distributed versions of the various data mining models available in order to meet the new challenges of big data. Developers should take advantage of available big data technologies with affordable, open-source, and easy-to-deploy platforms.

REFERENCES
[1] Ahmed R. and Karypis G., "Algorithms for Mining the Evolution of Conserved Relational States in Dynamic Networks," Knowledge and Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.
[2] Berkovich S. and Liao D., "On Clusterization of Big Data Streams," in 3rd International Conference on Computing for Geospatial Research and Applications, article no. 26.
ACM Press, New York (2012).
[3] Deepak S. Tamhane and Sultana N. Sayyad, "Big Data Analysis using HACE Theorem," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 4, issue 1, January 2015.
[4] Department of Finance and Deregulation, Australian Government, "Big Data Strategy - Issue Paper," March 2013.
[5] Dunren Che, Mejdl Safran, and Zhiyong Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities," Department of Computer Science, Southern Illinois University, Carbondale, Illinois 62901, USA, 2015.
[6] EYGM Limited, "Big Data: Changing the Way Businesses Compete and Operate," 2014, retrieved from www.ey.com/GRCinsignts.
[7] Longbing Cao, "Combined Mining: Analyzing Object and Pattern Relations for Discovering Actionable Complex Patterns," sponsored by Australian Research Council Discovery Grants, 2012.
[8] Prema Gadling and Mahip Bartere, "Implementing HACE Theorem for Big Data Processing," International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, 2013.
[9] Rajneesh Kumar Pandey and Uday A. Mande, "Survey on Data Mining and Information Security in Big Data using HACE Theorem," International Engineering Research Journal (IERJ), vol. 1, issue 11, 2016, ISSN 2395-1621.
[10] Sameeruddin Khan, Subbarao G., and Reddy G.V., "HACE Theorem Based Data Mining using Big Data," Research Inventy: International Journal of Engineering and Science, vol. 6, issue 5, May 2016, pp. 83-87.
[11] Sagiroglu S. and Sinanc D., "Big Data: A Review," in International Conference on Collaboration Technologies and Systems (CTS), pp. 42-47, 20-24 May 2013.
[12] Wei Fan and Albert Bifet, "Mining Big Data: Current Status and Forecast to the Future," vol. 14, issue 2, 2013.

AUTHOR'S PROFILE
Dr. Anthony Otuonye (anthony.otuonye@futo.edu.ng)
Dr.
Anthony Otuonye obtained a Bachelor of Science degree in Computer Science from Imo State University, Owerri, Nigeria, and a Master of Science degree in Information Technology from the Federal University of Technology, Owerri. He also holds a Ph.D. in Software Engineering from Nnamdi Azikiwe University, Awka, Nigeria. Dr. Anthony is currently a lecturer and researcher in the Department of Information Management Technology, Federal University of Technology, Owerri, Nigeria. He is a member of several professional bodies in his field and has authored many research articles, most of which have been published in renowned international journals. He is married to Mrs. Edit Otuonye, and they live with their children in Owerri, Imo State, Nigeria.