SlideShare a Scribd company logo
1 of 10
Download to read offline
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
DOI: 10.5121/ijdms.2016.8502 15
A NOVEL APPROACH FOR PROCESSING BIG DATA
D. Krishna Madhuri
Assistant Professor, Department of Computer Science and Engineering, GRIET, India
ABSTRACT
Applications of machine learning are widely used in the real world with either supervised or unsupervised
learning process. Recently emerged domain in the information technologies is Big Data which refers to
data with characteristics such as volume, velocity and variety. The existing machine learning approaches
cannot cope with Big Data. The processing of big data has to be done in an environment where distributed
programming is supported. In such environment like Hadoop, a distributed file system like Hadoop
Distributed File System (HDFS) is required to support scalable and efficient access to data. Distributed
environments are often associated with cloud computing and data centres. Naturally such environments are
equipped with GPUs (Graphical Processing Units) that support parallel processing. Thus the environment
is suitable for processing huge amount of data in short span of time. In this paper we propose a framework
that can have generic operations that support processing of big data. Our framework provides building
blocks to support clustering of unstructured data which is in the form of documents. We proposed an
algorithm that works in scheduling jobs of multiple users. We built a prototype application to demonstrate
the proof of concept. The empirical results revealed that the proposed framework shows 95% accuracy
when the results are compared with the ground truth.
KEYWORDS
Big data, Map Reduce, Hadoop, Distributed programming framework
1. INTRODUCTION
Big data, as the name indicates, is very voluminous data with other features such as velocity
(streaming data) and variety (data in different formats such as structured, unstructured and semi-
structured). When data is in the form of relational database, it is known as structured data. When
data is in the form of documents of any kind, it is known as unstructured data. When data is in the
form of XML, it is known as semi-structured data. The characteristics of big data are shown in
Figure 1 where it can be understood that velocity refers to speed, volume refers to volume and
variety refers to complexity. By processing big data it is possible to gain comprehensive business
intelligence that is needed by enterprises in the real world for making strategic decisions and
business growth.
Figure 1: Shows Characteristics of Big Data
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
16
Having understood what is big data it is essential to know why it needs to be mined. When big
data is analyzed, it is possible that important and hidden trends or patterns can be obtained. When
such data is not analyzed it may results in inaccurate business decisions. The ensuing sub section
throws light into the need for processing big data.
1.1 Need for Big Data Mining
Big data is the complete data of business with all details. When such data is processed in a
distributed environment, it is possible that it produces comprehensive business intelligence. If
partial data is processed, it may result in inaccurate business intelligence that cannot be used for
making expert decisions.
Figure 2 : Limited View of Users Provided Biased Conclusions
As shown in Figure 2, it is evident that people who are analyzing data were not able to produce
correct output. There is elephant over there and people has seen a part of it and understood
differently. For instance the leg of the animal is understood like a tree. Probably it is the act of
blind person who cannot see the complete picture but can touch and say what it is. Here it is very
obvious that biased conclusions are provided. These conclusions are wrong and they cannot help
in making well informed decisions. Thus it is understood that when whole data (complete data) is
considered, it can produce intelligence for making good decisions.
1.2 Big Data Evolution
Right from 1968 there has been evolution of big data. It has not happened in a year or two. It is
the continuous improvement in data analytics over a period of time. In 1968 Online Transaction
Processing (OLTP) was explored with day to day transactions stored in database and processed.
In 1983, data warehousing technology came into existence. This has helped to have historical data
to be obtained from OLTP data in order to use it for data mining. Thus historical data can be
processed in order to make business intelligence out of it. This phenomenon was named as Online
Analytical Processing (OLAP).
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real
Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data
analytics. The processing of data at rest, historical data and data in m
improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for
business analysis.
1.3 Map Reduce Programming
Map Reduce programming is a new programming approach based on object oriented
programming using Java programming language. It is the process of writing program with two
parts such as Map and Reduce. Map takes care of processing big data while reduce takes care
combining Map results provided by thousands of worker nodes in distributed environment. This
kind of programming is supported by cloud computing, data centres and the presence of modern
processors such as Graphical Processing Units (GPUs). The power of
leveraged with big data in such environments.
Figure 4:
The architecture shown in Figure 1 is related to Hadoop which is one of the distributed
programming frameworks. The Map
the figure. First of all the input files are divided into multiple parts and each part is given a
mapper present in worker nodes. The mapper produces its output. Then the intermediate files are
taken by other set of worker nodes for performing
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
Figure 3: Evolution of Big Data
As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real
Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data
analytics. The processing of data at rest, historical data and data in motion were the
improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for
Map Reduce Programming
Reduce programming is a new programming approach based on object oriented
programming using Java programming language. It is the process of writing program with two
parts such as Map and Reduce. Map takes care of processing big data while reduce takes care
combining Map results provided by thousands of worker nodes in distributed environment. This
kind of programming is supported by cloud computing, data centres and the presence of modern
processors such as Graphical Processing Units (GPUs). The power of parallel processing is
leveraged with big data in such environments.
Figure 4:Map Reduce programming paradigm
The architecture shown in Figure 1 is related to Hadoop which is one of the distributed
programming frameworks. The Map Reduce programming supported by Hadoop is illustrated in
the figure. First of all the input files are divided into multiple parts and each part is given a
mapper present in worker nodes. The mapper produces its output. Then the intermediate files are
by other set of worker nodes for performing Reduce task. Once the reduce task is
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
17
As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real
Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data
otion were the
improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for
Reduce programming is a new programming approach based on object oriented
programming using Java programming language. It is the process of writing program with two
parts such as Map and Reduce. Map takes care of processing big data while reduce takes care of
combining Map results provided by thousands of worker nodes in distributed environment. This
kind of programming is supported by cloud computing, data centres and the presence of modern
parallel processing is
The architecture shown in Figure 1 is related to Hadoop which is one of the distributed
Reduce programming supported by Hadoop is illustrated in
the figure. First of all the input files are divided into multiple parts and each part is given a
mapper present in worker nodes. The mapper produces its output. Then the intermediate files are
Once the reduce task is
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
18
completed, it is possible to have output files generated. Both input and output files are stored in
distributed file system Hadoop.
In the existing systems, machine learning algorithms were not optimized for big data processing.
In this paper we proposed a framework that supports generic machine learning process that is
processing of big data. Stated differently, the proposed system takes big data (documents) as
input and produces clusters. There are many operations involved that are related to processing
documents. The operations include identification of keywords, finding feature space, TF/IDF and
so on. Finally the resultant clusters. The remainder of the paper is structured as follows. Section II
provides review of literature. Section III presents the proposed system in detail. Section IV
presents experimental results while section V concludes the paper.
2. RELATED WORKS
Machine learning is the process of providing intelligence to programs so as to help them learn.
The learning process may be of two types known as supervised learning and unsupervised
learning. Various methods of machine learning can be found in [1], [2], [3], [4], [5] and [6].
These techniques are used in the real world. However, in the area of vision and natural language
processing there is still possibility for further research. Moreover, the existing machine learning
algorithms are not optimized for big data processing. The algorithms are to be developed keeping
the distributed programming environment like Hadoop and new programming paradigm such as
Map Reduce.
As explored in [7] Hadoop is a distributed programming framework that is used to process huge
amount of data. A distributed file system is associated with Hadoop to support scalable and
available processing of data. Hadoop can be compared with some in-memory products. However,
Hadoop provides superior performance than its in-memory counterparts discussed in [8] and [9].
There are other systems that are powerful and versatile with low level programming interfaces
[10], [11]. The problem with them is that they are specific and cannot provide general high level
programming interface, scheduling and other needed mechanisms.
Making models and working with models is found in Pregel [12] which is graph-centric platform.
It supports partitioning of models with in-built scheduling and mechanisms for consistency.
However there is no realization of its widespread usage. Lohr [13] opined that big data bring new
possibilities with machine learning algorithms. Moreover the big data is a wealth for making
strategies in business. For instance Google is using such data in order to drive its business. Chen
et al. [14] explored business analytics and intelligence on big data. Management information
systems are improved with big data science. Jacobs [15] opined that organizations need to be
careful about the pathologies of big data and carefully consider what exactly big data is. Chen et
al. [16] explored the possibilities of big data in future with a good survey of articles on big data.
They opined that big data can add big value to businesses when harnessed properly. Labrinidis
and Jagadish [17] investigated opportunities with big data such as ability to obtain comprehensive
business intelligence. They also found many challenges such as environment, algorithms, dealing
with different kinds of data and so on. Kraska [18] discussed about the increasing need of big data
and its processing which looks like finding a needle in haystack.
Agrawal et al. [19] explored the current state of the art on big data besides future opportunities.
They focused on scalable DBMS that can help in processing big data. Snijders et al. [20] opined
that knowledge gaps are filed with big data processing. Herodotou et al. [21] explored on big data
and found that it can help in agility and depth in information processing and obtaining business
intelligence. Cuzzocrea et al. [22] focused on big data and explored how big data revolution can
work with multi-dimensional data for business intelligence. Chen and Zhang [23] studied big data
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
19
process in the context of data-intensive applications to find methodologies that can help in
processing big data. Katal et al. [24] discussed on the tools already available for processing big
data. The tools like Hadoop and other related products help in big data analytics. Bizer et al. [25]
understood that four perspectives such as defining the problem, searching (processing),
transforming, and entity resolution are important to obtain meaningful information. In this paper
we proposed a generic framework that supports big data processing with machine learning
algorithms. It also supports scheduling jobs in multi-user environments.
3. PROPOSED FRAMEWORK FOR DISTRIBUTED MACHINE LEARNING
We proposed a machine learning framework which is generic in nature with high level
programming interface. It is meant for processing big data in multi-user environment. It provides
common operations need to process unstructured data. Besides it supports algorithm for
scheduling multiple jobs of users concurrently in a distributed environment. In fact the framework
can support any machine learning algorithm for clustering documents. It supports information
retrieval and natural language processing with machine learning to understand big data and
perform clustering of documents.
Figure 5: Generic Framework for Machine Learning on Big Data
The framework shown in Figure 5 is generic in nature and works with any set of documents that
constitute big data. The framework takes big data as input and produces clusters. The operations
involved in the framework include keyword identification, construction of feature space,
computing TF/IDF, similarity computation and clustering. Keywords are the important words
identified. The feature space is constructed in order to process the data. Afterwards, TF/IDF
measure is used to find statistics based on term frequency. This will help in finding similarity
between documents in order to make well-informed decisions on clustering. There are many
distance measures that can be used for finding similarities between documents. They are as
follows.
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
20
Listing 1: Shows Different Functions that are used as Similarity Measures
The listing 1 shows many similarity measures such as Jaccard function, cosine function,
Euclidean Distance, Extended Jacccard Function and Dice function. All the functions are capable
of supporting the similarity computation in big data processing. Especially in this paper they are
used for finding similarity between two documents. The similarity measure results in a value
between 0.0 to 1.0. The more this value is the more the similarity is between any two documents.
Here is the proposed algorithm for multi-user job scheduling.
Algorithm 1: Multi-User Job Scheduling Algorithm
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
The algorithm is responsible to schedule jobs in order to improve the productivity of big data
processing. The distributed machine learning
schedules them properly in such a way that they are processed in an optimi
waiting time concept is used to ensure that the jobs are given their turn and processing of big data
is carried out in efficient manner.
4. EXPERIMENTAL RESULTS
We built a custom simulator (prototype application) that simulates distributed programming
framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in
the processing and multiple jobs provided by many users simultaneously. T
with the application in terms of processing Big Data (documents) which is in unstructured format.
The results are observed in terms of number of concurrent users and the performance of the
proposed framework in accurate clustering of
with ground truth.
No. of
Users
10
Time (sec) 1
Table 1: Shows Performance in Terms o
As shown in Table 1, there is relation between number of users and the time taken to process big
data. In the simulated environment
performance in terms of time taken.
also caused more time to be taken.
Figure 6: Clustering
As shown in Figure 6, it is evident that
horizontal axis represents time taken in order
results reveals a fact that when number of users is increased the time taken for processing big data
is also increased proportionately.
the time taken for 10 users is 1 second.
0
2
4
6
8
10
12
14
16
18
10
Time(sec)
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
The algorithm is responsible to schedule jobs in order to improve the productivity of big data
processing. The distributed machine learning environment takes number of user jobs and
schedules them properly in such a way that they are processed in an optimized fashion. The
waiting time concept is used to ensure that the jobs are given their turn and processing of big data
manner.
ESULTS
We built a custom simulator (prototype application) that simulates distributed programming
framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in
the processing and multiple jobs provided by many users simultaneously. The framework is tested
with the application in terms of processing Big Data (documents) which is in unstructured format.
observed in terms of number of concurrent users and the performance of the
proposed framework in accurate clustering of documents. The clustered documents are evaluated
20 30 40 50 60 70 80 90 100
2.5 5 7.5 10 12.5 13.5 14 16 17
Performance in Terms of Workload VS. Time Taken
As shown in Table 1, there is relation between number of users and the time taken to process big
data. In the simulated environment multiple users and their jobs are considered
performance in terms of time taken. The results revealed that when number of users increase, it
more time to be taken.
Clustering Performance in Multi-User Environment
, it is evident that number of users is represented by vertical axis while the
time taken in order to perform clustering of big data. The
a fact that when number of users is increased the time taken for processing big data
is also increased proportionately. When number of users is 100 the time taken is 17 seconds while
the time taken for 10 users is 1 second.
10 20 30 40 50 60 70 80 90 100
Workload (No. of Users)
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
21
The algorithm is responsible to schedule jobs in order to improve the productivity of big data
takes number of user jobs and
zed fashion. The
waiting time concept is used to ensure that the jobs are given their turn and processing of big data
We built a custom simulator (prototype application) that simulates distributed programming
framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in
he framework is tested
with the application in terms of processing Big Data (documents) which is in unstructured format.
observed in terms of number of concurrent users and the performance of the
The clustered documents are evaluated
As shown in Table 1, there is relation between number of users and the time taken to process big
considered to measure
umber of users increase, it
number of users is represented by vertical axis while the
to perform clustering of big data. The trend in the
a fact that when number of users is increased the time taken for processing big data
seconds while
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
5. EVALUATION
Human experts are involved in evaluating the work of the proposed framework. The clustering of
big is done manually by spending time on a portion of big data in order to establish ground truth.
The ground truth is then compared with the performance of the system. Thus the proposed
framework with underlying algorithm is evaluated.
Table
As shown in Figure 7, the proposed framework performance is presented in the form of true
positives and false positives. The system has showed 95% true positives and
This is carried out based on the ground truth values provided by human experts. The results
revealed that the proposed system can be used effectively for processing big data.
needs further refinement in order to make it more
Figure
The results shown in Figure 8 reveal that there is performance difference between tag
search and TF/IDF based search.
percentage performance is shown in vertical axis. There is slight difference between the two
kinds of searches in performance.
0
20
40
60
80
100
Percentage
0%
20%
40%
60%
80%
100%
Top-5
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October
Human experts are involved in evaluating the work of the proposed framework. The clustering of
by spending time on a portion of big data in order to establish ground truth.
The ground truth is then compared with the performance of the system. Thus the proposed
algorithm is evaluated.
True
Positives
False
Positives
95 5
Table 1: Shows Results of Evaluation
Figure 7: Evaluation Results
, the proposed framework performance is presented in the form of true
The system has showed 95% true positives and 5% false positives.
This is carried out based on the ground truth values provided by human experts. The results
revealed that the proposed system can be used effectively for processing big data.
needs further refinement in order to make it more useful and add value to enterprises.
Figure 8: TF/IDF and Tab Based Search
The results shown in Figure 8 reveal that there is performance difference between tag
search and TF/IDF based search. The top-n values are represented by horizontal axis while the
percentage performance is shown in vertical axis. There is slight difference between the two
kinds of searches in performance.
True Positives False Positives
Measures
Top-10 Top-15
Tag-based Search
TF/IDF
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
22
Human experts are involved in evaluating the work of the proposed framework. The clustering of
by spending time on a portion of big data in order to establish ground truth.
The ground truth is then compared with the performance of the system. Thus the proposed
, the proposed framework performance is presented in the form of true
5% false positives.
This is carried out based on the ground truth values provided by human experts. The results
revealed that the proposed system can be used effectively for processing big data. However, it
useful and add value to enterprises.
The results shown in Figure 8 reveal that there is performance difference between tag-based
n values are represented by horizontal axis while the
percentage performance is shown in vertical axis. There is slight difference between the two
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
23
6. CONCLUSIONS AND FUTURE WORK
In this paper we proposed a framework that supports distributed programming. The framework
supports processing of big data with its pre-defined building blocks. It supports machine learning
approach with natural language processing in order to perform clustering of given big data
(documents). It makes use of Map Reduce programming paradigm in a simulated environment.
The framework has provision for generic functionalities that can be reused by cloud users in the
real world. Unstructured data is subjected to machine learning in order to obtain intelligence
which is used to determine clusters. Recently emerged domain in the information technologies is
Big Data which refers to data with characteristics such as volume, velocity and variety. The
existing machine learning approaches cannot cope with Big Data. In this paper our framework
with underlying algorithm supports scheduling jobs of multiple users concurrently. We built a
custom simulator that demonstrates processing of big data in distributed environment. Our
empirical results revealed that the performance of proposed framework is high in terms of
accuracy when compared with that of ground truth. This research can be extended further to
improve the framework to support different kinds of big data besides supporting multiple data
mining operations on big data.
REFERENCES
[1] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordinate descent for l1-regularized
loss minimization,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 321–328.
[2] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in Proc. Adv. Neural Inf.
Process. Syst., 2011, pp. 873–881.
[3] M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are fast,” in Proc. Adv. Neural Inf.
Process. Syst., 2009, pp. 2331–2339.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P.
Tucker, K. Yang, and A. Ng, “Large scale distributed deep networks,” in Proc. Adv. Neural Inf.
Process. Syst., 2012, pp. 1232–1240.
[5] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” J. Mach.
Learn. Res., vol. 14, pp. 1303–1347, 2013.
[6] S. A. Williamson, A. Dubey, and E. P. Xing, “Parallel Markov chain Monte Carlo for nonparametric
mixture models,” in Proc.Int. Conf. Mach. Learn., 2013, pp. 98–106.
[7] T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, 2012
[8] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “Distributed
GraphLab: A framework for machine learning and data mining in the cloud,” in Proc. VLDB
Endowment, vol. 5, pp. 716–727, 2012.
[9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with
working sets,” in Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., 2010, p. 10.
[10] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and
B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in Proc. 11th USENIX
Conf. Operating Syst. Des. Implementation, 2014, pp. 583–598.
[11] R. Power and J. Li, “Piccolo: Building fast, distributed programs with partitioned tables,” in Proc.
USENIX Conf. Operating Syst. Des. Implementation, article 10, 2010, pp. 1–14.
[12] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel:
A system for large-scale graph processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010,
pp. 135–146.
[13] STEVE LOHR. (2012). The Age of Big Data. Big Data’s Impact in the World, p1-5.
[14] Hsinchun Chen,Roger H. L. Chiang and Veda C. Storey. (2012). BUSINESS INTELLIGENCE AND
ANALYTICS: FROM BIG DATA TO BIG IMPACT. MIS Quarterly. 36 (4), p1165-1188.
[15] Adam Jacobs. (2009). The Pathologies of Big Data. communicati ons of the acm. 52 (8), p1-9.
[16] Min Chen ,Shiwen Mao and Yunhao Liu. (2014). Big Data: A Survey.Springer Science+Business
Media, p1-39.
[17] Alexandros Labrinidis and H. V. Jagadish. (2012). Challenges and Opportunities with Big Data.
Proceedings of the VLDB Endowment. 5 (12), p1-2.
[18] Tim Kraska. (2013). Finding the Needle in the Big Data Systems Haystack. IEEE Computer Society,
p1-3.
International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016
24
[19] Divyakant Agrawal,Sudipto Das and Amr El Abbadi. (2011). Big Data and Cloud Computing:
Current State and Future Opportunities. ACM, p1-4.
[20] Chris Snijders, Uwe Matzat and Ulf-Dietrich Reips. (2012). “Big Data”: Big Gaps of Knowledge in
the Field of Internet Science. International Journal of Internet Science. 7 (1), p1-5.
[21] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin
and Shivnath Babu. (2011). Starfish: A Selftuning System for Big Data Analytics. Biennial
Conference on Innovative Data Systems Research, p1-12.
[22] Alfredo Cuzzocrea,Il-Yeol Song and Karen C. Davis. (2011). Analytics over Large-Scale
Multidimensional Data: The Big Data Revolution.ACM, p1-3.
[23] C.L. Philip Chen and Chun-Yang Zhang. (2014). Data-intensive applications, challenges, techniques
and technologies: A survey on Big Data. Information Sciences, p1-34.
[24] Avita Katal,Mohammad Wazid and R H Goudar. (2013). Big Data: Issues, Challenges, Tools and
Good Practices. IEEE. . (.), p1-6.
[25] Christian Bizer, Peter Boncz, Michael L. Brodie and Orri Erling. (2011). The Meaningful Use of Big
Data: Four Perspectives – Four Challenges.SIGMOD Record. 40 (4), p1-5.
AUTHORS
D.Krishna Madhuri Received her B.Tech degree in computer Science and
engineering from Swarnandhra college of Engineering & Technology, Narsapur,
Andhra Pradesh in 2009,and M.Tech Degree in Computer Science and Engineering
from Sridevi Women’s Engineering college Hyderabad, Telangana in 2012. She is
currently pursuing her Ph.D degree from Sri Satyasai University, Sehore, Madhya
Pradesh. Currently she is working as Assistant Professor in the department of
Computer Science and Engineering of Gokaraju Rangaraju Institute of Engineering
and Technology, Hyderabad, Telangana India. Herresearch interest includes Data
Bases, Data Mining and Big Data.

More Related Content

What's hot

BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSaciijournal
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentIJERA Editor
 
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET Journal
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...IJCSIS Research Publications
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...IJECEIAES
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageIRJET Journal
 
6.a survey on big data challenges in the context of predictive
6.a survey on big data challenges in the context of predictive6.a survey on big data challenges in the context of predictive
6.a survey on big data challenges in the context of predictiveEditorJST
 
Principles of GIS unit 2
Principles of GIS unit 2Principles of GIS unit 2
Principles of GIS unit 2SanjanaKhemka1
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17AnwarrChaudary
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAIJMIT JOURNAL
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...IJECEIAES
 
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...IJAAS Team
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecturehasanshan
 

What's hot (20)

U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud Environment
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
 
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data...
 
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their Usage
 
6.a survey on big data challenges in the context of predictive
6.a survey on big data challenges in the context of predictive6.a survey on big data challenges in the context of predictive
6.a survey on big data challenges in the context of predictive
 
Principles of GIS unit 2
Principles of GIS unit 2Principles of GIS unit 2
Principles of GIS unit 2
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...
 
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...
Workload Aware Incremental Repartitioning of NoSQL for Online Transactional P...
 
T9
T9T9
T9
 
B1803031217
B1803031217B1803031217
B1803031217
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
 
Lec1
Lec1Lec1
Lec1
 

Viewers also liked

Un área verde para todos
Un área verde para todosUn área verde para todos
Un área verde para todosdec-admin
 
Fortalecimiento de valores
Fortalecimiento de valoresFortalecimiento de valores
Fortalecimiento de valoresdec-admin
 
Kromatografi gas ii revisi
Kromatografi gas ii revisiKromatografi gas ii revisi
Kromatografi gas ii revisiRafi Mariska
 
411. acciones focalizadas al reciclaje
411. acciones focalizadas al reciclaje411. acciones focalizadas al reciclaje
411. acciones focalizadas al reciclajedec-admin
 
256.no + bullying
256.no + bullying256.no + bullying
256.no + bullyingdec-admin
 
Cem 350 hud residential energy conservation 9 2016
Cem 350 hud residential energy conservation 9 2016Cem 350 hud residential energy conservation 9 2016
Cem 350 hud residential energy conservation 9 2016miresmaeil
 
Instructie leerlingen olz nov2016
Instructie leerlingen olz nov2016Instructie leerlingen olz nov2016
Instructie leerlingen olz nov2016Maykel Kicken
 
63. una escuela limpia y preocupada por el deterioro ambiental
63. una escuela limpia y preocupada por el deterioro ambiental63. una escuela limpia y preocupada por el deterioro ambiental
63. una escuela limpia y preocupada por el deterioro ambientaldec-admin
 
183.participación en la formación de huertos escolares y cuidado de la flora ...
183.participación en la formación de huertos escolares y cuidado de la flora ...183.participación en la formación de huertos escolares y cuidado de la flora ...
183.participación en la formación de huertos escolares y cuidado de la flora ...dec-admin
 
Connecting with different e learning communities
Connecting with different e learning communitiesConnecting with different e learning communities
Connecting with different e learning communitiesTanzil Al Gazmir
 
Pequeñas cosas que hacen grandes diferencias
Pequeñas cosas que hacen grandes diferenciasPequeñas cosas que hacen grandes diferencias
Pequeñas cosas que hacen grandes diferenciasdec-admin
 
179.iniciativa juvenil
179.iniciativa juvenil179.iniciativa juvenil
179.iniciativa juvenildec-admin
 
Week 1 project (corrected)
Week 1 project (corrected)Week 1 project (corrected)
Week 1 project (corrected)Anthony Clay
 
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...ijdms
 
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...ijdms
 

Viewers also liked (20)

PAFR 2014_Web
PAFR 2014_WebPAFR 2014_Web
PAFR 2014_Web
 
Transports - Cars
Transports - CarsTransports - Cars
Transports - Cars
 
Un área verde para todos
Un área verde para todosUn área verde para todos
Un área verde para todos
 
Fortalecimiento de valores
Fortalecimiento de valoresFortalecimiento de valores
Fortalecimiento de valores
 
Kromatografi gas ii revisi
Kromatografi gas ii revisiKromatografi gas ii revisi
Kromatografi gas ii revisi
 
Mansi CV Updated
Mansi CV UpdatedMansi CV Updated
Mansi CV Updated
 
411. acciones focalizadas al reciclaje
411. acciones focalizadas al reciclaje411. acciones focalizadas al reciclaje
411. acciones focalizadas al reciclaje
 
256.no + bullying
256.no + bullying256.no + bullying
256.no + bullying
 
Cem 350 hud residential energy conservation 9 2016
Cem 350 hud residential energy conservation 9 2016Cem 350 hud residential energy conservation 9 2016
Cem 350 hud residential energy conservation 9 2016
 
Instructie leerlingen olz nov2016
Instructie leerlingen olz nov2016Instructie leerlingen olz nov2016
Instructie leerlingen olz nov2016
 
63. una escuela limpia y preocupada por el deterioro ambiental
63. una escuela limpia y preocupada por el deterioro ambiental63. una escuela limpia y preocupada por el deterioro ambiental
63. una escuela limpia y preocupada por el deterioro ambiental
 
Voter turnout
Voter turnoutVoter turnout
Voter turnout
 
183.participación en la formación de huertos escolares y cuidado de la flora ...
183.participación en la formación de huertos escolares y cuidado de la flora ...183.participación en la formación de huertos escolares y cuidado de la flora ...
183.participación en la formación de huertos escolares y cuidado de la flora ...
 
PAFR 2015_Web
PAFR 2015_WebPAFR 2015_Web
PAFR 2015_Web
 
Connecting with different e learning communities
Connecting with different e learning communitiesConnecting with different e learning communities
Connecting with different e learning communities
 
Pequeñas cosas que hacen grandes diferencias
Pequeñas cosas que hacen grandes diferenciasPequeñas cosas que hacen grandes diferencias
Pequeñas cosas que hacen grandes diferencias
 
179.iniciativa juvenil
179.iniciativa juvenil179.iniciativa juvenil
179.iniciativa juvenil
 
Week 1 project (corrected)
Week 1 project (corrected)Week 1 project (corrected)
Week 1 project (corrected)
 
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...
DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS OF CONCURRENCY CONTROL ALGORI...
 
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...
A PRUDENT-PRECEDENCE CONCURRENCY CONTROL PROTOCOL FOR HIGH DATA CONTENTION DA...
 

Similar to A NOVEL APPROACH FOR PROCESSING BIG DATA

A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportAhmad El Tawil
 
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce ijujournal
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEijujournal
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoopdbpublications
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Rohit Srivastava
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics iosrjce
 

Similar to A NOVEL APPROACH FOR PROCESSING BIG DATA (20)

A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
IJET-V2I6P25
IJET-V2I6P25IJET-V2I6P25
IJET-V2I6P25
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce
 
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Big data
Big dataBig data
Big data
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Big Data
Big DataBig Data
Big Data
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)
 
B017320612
B017320612B017320612
B017320612
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics
 

Recently uploaded

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

A NOVEL APPROACH FOR PROCESSING BIG DATA

  • 1. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 DOI: 10.5121/ijdms.2016.8502 15 A NOVEL APPROACH FOR PROCESSING BIG DATA D. Krishna Madhuri Assistant Professor, Department of Computer Science and Engineering, GRIET, India ABSTRACT Applications of machine learning are widely used in the real world with either supervised or unsupervised learning process. Recently emerged domain in the information technologies is Big Data which refers to data with characteristics such as volume, velocity and variety. The existing machine learning approaches cannot cope with Big Data. The processing of big data has to be done in an environment where distributed programming is supported. In such environment like Hadoop, a distributed file system like Hadoop Distributed File System (HDFS) is required to support scalable and efficient access to data. Distributed environments are often associated with cloud computing and data centres. Naturally such environments are equipped with GPUs (Graphical Processing Units) that support parallel processing. Thus the environment is suitable for processing huge amount of data in short span of time. In this paper we propose a framework that can have generic operations that support processing of big data. Our framework provides building blocks to support clustering of unstructured data which is in the form of documents. We proposed an algorithm that works in scheduling jobs of multiple users. We built a prototype application to demonstrate the proof of concept. The empirical results revealed that the proposed framework shows 95% accuracy when the results are compared with the ground truth. KEYWORDS Big data, Map Reduce, Hadoop, Distributed programming framework 1. INTRODUCTION Big data, as the name indicates, is very voluminous data with other features such as velocity (streaming data) and variety (data in different formats such as structured, unstructured and semi- structured). When data is in the form of relational database, it is known as structured data. When data is in the form of documents of any kind, it is known as unstructured data. When data is in the form of XML, it is known as semi-structured data. The characteristics of big data are shown in Figure 1 where it can be understood that velocity refers to speed, volume refers to volume and variety refers to complexity. By processing big data it is possible to gain comprehensive business intelligence that is needed by enterprises in the real world for making strategic decisions and business growth. Figure 1: Shows Characteristics of Big Data
  • 2. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 16 Having understood what is big data it is essential to know why it needs to be mined. When big data is analyzed, it is possible that important and hidden trends or patterns can be obtained. When such data is not analyzed it may results in inaccurate business decisions. The ensuing sub section throws light into the need for processing big data. 1.1 Need for Big Data Mining Big data is the complete data of business with all details. When such data is processed in a distributed environment, it is possible that it produces comprehensive business intelligence. If partial data is processed, it may result in inaccurate business intelligence that cannot be used for making expert decisions. Figure 2 : Limited View of Users Provided Biased Conclusions As shown in Figure 2, it is evident that people who are analyzing data were not able to produce correct output. There is elephant over there and people has seen a part of it and understood differently. For instance the leg of the animal is understood like a tree. Probably it is the act of blind person who cannot see the complete picture but can touch and say what it is. Here it is very obvious that biased conclusions are provided. These conclusions are wrong and they cannot help in making well informed decisions. Thus it is understood that when whole data (complete data) is considered, it can produce intelligence for making good decisions. 1.2 Big Data Evolution Right from 1968 there has been evolution of big data. It has not happened in a year or two. It is the continuous improvement in data analytics over a period of time. In 1968 Online Transaction Processing (OLTP) was explored with day to day transactions stored in database and processed. In 1983, data warehousing technology came into existence. This has helped to have historical data to be obtained from OLTP data in order to use it for data mining. Thus historical data can be processed in order to make business intelligence out of it. This phenomenon was named as Online Analytical Processing (OLAP).
  • 3. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data analytics. The processing of data at rest, historical data and data in m improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for business analysis. 1.3 Map Reduce Programming Map Reduce programming is a new programming approach based on object oriented programming using Java programming language. It is the process of writing program with two parts such as Map and Reduce. Map takes care of processing big data while reduce takes care combining Map results provided by thousands of worker nodes in distributed environment. This kind of programming is supported by cloud computing, data centres and the presence of modern processors such as Graphical Processing Units (GPUs). The power of leveraged with big data in such environments. Figure 4: The architecture shown in Figure 1 is related to Hadoop which is one of the distributed programming frameworks. The Map the figure. First of all the input files are divided into multiple parts and each part is given a mapper present in worker nodes. The mapper produces its output. Then the intermediate files are taken by other set of worker nodes for performing International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October Figure 3: Evolution of Big Data As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data analytics. The processing of data at rest, historical data and data in motion were the improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for Map Reduce Programming Reduce programming is a new programming approach based on object oriented programming using Java programming language. It is the process of writing program with two parts such as Map and Reduce. Map takes care of processing big data while reduce takes care combining Map results provided by thousands of worker nodes in distributed environment. This kind of programming is supported by cloud computing, data centres and the presence of modern processors such as Graphical Processing Units (GPUs). The power of parallel processing is leveraged with big data in such environments. Figure 4:Map Reduce programming paradigm The architecture shown in Figure 1 is related to Hadoop which is one of the distributed programming frameworks. The Map Reduce programming supported by Hadoop is illustrated in the figure. First of all the input files are divided into multiple parts and each part is given a mapper present in worker nodes. The mapper produces its output. Then the intermediate files are by other set of worker nodes for performing Reduce task. Once the reduce task is International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 17 As shown in Figure 3, it is evident that the evolution is from OLTP to OLAP to RTAP. Real Time Analytical Processing (RTAP) is the subject pertaining to stream processing and big data otion were the improvements from 1968 to 2010. Finally it resulted in big data and its real time processing for Reduce programming is a new programming approach based on object oriented programming using Java programming language. It is the process of writing program with two parts such as Map and Reduce. Map takes care of processing big data while reduce takes care of combining Map results provided by thousands of worker nodes in distributed environment. This kind of programming is supported by cloud computing, data centres and the presence of modern parallel processing is The architecture shown in Figure 1 is related to Hadoop which is one of the distributed Reduce programming supported by Hadoop is illustrated in the figure. First of all the input files are divided into multiple parts and each part is given a mapper present in worker nodes. The mapper produces its output. Then the intermediate files are Once the reduce task is
  • 4. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 18 completed, it is possible to have output files generated. Both input and output files are stored in distributed file system Hadoop. In the existing systems, machine learning algorithms were not optimized for big data processing. In this paper we proposed a framework that supports generic machine learning process that is processing of big data. Stated differently, the proposed system takes big data (documents) as input and produces clusters. There are many operations involved that are related to processing documents. The operations include identification of keywords, finding feature space, TF/IDF and so on. Finally the resultant clusters. The remainder of the paper is structured as follows. Section II provides review of literature. Section III presents the proposed system in detail. Section IV presents experimental results while section V concludes the paper. 2. RELATED WORKS Machine learning is the process of providing intelligence to programs so as to help them learn. The learning process may be of two types known as supervised learning and unsupervised learning. Various methods of machine learning can be found in [1], [2], [3], [4], [5] and [6]. These techniques are used in the real world. However, in the area of vision and natural language processing there is still possibility for further research. Moreover, the existing machine learning algorithms are not optimized for big data processing. The algorithms are to be developed keeping the distributed programming environment like Hadoop and new programming paradigm such as Map Reduce. As explored in [7] Hadoop is a distributed programming framework that is used to process huge amount of data. A distributed file system is associated with Hadoop to support scalable and available processing of data. Hadoop can be compared with some in-memory products. However, Hadoop provides superior performance than its in-memory counterparts discussed in [8] and [9]. There are other systems that are powerful and versatile with low level programming interfaces [10], [11]. The problem with them is that they are specific and cannot provide general high level programming interface, scheduling and other needed mechanisms. Making models and working with models is found in Pregel [12] which is graph-centric platform. It supports partitioning of models with in-built scheduling and mechanisms for consistency. However there is no realization of its widespread usage. Lohr [13] opined that big data bring new possibilities with machine learning algorithms. Moreover the big data is a wealth for making strategies in business. For instance Google is using such data in order to drive its business. Chen et al. [14] explored business analytics and intelligence on big data. Management information systems are improved with big data science. Jacobs [15] opined that organizations need to be careful about the pathologies of big data and carefully consider what exactly big data is. Chen et al. [16] explored the possibilities of big data in future with a good survey of articles on big data. They opined that big data can add big value to businesses when harnessed properly. Labrinidis and Jagadish [17] investigated opportunities with big data such as ability to obtain comprehensive business intelligence. They also found many challenges such as environment, algorithms, dealing with different kinds of data and so on. Kraska [18] discussed about the increasing need of big data and its processing which looks like finding a needle in haystack. Agrawal et al. [19] explored the current state of the art on big data besides future opportunities. They focused on scalable DBMS that can help in processing big data. Snijders et al. [20] opined that knowledge gaps are filed with big data processing. Herodotou et al. [21] explored on big data and found that it can help in agility and depth in information processing and obtaining business intelligence. Cuzzocrea et al. [22] focused on big data and explored how big data revolution can work with multi-dimensional data for business intelligence. Chen and Zhang [23] studied big data
  • 5. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 19 process in the context of data-intensive applications to find methodologies that can help in processing big data. Katal et al. [24] discussed on the tools already available for processing big data. The tools like Hadoop and other related products help in big data analytics. Bizer et al. [25] understood that four perspectives such as defining the problem, searching (processing), transforming, and entity resolution are important to obtain meaningful information. In this paper we proposed a generic framework that supports big data processing with machine learning algorithms. It also supports scheduling jobs in multi-user environments. 3. PROPOSED FRAMEWORK FOR DISTRIBUTED MACHINE LEARNING We proposed a machine learning framework which is generic in nature with high level programming interface. It is meant for processing big data in multi-user environment. It provides common operations need to process unstructured data. Besides it supports algorithm for scheduling multiple jobs of users concurrently in a distributed environment. In fact the framework can support any machine learning algorithm for clustering documents. It supports information retrieval and natural language processing with machine learning to understand big data and perform clustering of documents. Figure 5: Generic Framework for Machine Learning on Big Data The framework shown in Figure 5 is generic in nature and works with any set of documents that constitute big data. The framework takes big data as input and produces clusters. The operations involved in the framework include keyword identification, construction of feature space, computing TF/IDF, similarity computation and clustering. Keywords are the important words identified. The feature space is constructed in order to process the data. Afterwards, TF/IDF measure is used to find statistics based on term frequency. This will help in finding similarity between documents in order to make well-informed decisions on clustering. There are many distance measures that can be used for finding similarities between documents. They are as follows.
  • 6. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 20 Listing 1: Shows Different Functions that are used as Similarity Measures The listing 1 shows many similarity measures such as Jaccard function, cosine function, Euclidean Distance, Extended Jacccard Function and Dice function. All the functions are capable of supporting the similarity computation in big data processing. Especially in this paper they are used for finding similarity between two documents. The similarity measure results in a value between 0.0 to 1.0. The more this value is the more the similarity is between any two documents. Here is the proposed algorithm for multi-user job scheduling. Algorithm 1: Multi-User Job Scheduling Algorithm
  • 7. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October The algorithm is responsible to schedule jobs in order to improve the productivity of big data processing. The distributed machine learning schedules them properly in such a way that they are processed in an optimi waiting time concept is used to ensure that the jobs are given their turn and processing of big data is carried out in efficient manner. 4. EXPERIMENTAL RESULTS We built a custom simulator (prototype application) that simulates distributed programming framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in the processing and multiple jobs provided by many users simultaneously. T with the application in terms of processing Big Data (documents) which is in unstructured format. The results are observed in terms of number of concurrent users and the performance of the proposed framework in accurate clustering of with ground truth. No. of Users 10 Time (sec) 1 Table 1: Shows Performance in Terms o As shown in Table 1, there is relation between number of users and the time taken to process big data. In the simulated environment performance in terms of time taken. also caused more time to be taken. Figure 6: Clustering As shown in Figure 6, it is evident that horizontal axis represents time taken in order results reveals a fact that when number of users is increased the time taken for processing big data is also increased proportionately. the time taken for 10 users is 1 second. 0 2 4 6 8 10 12 14 16 18 10 Time(sec) International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October The algorithm is responsible to schedule jobs in order to improve the productivity of big data processing. The distributed machine learning environment takes number of user jobs and schedules them properly in such a way that they are processed in an optimized fashion. The waiting time concept is used to ensure that the jobs are given their turn and processing of big data manner. ESULTS We built a custom simulator (prototype application) that simulates distributed programming framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in the processing and multiple jobs provided by many users simultaneously. The framework is tested with the application in terms of processing Big Data (documents) which is in unstructured format. observed in terms of number of concurrent users and the performance of the proposed framework in accurate clustering of documents. The clustered documents are evaluated 20 30 40 50 60 70 80 90 100 2.5 5 7.5 10 12.5 13.5 14 16 17 Performance in Terms of Workload VS. Time Taken As shown in Table 1, there is relation between number of users and the time taken to process big data. In the simulated environment multiple users and their jobs are considered performance in terms of time taken. The results revealed that when number of users increase, it more time to be taken. Clustering Performance in Multi-User Environment , it is evident that number of users is represented by vertical axis while the time taken in order to perform clustering of big data. The a fact that when number of users is increased the time taken for processing big data is also increased proportionately. When number of users is 100 the time taken is 17 seconds while the time taken for 10 users is 1 second. 10 20 30 40 50 60 70 80 90 100 Workload (No. of Users) International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 21 The algorithm is responsible to schedule jobs in order to improve the productivity of big data takes number of user jobs and zed fashion. The waiting time concept is used to ensure that the jobs are given their turn and processing of big data We built a custom simulator (prototype application) that simulates distributed programming framework such as Hadoop with Map and Reduce functionalities. It supports multiple nodes in he framework is tested with the application in terms of processing Big Data (documents) which is in unstructured format. observed in terms of number of concurrent users and the performance of the The clustered documents are evaluated As shown in Table 1, there is relation between number of users and the time taken to process big considered to measure umber of users increase, it number of users is represented by vertical axis while the to perform clustering of big data. The trend in the a fact that when number of users is increased the time taken for processing big data seconds while
  • 8. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 5. EVALUATION Human experts are involved in evaluating the work of the proposed framework. The clustering of big is done manually by spending time on a portion of big data in order to establish ground truth. The ground truth is then compared with the performance of the system. Thus the proposed framework with underlying algorithm is evaluated. Table As shown in Figure 7, the proposed framework performance is presented in the form of true positives and false positives. The system has showed 95% true positives and This is carried out based on the ground truth values provided by human experts. The results revealed that the proposed system can be used effectively for processing big data. needs further refinement in order to make it more Figure The results shown in Figure 8 reveal that there is performance difference between tag search and TF/IDF based search. percentage performance is shown in vertical axis. There is slight difference between the two kinds of searches in performance. 0 20 40 60 80 100 Percentage 0% 20% 40% 60% 80% 100% Top-5 International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October Human experts are involved in evaluating the work of the proposed framework. The clustering of by spending time on a portion of big data in order to establish ground truth. The ground truth is then compared with the performance of the system. Thus the proposed algorithm is evaluated. True Positives False Positives 95 5 Table 1: Shows Results of Evaluation Figure 7: Evaluation Results , the proposed framework performance is presented in the form of true The system has showed 95% true positives and 5% false positives. This is carried out based on the ground truth values provided by human experts. The results revealed that the proposed system can be used effectively for processing big data. needs further refinement in order to make it more useful and add value to enterprises. Figure 8: TF/IDF and Tab Based Search The results shown in Figure 8 reveal that there is performance difference between tag search and TF/IDF based search. The top-n values are represented by horizontal axis while the percentage performance is shown in vertical axis. There is slight difference between the two kinds of searches in performance. True Positives False Positives Measures Top-10 Top-15 Tag-based Search TF/IDF International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 22 Human experts are involved in evaluating the work of the proposed framework. The clustering of by spending time on a portion of big data in order to establish ground truth. The ground truth is then compared with the performance of the system. Thus the proposed , the proposed framework performance is presented in the form of true 5% false positives. This is carried out based on the ground truth values provided by human experts. The results revealed that the proposed system can be used effectively for processing big data. However, it useful and add value to enterprises. The results shown in Figure 8 reveal that there is performance difference between tag-based n values are represented by horizontal axis while the percentage performance is shown in vertical axis. There is slight difference between the two
  • 9. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 23 6. CONCLUSIONS AND FUTURE WORK In this paper we proposed a framework that supports distributed programming. The framework supports processing of big data with its pre-defined building blocks. It supports machine learning approach with natural language processing in order to perform clustering of given big data (documents). It makes use of Map Reduce programming paradigm in a simulated environment. The framework has provision for generic functionalities that can be reused by cloud users in the real world. Unstructured data is subjected to machine learning in order to obtain intelligence which is used to determine clusters. Recently emerged domain in the information technologies is Big Data which refers to data with characteristics such as volume, velocity and variety. The existing machine learning approaches cannot cope with Big Data. In this paper our framework with underlying algorithm supports scheduling jobs of multiple users concurrently. We built a custom simulator that demonstrates processing of big data in distributed environment. Our empirical results revealed that the performance of proposed framework is high in terms of accuracy when compared with that of ground truth. This research can be extended further to improve the framework to support different kinds of big data besides supporting multiple data mining operations on big data. REFERENCES [1] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordinate descent for l1-regularized loss minimization,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 321–328. [2] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 873–881. [3] M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are fast,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 2331–2339. [4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale distributed deep networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1232–1240. [5] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” J. Mach. Learn. Res., vol. 14, pp. 1303–1347, 2013. [6] S. A. Williamson, A. Dubey, and E. P. Xing, “Parallel Markov chain Monte Carlo for nonparametric mixture models,” in Proc.Int. Conf. Mach. Learn., 2013, pp. 98–106. [7] T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA: O’Reilly Media, 2012 [8] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “Distributed GraphLab: A framework for machine learning and data mining in the cloud,” in Proc. VLDB Endowment, vol. 5, pp. 716–727, 2012. [9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., 2010, p. 10. [10] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in Proc. 11th USENIX Conf. Operating Syst. Des. Implementation, 2014, pp. 583–598. [11] R. Power and J. Li, “Piccolo: Building fast, distributed programs with partitioned tables,” in Proc. USENIX Conf. Operating Syst. Des. Implementation, article 10, 2010, pp. 1–14. [12] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135–146. [13] STEVE LOHR. (2012). The Age of Big Data. Big Data’s Impact in the World, p1-5. [14] Hsinchun Chen,Roger H. L. Chiang and Veda C. Storey. (2012). BUSINESS INTELLIGENCE AND ANALYTICS: FROM BIG DATA TO BIG IMPACT. MIS Quarterly. 36 (4), p1165-1188. [15] Adam Jacobs. (2009). The Pathologies of Big Data. communicati ons of the acm. 52 (8), p1-9. [16] Min Chen ,Shiwen Mao and Yunhao Liu. (2014). Big Data: A Survey.Springer Science+Business Media, p1-39. [17] Alexandros Labrinidis and H. V. Jagadish. (2012). Challenges and Opportunities with Big Data. Proceedings of the VLDB Endowment. 5 (12), p1-2. [18] Tim Kraska. (2013). Finding the Needle in the Big Data Systems Haystack. IEEE Computer Society, p1-3.
  • 10. International Journal of Database Management Systems ( IJDMS ) Vol.8, No.5, October 2016 24 [19] Divyakant Agrawal,Sudipto Das and Amr El Abbadi. (2011). Big Data and Cloud Computing: Current State and Future Opportunities. ACM, p1-4. [20] Chris Snijders, Uwe Matzat and Ulf-Dietrich Reips. (2012). “Big Data”: Big Gaps of Knowledge in the Field of Internet Science. International Journal of Internet Science. 7 (1), p1-5. [21] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin and Shivnath Babu. (2011). Starfish: A Selftuning System for Big Data Analytics. Biennial Conference on Innovative Data Systems Research, p1-12. [22] Alfredo Cuzzocrea,Il-Yeol Song and Karen C. Davis. (2011). Analytics over Large-Scale Multidimensional Data: The Big Data Revolution.ACM, p1-3. [23] C.L. Philip Chen and Chun-Yang Zhang. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, p1-34. [24] Avita Katal,Mohammad Wazid and R H Goudar. (2013). Big Data: Issues, Challenges, Tools and Good Practices. IEEE. . (.), p1-6. [25] Christian Bizer, Peter Boncz, Michael L. Brodie and Orri Erling. (2011). The Meaningful Use of Big Data: Four Perspectives – Four Challenges.SIGMOD Record. 40 (4), p1-5. AUTHORS D.Krishna Madhuri Received her B.Tech degree in computer Science and engineering from Swarnandhra college of Engineering & Technology, Narsapur, Andhra Pradesh in 2009,and M.Tech Degree in Computer Science and Engineering from Sridevi Women’s Engineering college Hyderabad, Telangana in 2012. She is currently pursuing her Ph.D degree from Sri Satyasai University, Sehore, Madhya Pradesh. Currently she is working as Assistant Professor in the department of Computer Science and Engineering of Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana India. Herresearch interest includes Data Bases, Data Mining and Big Data.