Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
11
A Survey on Data Mining and Analysis in Hadoop and MongoDb
Manmitsinh C.Zala
Department of CS&E, Government Engineering College, Modasa, Aravalli, Gujarat, India
E-mail: manmit.zala@gmail.com
Prof. Jitendra S.Dhobi
Department of CS&E, Government Engineering College, Modasa, Aravalli, Gujarat, India
E-mail: jsdhobi@gmail.com
Abstract:
Data mining is the process of generating patterns and rules from various types of data marts and data
warehouses. The process consists of several steps, including data cleaning and anomaly detection, after which
the clean data is mined with various approaches. In this paper we discuss data mining on large data sets (Big
Data), for which the major issues are scalability and security. Hadoop is the tool used to mine the data, and
MongoDB, which parses the data in a key-value paradigm, provides the input for it. Other approaches and their
capabilities for data storage are also discussed. MapReduce is a method that can be used to reduce the data set,
cut query processing time and improve system throughput. In the proposed system we are going to mine big data
with Hadoop and MongoDB, try mining the data with sorted or double-sorted key-value pairs, and analyze the
outcome of the system.
Keywords- Data Mining, Hadoop, MapReduce, HDFS, MongoDB.
1. Introduction
“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and
analytics to manage it and to extract value and hidden knowledge from it.
The amount of data generated every day is growing drastically. Big data is a popular term used to
describe data whose volume is measured in zettabytes [1]. This vast amount of data is
generated by social media and networks, scientific instruments, mobile devices, and sensor technology and
networks. Managing, analyzing, summarizing, visualizing, and discovering knowledge from the collected
unstructured data in a timely and scalable fashion is a very difficult task with traditional data mining
tools. To analyze such data, Apache introduced a technology called Hadoop. The
characteristics of big data can be described using the three Vs: Volume, Variety and Velocity [13][14].
Hadoop is an Apache project. The Hadoop software library is a framework that supports
distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is
a combination of MapReduce and the Hadoop Distributed File System (HDFS). MapReduce processes the
data, and HDFS stores the data in the file system.
NoSQL [3][7] stands for “Not Only SQL”. SQL is a relational database language, but for big
data analysis relational techniques are not enough, so the alternative solutions are NoSQL databases such as
MongoDB, Cassandra, Voldemort, etc.
2. Literature Survey
2.1 Applications
The proposed work will provide a new approach to big data mining with Hadoop and MongoDB, based
on the MapReduce paradigm. The new approach will try to improve the computational time and the fault tolerance of
the system when handling big data analysis.
2.2 Related Work
This section provides information about the work done in big data mining, the various approaches used and the
methods proposed.
In [1] the authors discuss the meaning and importance of big data and the programming tools used for
big data mining. The example of Facebook shows that today it is
necessary to process a large number of data sets, for which traditional data stores are not enough. For example,
instead of relying on large MySQL tables, a caching approach based on memcached can be used for n-tier elements,
as MySQL performs very well on reads but lags on writes, which leads to very high reliability but low
partition tolerance in the CAP model. Another example given by the authors is Yelp, which uses AWS and Hadoop
for data analysis and stores its large data sets on Amazon S3, which provides redundant storage.
The authors propose data analysis using Apache Hadoop and JSON, with the data stored on and retrieved from
Amazon Web Services through their web services. The analysis showed that this method can
analyze large data from different sources with minimal utilization of resources.
In [2] the authors use the NoSQL database MongoDB to implement big data analysis, as
it has advantages over rigid SQL tables, which are not suited to today's large-scale data such as the web logs
generated every day. Moreover, the authors compare the performance of MongoDB and the HDFS framework using the
built-in MapReduce method of MongoDB. The authors do not cover modern data store technologies that integrate
with Hadoop, such as HBase and Hive. Experiments and results are shown for large data
sets, which is the motive for choosing the MongoDB data store for
large data sets.
Figure 2.1 HDFS – MongoDB comparison
For the framework proposed by the authors, the figure shows the effect of
the split size on performance using mongo-hadoop. The number of input records is approximately 9.3 million, or 4GB of
input data. With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size,
the authors are able to reduce this number to around 40 and achieve a considerable performance improvement. The curve
levels off between 128MB and 256MB, so 128MB was used as the split size for the rest of the tests, both
for native Hadoop-HDFS and mongo-hadoop.
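The relation between split size and mapper count reported in [2] can be approximated with simple arithmetic. The sketch below assumes one mapper per input split and treats the 4GB input as exactly 4096 MB; both are simplifying assumptions for illustration, and the function name estimated_mappers is hypothetical rather than part of Hadoop or mongo-hadoop.

```python
import math


def estimated_mappers(input_mb: float, split_mb: float) -> int:
    """Estimate the number of map tasks, assuming one task per input split."""
    return math.ceil(input_mb / split_mb)


INPUT_MB = 4 * 1024  # the 4GB input reported in [2], taken here as 4096 MB

for split_mb in (8, 16, 32, 64, 128, 256):
    mappers = estimated_mappers(INPUT_MB, split_mb)
    print(f"split size {split_mb:>3} MB -> about {mappers} mappers")
```

With an 8MB split the estimate is 512 mappers, in line with the "over 500" reported above. Larger splits bring the estimate down to a few dozen, the same order of magnitude as the "around 40" in [2]; the exact figures there depend on the real input size and on how mongo-hadoop computes its splits.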
Smith et al. [3] discuss various security issues and threats associated with big data. As the data reaches
zettabyte size, it also contains sensitive and confidential information, and unauthorized use of it must be
prevented. Apart from storage, retrieval and processing, security is therefore also an important concern for data mining.
Data from social web and consumer-oriented applications has a large impact on big data security. According to
the authors, the widespread use of smartphones has increased the uploading of photos and other sensitive information
to the web. To address this issue, the authors propose metadata analysis in big data, which creates an index of each
image uploaded to the social web so that images can be identified from their links, giving confidentiality on social
media. Each image can thus be scanned from the large databases of social media, and the results can be applied to
future security policies.
After considering security, we return to the problem of analyzing big data.
In [4], on the integration of NoSQL with big data analysis, the author proposes the Unity architecture model for
the analysis of data, as shown in Figure 2.2.
Figure 2.2 Unity Architecture
The objectives of this architecture are as follows:
• SQL is a declarative language that allows descriptive queries while hiding implementation and query execution
details.
• SQL is a standardized language allowing portability between systems and leveraging a massive existing
knowledge base of database developers.
• Supporting SQL allows a NoSQL system to seamlessly interact with other enterprise systems that use SQL and
JDBC/ODBC without requiring changes.
This system combines a relational database system and a NoSQL system. For this
interaction, one schema can be translated to the other through the JDBC API and the MongoDB connector.
In [6] MongoDB and Oracle databases are compared by their storage methods, syntax and retrieval
methods. Various experiments were conducted with different query processing times and numbers of records
processed; a few of the results achieved in that research are discussed here.
Figure 2.3 Insert query with MongoDB and Oracle [6]
As we can see, for a small number of records Oracle is faster at inserting than MongoDB, but as the number of
records increases, MongoDB is impressively ahead of the Oracle database.
The same results are achieved in the update query comparison.
Figure 2.4 Update query on records and time comparison, MongoDB vs Oracle
From this we can conclude that MongoDB is flexible and scalable for large data sets and provides
better integration for data storage and retrieval.
In [5] the authors discuss some very important parameters of MongoDB, focusing
on the CAP model. They compare various types of NoSQL data stores, test them with various
business intelligence systems, and conclude that NoSQL data stores offer a huge opportunity where SQL
databases are not suitable; their basic advantages are scalability and cross-node operation. The intersection
algorithm for MongoDB shows the effectiveness of the MongoDB data store with a key-value approach to modeling the
data.
In [7], [10] and [11] some practical approaches are shown for integrating NoSQL data stores with various
systems: a distributed architecture [7], Hashdoop [11] and evolution in Hadoop [10]. In
distributed systems, databases are handled by structured systems, but this fails when the number of data items
increases, so unstructured data stores are useful for such problems. Some major companies are capable of developing
their own unstructured data stores, e.g. Google's Bigtable, Yahoo's PNUTS, Hadoop's HBase and many more. But what
about small companies? The authors state that many open source products are available to handle such data;
the comparison between them is shown in the figure below.
Figure 2.5 Comparison of Different Data Stores [7]
Among all these data stores, MongoDB is a better replacement for MySQL, as it is semi-structured,
provides better joins, and takes less time for searching and for performing other queries.
In paper [10] multiple NoSQL data stores are compared, and we can see that MongoDB provides
consistency, partition tolerance and crash handling better than the other data stores.
However, the authors had limited computation power; the system can be improved by adding
more computation power for large data sets through cloud computing or a distributed approach. [11] is an example of
a Hadoop hash function for anomaly detection using the MapReduce programming model. The Hashdoop framework
splits the traffic using hash functions, and the detector detects anomalies across the various Hadoop clusters once the
traffic has been divided into smaller traffic lines. However, the authors do not store the data back to the original data
sets, so that data is lost.
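The idea of hash-based traffic splitting followed by per-split detection can be sketched as follows. This is a minimal illustration, not the Hashdoop implementation from [11]: it assumes packets are hashed on their source address, uses CRC32 as the hash function, and replaces the real detectors with a toy packet-count threshold. All function names and the sample traffic are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_of(ip: str, num_buckets: int) -> int:
    """Map an IP address to a bucket with a deterministic hash (CRC32)."""
    return zlib.crc32(ip.encode("utf-8")) % num_buckets


def split_traffic(packets, num_buckets):
    """Group packets so that all traffic of one source IP lands in the same bucket."""
    buckets = defaultdict(list)
    for packet in packets:
        buckets[bucket_of(packet["src"], num_buckets)].append(packet)
    return buckets


def detect(bucket, threshold):
    """Toy detector: flag source IPs that send more than `threshold` packets."""
    counts = defaultdict(int)
    for packet in bucket:
        counts[packet["src"]] += 1
    return {ip for ip, n in counts.items() if n > threshold}


# Made-up traffic: one noisy host and twenty quiet ones.
packets = [{"src": "10.0.0.1", "dst": "10.0.0.9"}] * 50 + [
    {"src": f"10.0.1.{i}", "dst": "10.0.0.9"} for i in range(20)
]

buckets = split_traffic(packets, num_buckets=4)
# Each bucket could be analyzed by a separate worker; here they run in sequence.
alarms = set().union(*(detect(b, threshold=10) for b in buckets.values()))
print(alarms)
```

Because the hash is deterministic, every packet of a given host ends up in the same bucket, so a detector that looks at one bucket still sees the complete behaviour of the hosts assigned to it.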
3. Background Study
3.1 Big Data
3.1.1 Architecture
Figure 3.1.1 Big data [9]
Big data is a distributed architecture for storing large amounts of data. According to recent research, online
data has increased in size, and CERN research shows that large volumes of data are produced even without operating
online. For example, it “will produce
roughly 15 petabytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer
DVDs a year!” [11]
Big data architecture consists of the following three segments:
• Storage System
• Processing
• Analysis
Figure 3.1.1 BigData system [9]
3.2 What is NoSQL?
NoSQL denotes an alternative to the traditional database system. The term was put forward by Eric Evans
in San Francisco. NoSQL covers a variety of different database systems, which provide data
manipulation as well as low read and write times.
Many large organizations have their own NoSQL databases, such as Bigtable at Google, which has had a strong
influence on NoSQL. The whole point is that they provide alternatives to traditional database products.
Many NoSQL products are available in the market, and they are widely used by many companies.
3.2.1 Mongo DB
MongoDB is a document-oriented NoSQL database developed by
10gen. The word “mongo” refers to large size (from “humongous”). It is fast and reliable, is written in C++, and can
also store large files across distributed locations.
It can also store binary data such as images, videos and MP3 files.
3.2.1.1: Query Model
Queries for MongoDB are expressed in a JSON-like syntax and are sent to MongoDB as BSON objects by the
database driver. The query model of MongoDB allows queries over all documents inside a collection,
including the embedded objects and arrays. Through the use of predefined indexes, queries can dynamically
be formulated during runtime.
Not all aspects of a query are formulated within a query language in MongoDB; depending on the MongoDB
driver for a programming language, some things may be expressed through the invocation of a method
of the driver.
{ "employees": [
    { "firstName": "Manmit", "lastName": "Zala" },
    { "firstName": "Pradip", "lastName": "Chavda" },
    { "firstName": "Nilay", "lastName": "Parekh" } ] }
The query model supports the following features:
1. Queries over documents and embedded subdocuments
2. Comparators (<, ≤, ≥, >)
3. Conditional operators (equals, not equals, exists, in, not in ...)
4. Logical operators: AND
5. Sorting by multiple Attributes
6. Group by
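Several of the features listed above can be illustrated with filter documents written as Python dictionaries. The operator names used here ($eq, $lt, $lte, $gt, $gte, $ne, $in, $nin, $exists, $and) are MongoDB query operators; everything else is an assumption for illustration. The matches function is a hypothetical, simplified evaluator covering only this subset, not MongoDB's query engine, and the sample documents and their fields are made up.

```python
import operator

# Comparison operators of the illustrative subset, keyed by MongoDB operator name.
COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}
_MISSING = object()  # marks a field that is absent from the document


def get_path(doc, path):
    """Follow a dotted path such as 'office.city' into embedded subdocuments."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(doc, query):
    """Return True if `doc` satisfies every condition in `query` (implicit AND)."""
    for field, condition in query.items():
        if field == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
            continue
        value = get_path(doc, field)
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # a plain value means equality
        for op, operand in condition.items():
            if op == "$exists":
                ok = (value is not _MISSING) == operand
            elif value is _MISSING:
                ok = op in ("$ne", "$nin")  # an absent field is "not equal" to anything
            else:
                ok = COMPARATORS[op](value, operand)
            if not ok:
                return False
    return True


# Made-up sample documents; "grade" and "office" are hypothetical fields.
employees = [
    {"firstName": "Asha", "lastName": "Shah", "grade": 3, "office": {"city": "Modasa"}},
    {"firstName": "Ravi", "lastName": "Patel", "grade": 5, "office": {"city": "Surat"}},
    {"firstName": "Meera", "lastName": "Shah", "grade": 4},
]

query = {"$and": [{"grade": {"$gte": 4}}, {"office.city": {"$exists": True}}]}
selected = [e["firstName"] for e in employees if matches(e, query)]
print(selected)  # ['Ravi']

# Sorting by multiple attributes: last name first, then first name.
by_name = sorted(employees, key=lambda e: (e["lastName"], e["firstName"]))
print([(e["lastName"], e["firstName"]) for e in by_name])
```

With a real MongoDB deployment the same filter document would be handed to the driver, which sends it to the server as BSON, as described above.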
3.2.1.2: Sharding
Sharding in MongoDB distributes the data of a collection across several machines. It uses the following components:
1) Configuration servers
2) Shard nodes
3) Services for routing
The routing services are known as mongos.
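The interplay of routing service and shard nodes can be sketched with a toy router. This is a simplified illustration under stated assumptions: in MongoDB the configuration servers hold the metadata that maps ranges of shard-key values to shards, and mongos consults that metadata, whereas the sketch replaces all of this with a plain hash-modulo rule. The Router class and its methods are hypothetical and not part of any MongoDB API.

```python
import hashlib


class Router:
    """Toy stand-in for a routing service: picks a shard for each shard-key value."""

    def __init__(self, shard_names):
        # Each shard is modelled as an in-memory list of documents.
        self.shards = {name: [] for name in shard_names}
        self._names = sorted(shard_names)

    def shard_for(self, key: str) -> str:
        """Choose a shard from a deterministic hash of the shard key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def insert(self, key, document):
        self.shards[self.shard_for(key)].append(document)

    def find(self, key_field, key):
        # Only the shard that owns the key is consulted.
        return [d for d in self.shards[self.shard_for(key)] if d[key_field] == key]


router = Router(["shard0", "shard1", "shard2"])
for i in range(100):
    router.insert(f"user{i}", {"user_id": f"user{i}", "visits": i})

print({name: len(docs) for name, docs in router.shards.items()})
print(router.find("user_id", "user42"))
```

The point of the sketch is that a query on the shard key touches a single shard, while the client only ever talks to the router.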
3.3 Apache Hadoop
Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed
computing environment. Hadoop is used in systems where multiple nodes are present and can process terabytes
of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data and can sustain node
failures, avoiding failure of the system as a whole [1].
3.3.1 Architecture
Figure 3.3.1 HDFS ARCHITECTURE [14]
3.4 Map Reduce Algorithm and Approaches
Map/Reduce is a programming paradigm that was made popular by Google where a task is divided into small
portions and distributed to a large number of nodes for processing (map), and the results are then summarized
into the final answer (reduce). Hadoop also uses Map/Reduce for data processing. Hence the different functions for
the processing are written in the form of a Hadoop job. A Hadoop job consists of mapper and reducer functions.
A framework, i.e. STS, is used as the integrated development environment (IDE) [1].
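The division of a task into portions, their independent processing and the final summarization can be simulated locally. The sketch below is a plain Python word count, not a Hadoop job: the "nodes" are ordinary function calls, and the function names and the sample input are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def split_input(lines, num_nodes):
    """Divide the input into roughly equal portions, one per simulated node."""
    return [lines[i::num_nodes] for i in range(num_nodes)]


def map_portion(portion):
    """Map step: each node counts the words in its own portion."""
    counts = Counter()
    for line in portion:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


lines = [
    "big data needs hadoop",
    "hadoop uses map reduce",
    "map reduce processes big data",
]

partials = [map_portion(p) for p in split_input(lines, num_nodes=2)]
total = reduce(reduce_counts, partials, Counter())
print(total)
```

In Hadoop the portions would be input splits stored in HDFS and the map and reduce steps would run on different machines; the data flow is the same.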
3.4.1 MapReduce with Hadoop
A. Many companies use Hadoop for big data analysis. For example, Facebook uses Hive with Hadoop [1].
B. Yelp uses AWS and Hadoop. Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB
of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch
scripts, most of those processing the logs.
Features powered by Amazon Elastic MapReduce include [1]:
1. People Who Viewed this Also Viewed
2. Review highlights
3. Auto complete as you type on search
4. Search spelling suggestions
5. Top searches
3.4.2 Map Reduce using MongoDb
MongoDB uses a key-value pair type of storage. The Map primitive processes a data list in order to
create key/value pairs. Then the Reduce primitive processes each pair in order to create new aggregated
key/value pairs [5].
Example [5]:
map(k1, v1) = list(k2, v2)                                        (1)
reduce(k2, list(v2)) = list(v3)                                   (2)
List:           (a; 2), (a; 4), (b; 4), (c; 5), (b; 2), (a; 1)    (3)
After mapping:  (a; [2, 4, 1]), (b; [4, 2]), (c; [5])             (4)
After reducing: (a; 7), (b; 6), (c; 5)                            (5)
Equations (1) and (2) show the map and reduce primitives.
Figure 3.5.2 Map reduce example[5]
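The example from [5] can be reproduced in a few lines. The sketch below is a local illustration, not MongoDB's mapReduce command: it groups the values of list (3) by key to obtain (4) and then sums each group to obtain (5). Summation is assumed as the aggregation because it yields the values shown in (5); the function names are hypothetical.

```python
from collections import defaultdict

# The input list (3) from the example.
pairs = [("a", 2), ("a", 4), ("b", 4), ("c", 5), ("b", 2), ("a", 1)]


def group_by_key(pairs):
    """Collect all values that share a key, giving the lists shown in (4)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_sum(key, values):
    """Aggregate the values of one key by summing them, as in (5)."""
    return key, sum(values)


grouped = group_by_key(pairs)
reduced = dict(reduce_sum(key, values) for key, values in grouped.items())

print(grouped)  # {'a': [2, 4, 1], 'b': [4, 2], 'c': [5]}
print(reduced)  # {'a': 7, 'b': 6, 'c': 5}
```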
4. Conclusion
Nowadays data grows day by day, and the storage, retrieval and analysis of big data in structured databases like
Oracle and MySQL is not feasible. We have therefore presented many NoSQL systems; among them MongoDB is preferable
as an alternative to MySQL. Mining knowledge from big data is still an active research area. In
future we are interested in better methods and systems for efficient mining of big data.
5. References
[1] Jyoti Nandimath, Ankur Patil, Ekata Banerjee, Pratima Kakade: “Big Data Analysis using Apache
Hadoop”, SKNCOE, Pune, India, 2013
[2] E. Dede, M. Govindaraju, D. Gunter, R. Canon, L. Ramakrishnan: “Performance Evaluation of a MongoDB
and Hadoop Platform for Scientific Data Analysis”, Lawrence Berkeley National Lab, Berkeley, CA 94720
[3] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von Voigt: “Big Data Privacy Issues in Public
Social Media”, 2013
[4] Ramon Lawrence: “Integration and Virtualization of Relational SQL and NoSQL Systems including MySQL
and MongoDB”, 2014 International Conference on Computational Science and Computational
Intelligence, 2014
[5] Laurent Bonnet, Anne Laurent, Michel Sala, Benedicte Laurent, Nicolas Sicard: “REDUCE, YOU SAY: What
NoSQL can do for Data Aggregation and BI in Large Repositories”, 2011 22nd International Workshop on
Database and Expert Systems Applications, 2011
[6] Alexandru Boicea, Florin Radulescu, Laura Ioana Agapin: “MongoDB vs Oracle - database comparison”,
2012 Third International Conference on Emerging Intelligent Data and Web Technologies, 2012
[7] Suyog S. Nyati, Shivanand Pawar, Rajesh Ingle: “Performance Evaluation of Unstructured NoSQL data over
distributed framework”, 2013 International Conference on Advances in Computing, Communications and
Informatics (ICACCI), 2013
[8] Lior Okman, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, Jenny Abramov: “Security Issues in NoSQL
Databases”, 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11, 2011
[9] Chanchal Yadav, Shuliang Wang, Manoj Kumar: “Algorithm and approaches to handle large Data - A Survey”,
IJCSN International Journal of Computer Science and Network, Vol 2, Issue 3, 2013
[10] Ruxandra Burtica, Eleonora Maria Mocanu, Mugurel Ionut Andreica,
Nicolae Tapus: “Practical application and evaluation of no-SQL databases in Cloud Computing”, ©2012 IEEE
[11] Romain Fontugne, Johan Mazel, Kensuke Fukuda: “Hashdoop: A MapReduce Framework for
Network Anomaly Detection”, 2014 IEEE INFOCOM
Workshops: 2014 IEEE INFOCOM Workshop on Security and Privacy in Big Data, 2014;
CERN – European Organization for Nuclear Research
WEB REFERENCES
[12] Big data - Wikipedia: http://en.wikipedia.org/wiki/Big_data
[13] Big Data Characteristics: http://en.wikipedia.org/wiki/Big_data#Characteristics
[14] Hadoop Architecture: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
BOOKS
[15] Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, Second
Edition, 2006, pp. 5-9, 119-145.
More Related Content

What's hot

Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSaciijournal
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technologypeertechzpublication
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoopdbpublications
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architectureRahul Chaturvedi
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningIOSR Journals
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSIJEACS
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Bernardo Najlis
 

What's hot (16)

Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
disertation
disertationdisertation
disertation
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
 
E018142329
E018142329E018142329
E018142329
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technology
 
big data
big databig data
big data
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association Mining
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
No sql database
No sql databaseNo sql database
No sql database
 
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 

Similar to A survey on data mining and analysis in hadoop and mongo db

Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
 
A Study on Graph Storage Database of NOSQL
A survey on data mining and analysis in hadoop and mongo db

Computer Engineering and Intelligent Systems, www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015

A Survey on Data Mining and Analysis in Hadoop and MongoDb

Manmitsinh C. Zala
Department of CS&E, Government Engineering College, Modasa, Aravalli, Gujarat, India
E-mail: manmit.zala@gmail.com

Prof. Jitendra S. Dhobi
Department of CS&E, Government Engineering College, Modasa, Aravalli, Gujarat, India
E-mail: jsdhobi@gmail.com

Abstract:
Data mining is the process of generating patterns and rules from various types of data marts and data warehouses. The process has several steps, including data cleaning and anomaly detection, after which the clean data is mined with various approaches. In this research we discuss data mining on large datasets (Big Data). With such large datasets the major issues are scalability and security. Hadoop is the tool used to mine the data, and MongoDB provides the input for it, using a key-value paradigm for parsing the data. Other approaches and their capabilities for data storage are also discussed in this report. MapReduce is a method that can be used to reduce the dataset in order to reduce query processing time and improve system throughput. In the proposed system we are going to mine big data with Hadoop and MongoDB, try to mine the data with sorted or double-sorted key-value pairs, and analyze the outcome of the system.

Keywords: Data Mining, Hadoop, MapReduce, HDFS, MongoDB.

1. Introduction
"Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
The amount of data generated every day is expanding drastically. Big data is a popular term used to describe data at the zettabyte scale [1]. This vast amount of data is generated by social media and networks, scientific instruments, mobile devices, sensor technology and networks.
The ability to manage, analyze, summarize, visualize, and discover knowledge from the collected unstructured data in a timely and scalable manner is a very difficult task with traditional data mining tools. To analyze such data, Apache introduced a technology called Hadoop. The characteristics of big data can be described using the three Vs: Volume, Variety and Velocity [13][14].
Hadoop is an Apache project. The Hadoop software library is a framework that supports distributed processing of large data across clusters of computers using simple programming models. Hadoop is a combination of MapReduce and the Hadoop Distributed File System (HDFS): MapReduce processes the data, and HDFS stores the data in the file system.
NoSQL [3][7] stands for "Not Only SQL". SQL is a relational database language, but for big data analysis relational techniques are not enough, so the alternative solutions are NoSQL databases such as MongoDB, Cassandra, Voldemort, etc.

2. Literature Survey
2.1 Applications
This proposal will provide a new approach to big data mining and analysis with Hadoop and MongoDB, based on the MapReduce paradigm. The new approach will try to improve the computation time and the fault tolerance of the system when dealing with big data analysis.
2.2 Related Work
This chapter provides information about the work done in big data mining, the various approaches used and the methods proposed.

In [1] the authors discuss the meaning and importance of big data analysis and the programming tools used for big data mining. The example of Facebook shows that today it is necessary to process a large number of datasets, for which traditional data stores are not enough. For example, instead of using large MySQL tables we can use a caching approach with memcached for n-tier elements, as MySQL has very good read performance but lags in writes, which leads to very high reliability but low partition tolerance in the CAP model. Another example given by the authors is Yelp, which uses AWS and Hadoop for data analysis and stores its large datasets on Amazon S3, described there as a RAID service. The authors propose data analysis using Apache Hadoop and JSON, with the data stored in and retrieved from Amazon Web Services. The analysis showed that this method can analyze large data from different sources with minimum utilization of resources.

In [2] the authors use the NoSQL database MongoDB to implement big data analysis, as it is advantageous over rigid SQL tables, which are not suited to today's large-scale data such as the web logs generated every day. Moreover, the authors compare the performance of MongoDB and the HDFS framework using the built-in MapReduce method of MongoDB. The authors do not cover the modern data store technologies and integrations available with Hadoop, such as HBase and Hive. Experiments and results are shown for large datasets, which is the motive for our choice of MongoDB as the data store for large datasets.

Figure 2.1 HDFS-MongoDB comparison

For the framework proposed by the authors, the figure shows the effect of the split size on performance using mongo-hadoop. The number of input records is about 9.3 million, or 4GB of input data. With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size, we are able to reduce this number to around 40 and achieve a considerable performance improvement. The curve levels off between 128MB and 256MB, so we decided to use 128MB as the split size for the rest of our tests both for native Hadoop-HDFS and mongo-hadoop.
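The mapper counts quoted above follow roughly from dividing the input size by the split size. The sketch below is illustrative arithmetic only, assuming one mapper per input split and binary units (1GB = 1024MB); the function name `estimated_mappers` is hypothetical and is not part of Hadoop or mongo-hadoop.

```python
import math

def estimated_mappers(input_mb: float, split_mb: float) -> int:
    """Estimate the mapper count, assuming one mapper per input split."""
    return math.ceil(input_mb / split_mb)

INPUT_MB = 4 * 1024  # 4GB of input data, as in the experiment reported in [2]

for split_mb in (8, 128, 256):
    mappers = estimated_mappers(INPUT_MB, split_mb)
    print(f"split size {split_mb:>3} MB -> about {mappers} mappers")
```

Under these assumptions an 8MB split gives 512 mappers, in line with the "over 500" reported, while a 128MB split gives 32. The reported figure of around 40 is somewhat higher, so the even division assumed here should be read as an approximation only.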
In [3], M. Smith et al. discuss various security issues and threats associated with big data. As the data is of zettabyte size and also contains sensitive and confidential information, it is necessary to prevent unauthorized use of the data; so apart from storage, retrieval and processing, security is also an important concern for data mining. Data from social web and consumer-oriented applications has a large impact on big data security. According to the authors, the vast use of smartphones has increased the uploading of photos and other sensitive information to the web, which is an issue. To address it, the authors propose metadata analysis in big data, which creates an index of each image uploaded on the social web so that it can be identified from its link, giving confidentiality over social media. Each image can thus be scanned from the big databases of social media, and the result can be applied to future security policies.

In [4], after considering security in analysis, we return to our problem of analyzing big data. This paper deals with the integration of NoSQL with big data analysis; the author proposes a model, the Unity architecture, for the analysis of data, as shown in Figure 2.2.
Figure 2.2 Unity Architecture

The objectives of this architecture are as follows:
• SQL is a declarative language that allows descriptive queries while hiding implementation and query execution details.
• SQL is a standardized language allowing portability between systems and leveraging a massive existing knowledge base of database developers.
• Supporting SQL allows a NoSQL system to seamlessly interact with other enterprise systems that use SQL and JDBC/ODBC without requiring changes.
This system provides a combination of both a relational database system and a NoSQL system. For this interaction, one schema can be translated to the other through the JDBC API and the MongoDB connector.

In [6] MongoDB and Oracle databases are compared by their storage methods, syntax and retrieval methods. Various experiments were also conducted, measuring query processing time for different amounts of processing. Here we discuss a few of the results achieved in that research.

Figure 2.3 Insert query with MongoDB and Oracle [6]

As we can see, for inserting a small number of records Oracle is faster than MongoDB, but as the number of records increases MongoDB is impressively ahead of Oracle. The same results are achieved in the update query comparison.

Figure 2.4 Update query on records and time comparison, MongoDB vs Oracle

From this we can conclude that MongoDB is flexible and scalable for large datasets and provides better integration for data storage and retrieval.

In [5] the authors discuss some very important parameters of MongoDB, focusing on the CAP model, compare various types of NoSQL data stores, and test them on various business intelligence systems. They conclude that NoSQL data stores provide a huge opportunity where SQL databases are not useful; their basic advantages are scalability and cross-node operation. The intersection algorithm for MongoDB demonstrates the effectiveness of the MongoDB data store for a key-value approach to modeling the data.

In [7], [10] and [11] some practical approaches are shown for making NoSQL data stores interact with various systems: a distributed architecture [7], Hashdoop [11] and an evaluation in Hadoop [10]. In distributed systems, databases are handled by structured systems, but these fail as the number of data items increases, so unstructured data stores are useful for such problems. Some major companies are capable of developing their own unstructured data stores, e.g. Google's Bigtable, Yahoo's PNUTS, Hadoop's HBase and many more. But what about small companies? The authors state that many open source products are available to handle such data; a comparison between them is shown in the figure below.

Figure 2.5 Comparison of different data stores [7]

Among all these data stores, MongoDB is a better replacement for MySQL, as it is semi-structured, provides better joins and takes less time for searching and for performing other queries.
In paper [10] multiple NoSQL data stores are compared, and we can see that MongoDB provides consistency, partition tolerance and crash handling better than the other data stores. In this paper, however, the authors had limited computation power; the system could be improved by adding more computation power for large datasets through cloud computing or a distributed approach. [11] is an example of using Hadoop with hash functions for anomaly detection under the MapReduce programming model. The Hashdoop framework splits the traffic into smaller traffic streams using hash functions, and the detector detects anomalies from the various Hadoop clusters. However, the authors do not store the data back into the original datasets, so it will be lost.

3. Background Study
3.1 Big Data
3.1.1 Architecture

Figure 3.1.1 Big data [9]

Big data is a distributed architecture for storing large amounts of data. According to recent research, the volume of online data has increased considerably. CERN research, for example, describes a data source that "will produce roughly 15 petabytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer DVDs a year!" [11]
The big data architecture consists of the following three segments:
• Storage system
• Processing
• Analysis

Figure 3.1.1 Big data system [9]

3.2 What is NoSQL?
NoSQL means an alternative to the traditional database system. The term was introduced by a scientist named Eric Evans in San Francisco. NoSQL databases comprise a variety of different database systems; they provide data manipulation as well as low read and write times. Many large organizations have their own NoSQL databases, such as Bigtable at Google, which has had much influence on NoSQL. The whole point is that they provide alternatives to traditional database products. Many NoSQL products are available in the market, and they are widely used by many companies.
3.2.1 MongoDB
MongoDB is a document-oriented NoSQL database developed by 10gen. The word "mongo" means large size. It is very fast and reliable and is written in C++. It is also used to store large files across distributed locations, and it can store binary data such as images, videos and MP3 files.

3.2.1.1 Query Model
Queries for MongoDB are expressed in a JSON-like syntax and are sent to MongoDB as BSON objects by the database driver. The query model of MongoDB allows queries over all documents inside a collection, including embedded objects and arrays. Through the use of predefined indexes, queries can be formulated dynamically at runtime. Not all aspects of a query are formulated within a query language in MongoDB; depending on the MongoDB driver for a programming language, some things may be expressed through the invocation of a method of the driver.

    { "employees": [
        { "firstName": "Manmit", "lastName": "Zala" },
        { "firstName": "Pradip", "lastName": "Chavda" },
        { "firstName": "Nilay", "lastName": "Parekh" }
    ] }

The query model supports the following features:
1. Queries over documents and embedded subdocuments
2. Comparators (<, ≤, ≥, >)
3. Conditional operators (equals, not equals, exists, in, not in, ...)
4. Logical operators: AND
5. Sorting by multiple attributes
6. Group by

3.2.1.2 Sharding
MongoDB sharding relies on the following components:
1) Configuration servers
2) Shard nodes
3) Routing services, known as mongos

3.3 Apache Hadoop
Apache Hadoop is a Java-based programming framework used for processing large datasets in a distributed computing environment.
Hadoop is used in systems where multiple nodes are present and can process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node failures and avoids failure of the system as a whole [1].

3.3.1 Architecture

Figure 3.3.1 HDFS architecture [14]

3.4 MapReduce Algorithm and Approaches
Map/Reduce is a programming paradigm made popular by Google, in which a task is divided into small portions and distributed to a large number of nodes for processing (map), and the results are then summarized into the final answer (reduce). Hadoop also uses Map/Reduce for data processing. Hence the different functions for the processing are written in the form of a Hadoop job. A Hadoop job consists of mapper and reducer functions. STS is used as the integrated development environment (IDE) [1].

3.4.1 MapReduce with Hadoop
A. Many companies use Hadoop for big data analysis; for example, Facebook uses Hive with Hadoop [1].
B. Yelp uses AWS and Hadoop. Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of them processing the logs. Features powered by Amazon Elastic MapReduce include [1]:
1. People Who Viewed this Also Viewed
2. Review highlights
3. Auto complete as you type on search
4. Search spelling suggestions
5. Top searches

3.4.2 MapReduce using MongoDB
MongoDB uses a key-value pair type of storage. The Map primitive processes a data list in order to create key/value pairs. Then the Reduce primitive processes each pair in order to create new aggregated key/value pairs [5].

Example [5]:
map(k1, v1) = list(k2, v2) ... (1)
reduce(k2, list(v2)) = list(v3) ... (2)
List: (a; 2), (a; 4), (b; 4), (c; 5), (b; 2), (a; 1) ... (3)
After mapping: (a; [2, 4, 1]), (b; [4, 2]), (c; [5]) ... (4)
After reducing: (a; 7), (b; 6), (c; 5) ... (5)

Equations (1) and (2) show the map and reduce primitives.

Figure 3.5.2 MapReduce example [5]
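The map and reduce primitives of equations (1) and (2) can be sketched in plain Python on the example list (3). This is a single-process illustration only, not the implementation used by MongoDB or Hadoop; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and summation is assumed as the aggregate because it reproduces result (5).

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; here each input record is already such a pair."""
    for key, value in records:
        yield key, value

def shuffle(pairs):
    """Group the emitted values by key, preserving first-seen key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Aggregate each key's value list; summation is an assumption here."""
    return {key: sum(values) for key, values in grouped.items()}

# Example list (3) from the text.
data = [("a", 2), ("a", 4), ("b", 4), ("c", 5), ("b", 2), ("a", 1)]

grouped = shuffle(map_phase(data))
print(grouped)                # {'a': [2, 4, 1], 'b': [4, 2], 'c': [5]}
print(reduce_phase(grouped))  # {'a': 7, 'b': 6, 'c': 5}
```

The grouped dictionary corresponds to line (4) and the reduced dictionary to line (5). In a distributed setting the grouping step would be carried out by the framework between the map and reduce stages rather than by user code.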
4. Conclusion

Data volumes grow day by day, and the storage, retrieval and analysis of big data in structured databases such as Oracle and MySQL is no longer feasible. We have therefore presented several NoSQL systems; among them, MongoDB is preferable as an alternative to MySQL. Mining knowledge from big data remains an active research area. In future we are interested in better methods and systems for efficient mining of big data.

5. References

[1] Jyoti Nandimath, Ankur Patil, Ekata Banerjee, Pratima Kakade: "Big Data Analysis using Apache Hadoop", SKNCOE, Pune, India, 2013.
[2] E. Dede, M. Govindaraju, D. Gunter, R. Canon, L. Ramakrishnan: "Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis", Lawrence Berkeley National Lab, Berkeley, CA 94720.
[3] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von Voigt: "Big Data Privacy Issues in Public Social Media", 2013.
[4] Ramon Lawrence: "Integration and Virtualization of Relational SQL and NoSQL Systems including MySQL and MongoDB", 2014 International Conference on Computational Science and Computational Intelligence, 2014.
[5] Laurent Bonnet, Anne Laurent, Michel Sala, Benedicte Laurent, Nicolas Sicard: "Reduce, You Say: What NoSQL can do for Data Aggregation and BI in Large Repositories", 2011 22nd International Workshop on Database and Expert Systems Applications, 2011.
[6] Alexandru Boicea, Florin Radulescu, Laura Ioana Agapin: "MongoDB vs Oracle - database comparison", 2012 Third International Conference on Emerging Intelligent Data and Web Technologies, 2012.
[7] Suyog S. Nyati, Shivanand Pawar, Rajesh Ingle: "Performance Evaluation of Unstructured NoSQL data over distributed framework", 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2013.
[8] Lior Okman, Nurit Gal-Oz, Yaron Gonen, Ehud Gudes, Jenny Abramov: "Security Issues in NoSQL Databases", 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11, 2011.
[9] Chanchal Yadav, Shuliang Wang, Manoj Kumar: "Algorithm and approaches to handle large Data - A Survey", IJCSN International Journal of Computer Science and Network, Vol. 2, Issue 3, 2013.
[10] Ruxandra Burtica, Eleonora Maria Mocanu, Mugurel Ionut Andreica, Nicolae Tapus: "Practical application and evaluation of no-SQL databases in Cloud Computing", IEEE, 2012.
[11] Romain Fontugne, Johan Mazel, Kensuke Fukuda: "Hashdoop: A MapReduce Framework for Network Anomaly Detection", CERN, European Organization for Nuclear Research, 2014 IEEE INFOCOM Workshop on Security and Privacy in Big Data, 2014.

WEB REFERENCES
[12] Big data, Wikipedia: http://en.wikipedia.org/wiki/Big_data
[13] Big Data Characteristics: http://en.wikipedia.org/wiki/Big_data#Characteristics
[14] Hadoop Architecture: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

BOOKS
[15] Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, Second Edition, 2006, pp. 5-9, 119-145.