Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Omar HAJOUI, Mohamed TALEA
LTI Laboratory, Faculty of Science Ben M’Sik
Hassan II University, Casablanca, Morocco
{hajouio, taleamohamed}@yahoo.fr
Abstract— Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) that struggle with unstructured data and real-time analysis, the era of Big Data opens a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses of these new data integrated into the business ecosystem. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.
Keywords— big data analytics; NoSQL databases; Apache Spark; Hadoop; MongoDB; performance.
I. INTRODUCTION
The Big Data phenomenon covers two realities for companies: on the one hand, the continuous explosion of data; on the other hand, the capacity to process and analyze this great mass of data for profit. With Big Data, organizations can now manage and process massive data to extract value, decide, and act in real time.
NoSQL databases were developed to provide a set of new data management features while overcoming some limitations of currently used relational databases [1]. NoSQL databases are not relational and do not require a model or structure for data storage, which facilitates storage and data search. In addition, they allow horizontal scalability, giving administrators the ability to increase the number of server machines to reduce overall system load. New nodes are integrated and operated automatically by the system. Horizontal scalability reduces query response times at low cost.
In relation to NoSQL databases (Hadoop, MongoDB, Cassandra, HBase, Redis, Riak, etc.), a new profession has appeared: the data scientist. Data science is the extraction of knowledge from data sets [2, 3]. It employs techniques and theories derived from several broader areas of mathematics, mainly statistics, probabilistic models and machine learning. Thus, to develop algorithms in a distributed environment, the analyst must master big data analytics tools (Mahout, MapReduce, Spark and Storm) and learn the syntax of functional languages such as Scala, Erlang or Clojure.
Big data analytics therefore brings functional languages and robust methods back into favor: decision trees [4, 5], random forests [6], k-means [7], and the Naive Bayes classifier [8], all easily distributable (via MapReduce) across thousands of nodes.
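To make this concrete, here is a minimal sketch (our illustration, not from the paper) of distributing one of these methods, k-means [7], with Spark MLlib; the tiny in-memory dataset and all names are invented for the example:

// A hedged sketch (ours, not the paper's): k-means [7] distributed with
// Spark MLlib. The tiny in-memory dataset is invented for illustration.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two obvious clusters around (0, 0) and (9, 9).
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 9.0), Vectors.dense(8.0, 9.0)
    ).map(Tuple1.apply).toDF("features")

    // Spark parallelizes the iterative assign/update steps across the cluster.
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}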
For storing collected data, any NoSQL database can fulfill this role. However, the need to analyze these data pushes us to choose this database carefully, especially since in the field of Big Data the analytic part is becoming more and more important. For advanced, real-time analytics, the best framework available is Apache Spark [9, 10]. According to the official version, Spark uses the Hadoop HDFS file system.
In a previous study [11] based on a multi-criteria analysis method, the MongoDB system obtained the highest score. Today, this result is confirmed: the system has become popular [12]. According to a white paper [13] published by MongoDB, the combination of the fastest analysis engine (Spark) with the fastest-growing database (MongoDB) allows companies to easily perform reliable real-time analysis. This led us to compare Spark's performance against the most popular NoSQL databases, MongoDB and Hadoop. In this article, we present and discuss the results of our experimental study and thus determine the software combination that allows sophisticated analyses in real time.
This paper is organized as follows: Section II presents big data analytics on Hadoop and MongoDB. In Section III, we present the results of an experimental study on the performance of the Spark framework with MongoDB and Hadoop. Section IV provides a conclusion.
II. BIG DATA ANALYTICS
In this part, we will introduce the data analysis technologies
used on Hadoop and MongoDB.
A. Big Data Analytics on Hadoop
The first integrated solution with Hadoop for data analysis is the MapReduce framework. MapReduce is not in itself a database element: this distributed information-processing approach takes an input list and produces one in return. It can be used in many situations and is well suited to distributed processing needs and decision-making processes.
MapReduce was defined in 2004 in an article written by Google. The principle is simple: to distribute a computation, Google imagined a two-step operation: first, an assignment of operations to each machine (Map), followed by a grouping of the results (Reduce). The needs at Google that gave birth to MapReduce were twofold: handling gigantic volumes of unstructured data (web pages to analyze to feed the Google search engine, or the logs produced by its indexing engines, for example), and deriving results from calculations, aggregates and summaries; in short, from analysis.
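As an illustration (ours, not Google's published code), the classic word-count example can be expressed in the two-step style the model imposes; this single-machine Scala sketch only simulates the Map and Reduce phases, with the shuffle reduced to a groupBy:

// A minimal, single-machine sketch of the MapReduce word-count idea.
// mapPhase/reducePhase are illustrative names, not a real cluster API.
object WordCountSketch {
  // Map: emit a (word, 1) pair for every word of every input line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle + Reduce: group the pairs by word and sum the counts.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val input = Seq("to be or not to be", "to decide is to act")
    println(reducePhase(mapPhase(input)))
    // Map(to -> 4, be -> 2, or -> 1, not -> 1, decide -> 1, is -> 1, act -> 1)
  }
}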
The free reference implementation of MapReduce is called Hadoop, a system developed in Java by a team led by Doug Cutting for the purposes of its Nutch distributed indexing engine for Yahoo!. Hadoop directly implements the Google paper on MapReduce and bases its distributed storage on HDFS (Hadoop Distributed File System), which implements the Google paper on GFS (Google File System). The Hadoop MapReduce framework (YARN) has since been adopted by several NoSQL databases, such as HBase and Cassandra.
Facebook then developed the HQL language (Hive Query Language) on Hive, a language close to SQL for querying HDFS. Another language, called Pig, was developed by Yahoo; similar in syntax to Perl, it aims at the same goals as Hive. In addition, Cloudera, another Hadoop distribution, integrates the Impala query engine, which analysts and data scientists favor for performing analysis on data stored in Hadoop via SQL or business intelligence tools. The Mahout project provides algorithm implementations for business intelligence; for example, it provides machine-learning algorithms (k-means, Random Forest).
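HQL itself is not reproduced in the paper; as a hedged stand-in for this SQL-on-big-data style of querying, here is a Spark SQL analog in Scala (explicitly Spark SQL, not Hive's HQL; the table and column names are invented):

// Not HQL itself: a Spark SQL analog of the declarative, SQL-style querying
// that Hive popularized over HDFS data. Table and column names are invented.
import org.apache.spark.sql.SparkSession

object SqlStyleQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlStyleQuery")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val logs = Seq(("GET", 200), ("GET", 404), ("POST", 200)).toDF("method", "status")
    logs.createOrReplaceTempView("logs")

    // The query is planned and executed by Spark across the cluster.
    spark.sql("SELECT status, COUNT(*) AS hits FROM logs GROUP BY status").show()
    spark.stop()
  }
}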
B. Big Data Analytics on MongoDB
MongoDB is an open-source document-oriented database
designed for exceptionally high performance and developed in
C ++. Data is stored and queried in BSON format similar to
JSON. It has dynamic and flexible schemas, making data inte-
gration easier and faster than traditional databases. Unlike
NoSQL databases that offer basic queries. Developers can use
MongoDB native queries and data mining capabilities to gen-
erate many classes of analysis, before having to adopt dedicat-
ed frameworks such as Spark or MapReduce for more special-
ized tasks.
Several organizations, including McAfee, Salesforce, Buzzfeed, Amadeus, KPMG and many others, rely on MongoDB's powerful query language, aggregations and indexing to generate real-time analytics directly on their operational data. MongoDB users have access to a wide range of query, projection and update operators that support real-time analytic queries on operational data:
• The MongoDB Aggregation Pipeline is similar in concept to the SQL GROUP BY statement, enabling users to generate aggregations of values returned by the query (e.g., count, minimum, maximum, average, intersections) that can be used to power analytics dashboards and visualizations (see the sketch after this list).
• Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
• Search queries return results in relevance order and in faceted groups, based on text arguments using Boolean operators (e.g., AND, OR, NOT), and through bucketing, grouping and counting of query results.
• MongoDB provides native support for MapReduce, allowing complex JavaScript processing. Multiple MapReduce jobs can run simultaneously on the same server and on sharded collections.
• JOINs, graph queries, key-value queries, and more.
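As a minimal sketch of the first point (ours, using the official MongoDB Scala driver; the database, collection, and field names are invented, and the exact helper imports depend on the driver version):

// A hedged sketch of an aggregation pipeline, the MongoDB analogue of
// SQL's GROUP BY. Database, collection and field names are invented.
import org.mongodb.scala.MongoClient
import org.mongodb.scala.model.Aggregates.group
import org.mongodb.scala.model.Accumulators.sum
import scala.concurrent.Await
import scala.concurrent.duration._

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClient("mongodb://localhost:27017")
    val incidents = client.getDatabase("crimes").getCollection("incidents")

    // Count documents per primary type, the analogue of:
    // SELECT primaryType, COUNT(*) FROM incidents GROUP BY primaryType
    val pipeline = Seq(group("$primaryType", sum("count", 1)))

    // Block on the asynchronous Observable so the program prints before exiting.
    val docs = Await.result(incidents.aggregate(pipeline).toFuture(), 30.seconds)
    docs.foreach(doc => println(doc.toJson()))
    client.close()
  }
}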
C. Big Data Analytics with Apache Spark
The MapReduce framework, despite being widely used by companies for the analysis of Big Data, has an unsatisfactory response time, and its programs execute only in batch form. After a map or reduce operation, the result must be written to disk; this disk-written data is how mappers and reducers communicate with each other. It is also the write to disk that provides a certain fault tolerance: if a map or reduce operation fails, it suffices to read the data back from disk to resume where the computation left off. However, these writes and reads are time-consuming. In addition, the expression set composed exclusively of map and reduce operations is very limited and not very expressive; in other words, it is difficult to express complex operations using only this set of two operations.
Apache Spark is an alternative to Hadoop MapReduce for distributed computing that aims to solve both of these problems. The fundamental difference between Hadoop MapReduce and Spark is that Spark keeps intermediate data in RAM, not on disk. This has several important consequences for the speed of computation as well as for the overall architecture of Spark.
Spark offers a complete and unified framework (Figure 1) to meet Big Data processing needs for datasets that vary both in nature (text, graph, etc.) and in source type (batch or real-time flow). It allows applications to be written quickly in Java, Scala or Python, includes a set of more than 80 high-level operators, and can be used interactively to query data from a shell. In addition to Map and Reduce operations, Spark supports SQL queries and data streaming, and offers machine learning and graph-oriented processing functions. Developers can use these capabilities stand-alone or combine them into a complex processing chain.
Figure 1: Apache Spark Ecosystem
Spark's programming model is similar to MapReduce, except that Spark introduces a new abstraction called Resilient Distributed Datasets (RDDs). Using RDDs, Spark can provide solutions for several applications that previously required the integration of multiple technologies, including SQL, streaming, machine learning and graph processing.
A Dataset is a distributed collection of data. It can be viewed as a conceptual evolution of RDDs (Resilient Distributed Datasets), historically the first distributed data structure used by Spark. A DataFrame is a Dataset organized into named columns, like tables in a database. With the Scala programming interface, the DataFrame type is simply an alias of the Dataset[Row] type.
It is possible to apply actions to Datasets, which produce values, and transformations, which produce new Datasets, as well as certain functions that do not fit into either category.
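A minimal Scala sketch of this distinction (ours; the CSV path and the column name are invented):

// A hedged sketch contrasting lazy transformations with eager actions.
// The CSV path and the "Arrest" column are invented for illustration.
import org.apache.spark.sql.SparkSession

object ActionsVsTransformations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ActionsVsTransformations")
      .master("local[*]")
      .getOrCreate()

    // Reading a CSV yields a DataFrame, i.e. a Dataset[Row].
    val crimes = spark.read.option("header", "true").csv("crimes.csv")

    // A transformation: lazily describes a new Dataset, computes nothing yet.
    val arrests = crimes.filter(crimes("Arrest") === "true")

    // Actions: trigger the distributed computation and return values.
    println(arrests.count()) // materializes the filtered Dataset to count it
    println(crimes.first())  // fetches only the first Row

    spark.stop()
  }
}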
Figure 2: Spark command-line example
Spark exposes RDDs through a functional programming
API in Scala, Java, Python, and R, where users can simply
pass local functions to run on the cluster.
III. COMPARISON
A. Experimental Results
We made the comparison on files of the same size and type (.CSV). The test files are available at "https://catalog.data.gov/dataset/crimes-2001-to-present-398a4". We copied each file into the Hadoop file system; the same file was then imported into MongoDB.
We ran the test on one node, three nodes and four nodes. The machines used have the following configuration:
• 8 GB RAM
• Linux Fedora 26
• 120 GB SSD
• 6th-generation Intel i5 processor
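The paper does not list its test code; a hedged Scala sketch of this kind of measurement (the connector format name, URIs and paths are assumptions that depend on the MongoDB Spark connector version) might look like:

// A hedged sketch of timing Spark's first() and count() actions against the
// same CSV stored in HDFS and imported into MongoDB. URIs, paths and the
// connector format name are assumptions, not taken from the paper.
import org.apache.spark.sql.SparkSession

object BenchmarkSketch {
  // Run a block and print its wall-clock time in milliseconds.
  def time[A](label: String)(block: => A): A = {
    val t0 = System.nanoTime()
    val result = block
    println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BenchmarkSketch")
      // Read by the MongoDB Spark connector (connector-version dependent).
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/crimes.incidents")
      .getOrCreate()

    // The same file, once from HDFS ...
    val fromHdfs = spark.read.option("header", "true")
      .csv("hdfs://namenode:9000/data/crimes.csv")
    // ... and once from MongoDB through the connector.
    val fromMongo = spark.read.format("com.mongodb.spark.sql").load()

    time("HDFS  first")(fromHdfs.first())
    time("HDFS  count")(fromHdfs.count())
    time("Mongo first")(fromMongo.first())
    time("Mongo count")(fromMongo.count())

    spark.stop()
  }
}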
Table 1: Spark's performance with Hadoop and MongoDB

Nodes | File size (GB) | Action | Hadoop | MongoDB
------|----------------|--------|--------|--------
1     | 1.55           | first  | 96 ms  | 77 ms
1     | 1.55           | count  | 10 s   | 2.0 min
3     | 3.11           | first  | 90 ms  | 65 ms
3     | 3.11           | count  | 19 s   | 3.4 min
4     | 4.66           | first  | 0.1 s  | 57 s
4     | 4.66           | count  | 29 s   | 5.3 min
These results are illustrated in the following figure:
Figure 3: Comparison of Spark's performance versus Hadoop
and MongoDB
B. Results Interpretation
According to the results of this study, the execution time of
the first operation that looks for the first record of the file is
the same on Hadoop or MongoDB, sometimes Spark is faster
with MongoDB, but the execution time of the operation count
that requires the change of the entire file in memory in a RDD,
Spark is much faster with Hadoop.
For the moment, Hadoop remains the best global storage solution, with more advanced administration, security and monitoring tools. Oracle made this choice for its new data discovery and analysis solution, Big Data Discovery: the product installs on a Hadoop cluster (exclusively Cloudera) and relies heavily on Spark for its processing.
IV. CONCLUSION
In this article, we presented the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The aim of this study was to determine the software combination that allows sophisticated analysis in real time. According to the results of this study, Spark is much faster with Hadoop.
REFERENCES
[1] NoSQL, http://nosql-database.org/, 2018.
[2] V. Dhar, "Data Science and Prediction," Communications of the ACM, no. 12, December 2013, pp. 64-73.
[3] T. H. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review, 2012.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Chapman & Hall, New York, 1984.
[5] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, 1996.
[6] L. Breiman, "Random forests," Machine Learning, vol. 45, 2001.
[7] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, 1967, pp. 281-297.
[8] M. E. Maron, "Automatic Indexing: An Experimental Inquiry," Journal of the ACM (JACM), vol. 8, no. 3, 1961, pp. 404-417.
[9] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, et al., "Apache Spark: A unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, 2016, pp. 56-65.
[10] S. Gopalani and R. Arora, "Comparing Apache Spark and MapReduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, 2015.
[11] H. Omar, D. Rachid, T. Mohammed, and I. Zouhair, "An Advanced Comparative Study of the Most Promising NoSQL and NewSQL Databases With a Multi-Criteria Analysis Method," Journal of Theoretical and Applied Information Technology, vol. 81, no. 3.
[12] Solid IT, DB-Engines Ranking, https://db-engines.com, 2018.
[13] MongoDB, https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action