This document discusses and compares various big data analytics software and tools. It begins with an abstract describing how companies now use big data analytics software to handle large amounts of data. The document then provides an executive summary of a research study analyzing how over 50 companies use big data analytics. The main body compares the features and benefits of various popular big data analytics software, including Apache Hadoop, CDH, Cassandra, Knime, Datawrapper, MongoDB, Lumify, HPCC, Storm, SAMOA, Talend, and RapidMiner. It also discusses analyzing data sets using the R programming language. The conclusion emphasizes how big data tools help transform large amounts of raw data into useful analytics and insights.
Running Head: BIG DATA PROCESSING OF SOFTWARE AND TOOLS
University of the Cumberlands
Big Data Processing of Software and Tools
Data Science & Big Data Analytics
ITS 836-21 Group-1
Prof: Gamini Bulumulle
Date submitted: 02/23/2020
Submitted By:
Table of contents
Abstract
Executive summary
RapidMiner
Analyzing the data sets using R language
Conclusion
References
Abstract
Big data analytics has been in use for many years, and most
companies have embraced it to harness the data generated by
their day-to-day operations. Companies that apply analytics can
gain substantial benefits. As early as the 1950s, companies
practiced a crude form of analytics through spreadsheet
analysis, which revealed small pieces of data and simple data
patterns. Today, companies use big data analytics software to
handle very large volumes of data because of the benefits it
brings to the business, including speed in handling data,
efficiency, and productivity. Many businesses prefer to
accumulate large amounts of data and later run analytics on it
for future reference. Big data analytics helps businesses make
sound choices about how they handle data in the organization.
The ability of big data tools to work quickly and remain
efficient gives companies an advantage that they did not have
previously. This research paper focuses mainly on big data
analytics software and its benefits to an organization.
Keywords: Big data, analysis, spreadsheet, efficiency,
organization
Executive summary
Big data analysis software gives organizations the ability to
generate new ideas based on the results of the analysis. This in
turn encourages more effective and efficient business ideas,
increased benefits, increased proficiency, and happier clients.
In research by Tom Davenport, more than fifty companies were
analysed to see how they used big data analysis software
(Chandarana & Vijayalakshmi, 2014). The research concluded that
the cost of data analysis decreased: companies that used big
data analytics software such as Apache Hadoop and cloud-based
analysis had reduced costs for storing and analysing data, and
they also had an upper hand in making business decisions. The
research also found that companies using big data analysis
software were quicker and more agile when analysing data. With
in-memory analysis and Hadoop, combined with the ability to
analyse new collections of data, companies can analyse data at
considerable speed and reach conclusions based on the results.
Big data analytics software also improves the ability to measure
what customers need. Davenport's research emphasizes that big
data analytics brings a better understanding of clients' needs
and better ways to address them. Nowadays, many organizations
use big data analytics widely to differentiate themselves in the
market. With open source big data analytics software, the most
valuable parts of the organization are secured and expenses are
reduced. Hadoop is one of the most widely used big data
analytics tools among businesses, and many vendors currently
employ Hadoop in their services.
Hypothetically, a company may need to carry out market analysis
to ascertain market trends. This scenario calls for big data
tools. Software such as Hadoop, Apache SAMOA, Cassandra, and
Datawrapper can be used to analyse the data and build a picture
of the market, with each tool playing its own role. For example,
Hadoop can analyse huge data sets and surface information about
future trends in that line of business, while Datawrapper can
help the organization visualize the information being analysed
for market trends.
Big data analytics software
Many questions arise when it comes to using big data analytics
in the modern world. They include which analysis software to
use, how large the data sets are, what the typical data output
of the organization is, and so on (Bhosale & Gadekar, 2014). Big
data analysis tools can be broadly classified as development
platforms, advanced tools, analysis instruments for data
analytics, and other analysis tools. Some of the software used
for big data analytics is described below.
Apache Hadoop
Hadoop is used in big data analytics to process huge volumes of
data on clustered file systems. It processes big data sets using
the MapReduce programming model. It is open source software
written in Java that provides cross-platform support, and it is
one of the most widely used analytics tools. Reportedly, more
than fifty Fortune companies use Hadoop in their data analysis
systems. Noteworthy names include Facebook, Intel, Amazon Web
Services, Hortonworks, IBM, Microsoft, and many more.
Hadoop has many benefits. Its core is a distributed file system
(HDFS) that can hold all kinds of data, such as pictures, XML,
and JSON. It is valuable for R&D purposes, it provides quick
access to data, and it is highly scalable and highly available
because it runs on a cluster of computers. However, Hadoop also
has disadvantages. They include data redundancy, which can
consume disk space, and I/O operations that could be better
optimized.
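The MapReduce model mentioned above can be illustrated with a small word count. The following is a minimal single-process sketch in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of input splits."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input splits, invented for this example.
splits = ["big data tools", "big data analytics"]
counts = word_count(splits)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on different nodes, with HDFS supplying the input splits; the sketch only shows the data flow.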
CDH (Cloudera Distribution for Hadoop) software
CDH targets enterprise-class deployments of Hadoop technology.
It is fully open source and offers a free platform distribution
that includes Apache Hadoop, Apache Spark, Apache Impala, and
more. It allows you to collect, process, administer, discover,
model, and distribute large volumes of data. Benefits of using
CDH software: comprehensive distribution, Cloudera Manager
administers the Hadoop cluster very well, easy implementation,
less complex administration, and high security and governance.
Disadvantages of using CDH software include: a few complicated
UI features, such as charts in the Cloudera Manager (CM)
service, multiple recommended approaches to installation, which
can be confusing, and a licensing price on a per-node basis that
is quite expensive.
Cassandra
Apache Cassandra is a free and open-source distributed NoSQL
DBMS built to manage huge volumes of data spread across many
commodity servers while delivering high availability. It uses
CQL (Cassandra Query Language) to interact with the database.
Prominent organizations using Cassandra include Accenture,
American Express, Facebook, General Electric, Honeywell, and
Yahoo. Benefits of using Apache Cassandra include: no single
point of failure, fast handling of big data, log-structured
storage, automated replication, linear scalability, and a simple
ring architecture. Disadvantages of using Cassandra include:
troubleshooting and maintenance require some extra effort,
clustering could be improved, and there is no row-level locking
feature.
KNIME
KNIME stands for Konstanz Information Miner. It is an
open-source tool used for enterprise reporting, integration,
analytics, CRM, data mining, data analysis, text mining, and
business intelligence. It supports Linux, OS X, and Windows
operating systems, and it can be considered a good alternative
to SAS. Top organizations using KNIME include Comcast, Johnson
and Johnson, and Canadian Tire. Benefits of using KNIME include:
simple ETL operations, good integration with other technologies
and languages, a rich algorithm set, highly usable and organized
workflows, automation of a great deal of manual work, no
stability issues, and easy setup. Disadvantages of using KNIME
for data analytics: data handling capacity could be improved, it
occupies nearly all of the available RAM, and it could have
allowed integration with graph databases.
Datawrapper
Datawrapper is an open-source platform for data visualization
that helps its users generate simple, precise, and embeddable
charts quickly. Its major customers are newsrooms spread all
over the world, including The Times, Fortune, Mother Jones,
Bloomberg, and Twitter. Benefits of using Datawrapper for big
data analytics: it is device friendly and works very well on all
types of devices (mobile, tablet, or desktop), and it is fully
responsive, fast, and interactive. It brings all the charts
together in one place, offers great customization and export
options, and requires zero coding. Disadvantages: limited color
palettes.
MongoDB
MongoDB is a NoSQL, document-oriented database written in C,
C++, and JavaScript. It is free to use and is an open-source
tool that supports multiple operating systems, including Windows
Vista (and later versions), OS X (10.7 and later versions),
Linux, Solaris, and FreeBSD. Its main features include
aggregation, ad hoc queries, the BSON format, sharding,
indexing, replication, server-side execution of JavaScript, a
schemaless design, capped collections, MongoDB Management
Service (MMS), load balancing, and file storage. Major customers
using MongoDB include Facebook, eBay, MetLife, and Google.
Benefits of using MongoDB for big data analytics: easy to learn,
support for multiple technologies and platforms, no hiccups in
installation and maintenance, reliable, and low cost.
Disadvantages of using MongoDB for big data analytics: limited
analytics and slow performance for certain use cases.
Lumify
Lumify is a free and open-source tool for big data
fusion/integration, analysis, and visualization. Its primary
features include full-text search, 2D and 3D graph
visualizations, automatic layouts, link analysis between graph
entities, integration with mapping systems, geospatial analysis,
multimedia analysis, and real-time collaboration through a set
of projects or workspaces. Benefits of using Lumify for big data
analytics: it is scalable and secure, it is supported by a
dedicated full-time development team, and it supports
cloud-based environments. It works well with Amazon's AWS.
HPCC
HPCC stands for High-Performance Computing Cluster. It is a
complete big data solution built on a highly scalable
supercomputing platform. HPCC is also referred to as DAS (Data
Analytics Supercomputer). The tool was developed by LexisNexis
Risk Solutions. It is written in C++ and a data-centric
programming language known as ECL (Enterprise Control Language).
It is based on the Thor architecture, which supports data
parallelism, pipeline parallelism, and system parallelism. It is
an open-source tool and a good substitute for Hadoop and some
other big data platforms. Benefits of using HPCC for big data
analytics: the architecture is based on commodity computing
clusters that provide high performance, it processes data in
parallel, it is fast, powerful, and highly scalable, it supports
high-performance online query applications, and it is
cost-effective and comprehensive.
Storm
Apache Storm is a cross-platform, distributed, fault-tolerant
framework for real-time stream processing. It is free and
open-source. Its developers include BackType and Twitter. It is
written in Clojure and Java. Its architecture is based on
customized spouts and bolts, which describe the sources of
information and the manipulations applied to them, in order to
permit batch, distributed processing of unbounded streams of
data. Groupon, Yahoo, Alibaba, and The Weather Channel are among
the well-known organizations that use Apache Storm. Benefits of
using Apache Storm for big data analytics: it is reliable at
scale, very fast, and fault-tolerant, it guarantees the
processing of data, and it has many use cases, including
real-time analytics, log processing, ETL
(Extract-Transform-Load), continuous computation, distributed
RPC, and machine learning. Disadvantages of using Apache Storm
for big data analytics: it is difficult to learn and use, it is
difficult to debug, and the native scheduler and Nimbus can
become bottlenecks.
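The spout-and-bolt idea can be sketched with ordinary Python generators. This is only a single-process illustration of the data flow, not Storm's actual API; the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; yields one item at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: transforms each incoming sentence into individual words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) per update."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and keep the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


# Hypothetical input stream, invented for this example.
storm_counts = run_topology(["storm processes streams", "streams never end"])
print(storm_counts)
```

In Storm itself, each spout and bolt would run as parallel tasks across a cluster and the stream would be unbounded; here the input is a short list so that the example terminates.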
Apache SAMOA
SAMOA stands for Scalable Advanced Massive Online Analysis. It
is an open-source platform for big data stream mining and
machine learning. It allows you to create distributed streaming
machine learning (ML) algorithms and run them on multiple DSPEs
(distributed stream processing engines). Apache SAMOA's closest
alternative is the BigML tool. Benefits of using Apache SAMOA
for big data analytics: it is simple and fun to use, fast and
scalable, offers true real-time streaming, and has a Write Once
Run Anywhere (WORA) architecture.
Talend
Talend's big data integration products include the following.
Open Studio for Big Data comes under a free and open-source
license. Its components and connectors are Hadoop and NoSQL, and
it provides community support only. Big Data Platform comes with
a user-based subscription license. Its components and connectors
are MapReduce and Spark, and it provides web, email, and phone
support. Real-Time Big Data Platform also comes under a
user-based subscription license. Its components and connectors
include Spark Streaming, machine learning, and IoT, and it
provides web, email, and phone support. Benefits of using Talend
for big data analytics: it streamlines ETL and ELT for big data,
achieves the speed and scale of Spark, accelerates the move to
real time, handles multiple data sources, and provides numerous
connectors under one roof, which in turn allows you to customize
the solution to your needs. Disadvantages of using Talend for
big data analytics: community support could be better, the
interface could be improved and made easier to use, and it is
difficult to add a custom component to the palette.
RapidMiner
RapidMiner is a cross-platform tool that offers an integrated
environment for data science, machine learning, and predictive
analytics. It comes under various licenses that offer small,
medium, and large proprietary editions as well as a free edition
that allows for 1 logical processor and up to 10,000 data rows.
Organizations such as Hitachi, BMW, Samsung, and Airbus have
been using RapidMiner. Benefits of using RapidMiner for big data
analytics: an open-source Java core, the convenience of
front-line data science tools and algorithms, a code-optional
GUI, good integration with APIs and the cloud, and superb
customer service and technical support. However, RapidMiner's
online data services should be improved.
Analyzing the data sets using R language
Data simulation is a crucial stage in processing raw data: it
helps identify and trace patterns and generate reports that
enhance productivity. We took a sample data set from a computer
store and ran simulations in R to show the different types of
RAM available in the store and how they relate to hard disk
prices.
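The kind of summary described above can be sketched in a few lines. The paper's analysis was done in R and its data set is not reproduced here, so the following Python sketch uses made-up records and a hypothetical helper, `mean_price_by_ram`, purely to illustrate grouping hard disk prices by RAM size.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (ram_gb, hard_disk_price) records; not the paper's data set.
store_records = [(4, 50.0), (4, 60.0), (8, 80.0), (8, 90.0), (16, 120.0)]


def mean_price_by_ram(records):
    """Return the average hard disk price for each RAM size."""
    prices = defaultdict(list)
    for ram_gb, price in records:
        prices[ram_gb].append(price)
    return {ram_gb: mean(values) for ram_gb, values in sorted(prices.items())}


summary = mean_price_by_ram(store_records)
print(summary)
```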
Conclusion
The digital age has made it easier for professionals to access
the information that allows them to improve business performance
(Manikandan & Ravi, 2014). However, to use this data, you need
data analytics software that provides tools for data mining,
organization, analysis, and visualization. It should also be
equipped with machine learning and advanced algorithms to turn
raw data into meaningful insights quickly. In this way, you can
keep up with business trends and even find ways to further
improve your overall operations. However, many factors are
involved in finding the right analytics tool for a particular
business. From checking its performance to figuring out how well
it works with other systems, the research process can be
overwhelming. To help with this, we have compiled the leading
products on the market and assessed their functionality and ease
of use. Big data tools help us store huge volumes of data and
transform them into analytics that track, explain, and predict
patterns and improve the productivity of the organization. This
should make it easier to determine the best data analytics
platform for your operations.
References
Bhosale, H. S., & Gadekar, D. P. (2014). A review paper on big data and hadoop. International Journal of Scientific and Research Publications, 4(10), 1-7.
Chandarana, P., & Vijayalakshmi, M. (2014, April). Big data analytics frameworks. In 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA) (pp. 430-434). IEEE.
Manikandan, S. G., & Ravi, S. (2014, October). Big data analysis using Apache Hadoop. In 2014 International Conference on IT Convergence and Security (ICITCS) (pp. 1-4). IEEE.
Talia, D. (2013). Clouds for scalable big data analytics. Computer, (5), 98-101.
Allen, G., Campbell, F., & Hu, Y. (2015). Comments on “visualizing statistical models”: Visualizing modern statistical methods for Big Data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(4), 226-228. doi: 10.1002/sam.11272
Griffith, D. (1993). Advanced spatial statistics for analysing and visualizing geo-referenced data. International Journal of Geographical Information Systems, 7(2), 107-123. doi: 10.1080/02693799308901945
Create a PowerPoint presentation for the Sun Coast Remediation
research project to communicate the findings and suggest
recommendations. Please use the following format:
· Slide 1: Include a title slide.
· Slide 2: Organize the agenda.
· Slide 3: Introduce the project.
. Statement of the Problems
. Research Objectives
· Slide 4: Describe information gathered from the literature
review.
· Slide 5: Include research methodology, design, and methods.
. Research Methodology
. Research Design
. Research Methods
. Data collection
· Slide 6: Include research questions and hypotheses
· Slides 7 and 8: Explain your data analysis.
· Slides 9 and 10: Explain your findings.
· Slide 11: Explain recommendations including an explanation
of how research-based decision-making can directly affect
organizational practices.
· Slide 12 and 13: Reflect on your experience throughout the
course. Provide some of the things you learned and some of the
course’s takeaways that you can apply to your current or future
job.
· Slide 14: Include references for your sources.
Your PowerPoint must be a minimum of fourteen slides in
length (including the title slide and a reference slide).
Running head: INSERT TITLE HERE
Insert Title Here
Insert Your Name Here
Insert University Here
Table of Contents
Include the table of contents here. There is a tool for creating a
table of contents in the References tab of the Microsoft Word
tool bar at the top of the screen. Delete this before you begin.
Executive Summary
The executive summary will go here. The paragraphs are not
indented, and it should be formatted like an abstract. The
executive summary should be composed after the project is
complete. It will be the final step in the project. Delete this
before you begin.
Introduction
Note: The following introduction should remain in the research
project unchanged. Delete this note before you begin.
Senior leadership at Sun Coast has identified several areas for
concern that they believe could be solved using business
research methods. The previous director was tasked with
conducting research to help provide information to make
decisions about these issues. Although data were collected, the
project was never completed. Senior leadership is interested in
seeing the project through to fruition. The following is the
completion of that project and includes the statement of the
problems, literature review, research objectives, research
questions and hypotheses, research methodology, design, and
methods, data analysis, findings, and recommendations.
Statement of the Problems
Note: The following statement of the problems should remain in
the research project unchanged. Delete this note before you
begin.
Six business problems were identified:
Particulate Matter (PM)
There is a concern that job-site particle pollution is adversely
impacting employee health. Although respirators are required in
certain environments, PM varies in size depending on the
project and job site. PM that is between 10 and 2.5 microns can
float in the air for minutes to hours (e.g., asbestos, mold spores,
pollen, cement dust, fly ash), while PM that is less than 2.5
microns can float in the air for hours to weeks (e.g. bacteria,
viruses, oil smoke, smog, soot). Due to the smaller size of PM
that is less than 2.5 microns, it is potentially more harmful than
PM that is between 10 and 2.5 since the conditions are more
suitable for inhalation. PM that is less than 2.5 is also able to be
inhaled into the deeper regions of the lungs, potentially causing
more deleterious health effects. It would be helpful to
understand if there is a relationship between PM size and
employee health. PM air quality data have been collected from
103 job sites, which is recorded in microns. Data are also
available for average annual sick days per employee per job-
site.
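A relationship between two numeric variables like these is usually assessed with a correlation coefficient. The following is a minimal sketch of Pearson's r in plain Python; the function name `pearson_r` and the sample values are hypothetical and are not the Sun Coast data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: PM size in microns and mean annual sick days.
pm_microns = [1.0, 2.0, 4.0, 6.0, 8.0]
sick_days = [9.0, 8.0, 6.0, 5.0, 3.0]
r = pearson_r(pm_microns, sick_days)
print(round(r, 3))
```

A negative r in this made-up sample would mean that smaller particles go with more sick days; whether that holds for the real data is exactly what the project has to test.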
Safety Training Effectiveness
Health and safety training is conducted for each new contract
that is awarded to Sun Coast. Data for training expenditures and
lost-time hours were collected from 223 contracts. It would be
valuable to know if training has been successful in reducing
lost-time hours and, if so, how to predict lost-time hours from
training expenditures.
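Predicting one variable from another in this way is a simple linear regression. The sketch below fits a least-squares line in plain Python; `fit_line` and the four data points are assumptions for illustration, not the contract data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: training expenditure and lost-time hours per contract.
training_spend = [100.0, 200.0, 300.0, 400.0]
lost_time_hours = [90.0, 70.0, 50.0, 30.0]
slope, intercept = fit_line(training_spend, lost_time_hours)

# Predicted lost-time hours for a contract with 250.0 in training spend.
predicted = intercept + slope * 250.0
print(slope, intercept, predicted)
```

A negative slope would indicate that lost-time hours fall as training expenditure rises; the significance of the fit still has to be tested separately.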
Sound-Level Exposure
Sun Coast’s contracts generally involve work in noisy
environments due to a variety of heavy equipment being used
for both remediation and the clients’ ongoing operations on the
job sites. Standard ear-plugs are adequate to protect employee
hearing if the decibel levels are less than 120 decibels (dB). For
environments with noise levels exceeding 120 dB, more
advanced and expensive hearing protection is required, such as
earmuffs. Historical data have been collected from 1,503
contracts for several variables that are believed to contribute to
excessive dB levels. It would be important if these data could
be used to predict the dB levels of work environments before
placing employees on-site for future contracts. This would help
the safety department plan for procurement of appropriate ear
protection for employees.
New Employee Training
All new Sun Coast employees participate in general health and
safety training. The training program was revamped and
implemented six months ago. Upon completion of the training
programs, the employees are tested on their knowledge. Test
data are available for two groups: Group A employees who
participated in the prior training program and Group B
employees who participated in the revised training program. It
is necessary to know if the revised training program is more
effective than the prior training program.
Lead Exposure
Employees working on job sites to remediate lead must be
monitored. Lead levels in blood are measured as micrograms of
lead per deciliter of blood (μg/dL). A baseline blood test is
taken pre-exposure and postexposure at the conclusion of the
remediation. Data are available for 49 employees who recently
concluded a 2-year lead remediation project. It is necessary to
determine if blood lead levels have increased.
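Because each employee is measured twice, this question fits a dependent samples (paired) t test. The following sketch computes the paired t statistic and its degrees of freedom in plain Python; `paired_t` and the five pairs of values are hypothetical, and the p-value lookup is left out.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for after minus before."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1


# Hypothetical blood lead levels (micrograms per deciliter) for 5 employees.
pre_exposure = [5.0, 6.0, 4.0, 7.0, 5.0]
post_exposure = [6.0, 8.0, 5.0, 7.0, 7.0]
t_stat, df = paired_t(pre_exposure, post_exposure)
print(round(t_stat, 4), df)
```

The statistic would then be compared with the critical value of the t distribution for the given degrees of freedom. The sketch assumes the differences are not all identical, since a zero standard deviation would make the division fail.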
Return on Investment
Sun Coast offers four lines of service to their customers,
including air monitoring, soil remediation, water reclamation,
and health and safety training. Sun Coast would like to know if
each line of service offers the same return on investment.
Return on investment data are available for air monitoring, soil
remediation, water reclamation, and health and safety training
projects. If return on investment is not the same for all lines of
service, it would be helpful to know where differences exist.
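Comparing the mean return on investment across four groups is a one-way ANOVA problem. The sketch below computes the F statistic by hand in plain Python; `one_way_anova_f` and the ROI figures are assumptions for illustration, not Sun Coast's data.

```python
def one_way_anova_f(groups):
    """F statistic with between-group and within-group degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ROI values (percent) for the four lines of service.
air = [1.0, 2.0, 3.0]
soil = [2.0, 3.0, 4.0]
water = [5.0, 6.0, 7.0]
training = [2.0, 3.0, 4.0]
f_stat, df_between, df_within = one_way_anova_f([air, soil, water, training])
print(f_stat, df_between, df_within)
```

A large F only says that at least one mean differs. Finding where the differences lie would need a follow-up (post hoc) comparison, which is not shown here.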
Literature Review
After providing a brief introduction to this section, students
should include the literature review information here. Delete
this before you begin.
Research Objectives
After providing a brief introduction to this section, students
should include research objectives here. Delete this before you
begin.
RO1:
RO2:
RO3:
RO4:
RO5:
RO6:
Research Questions and Hypotheses
After providing a brief introduction to this section, students
should state the research questions and hypotheses. Delete this
before you begin.
Research Methodology, Design, and Methods
After providing a brief introduction to this section, students
should detail the research design they have selected. Use the
following subheadings to include all required information.
Delete this before you begin.
Research Methodology
Research Design
Research Methods
Data Collection Methods
Sampling Design
Data Analysis Procedures
Data Analysis: Descriptive Statistics and Assumption Testing
After providing a brief introduction to this section, students
should provide the Excel Toolpak results of their descriptive
analyses. Use the following subheadings to include all required
information. Delete this before you begin.
Correlation: Descriptive Statistics and Assumption Testing
Simple Regression: Descriptive Statistics and Assumption
Testing
Multiple Regression: Descriptive Statistics and Assumption
Testing
Independent Samples t Test: Descriptive Statistics and
Assumption Testing
Dependent Samples (Paired-Samples) t Test: Descriptive
Statistics and Assumption Testing
ANOVA: Descriptive Statistics and Assumption Testing
Data Analysis: Hypothesis Testing
After providing a brief introduction to this section, students
should provide the Excel Toolpak results of their hypothesis
testing. Use the following subheadings to include all required
information. Delete this before you begin.
Correlation: Hypothesis Testing
Simple Regression: Hypothesis Testing
Multiple Regression: Hypothesis Testing
Independent Samples t Test: Hypothesis Testing
Dependent Samples (Paired Samples) t Test: Hypothesis Testing
ANOVA: Hypothesis Testing
Findings
After providing a brief introduction to this section, students
should discuss the findings in the context of Sun Coast’s
problems and the associated research objectives and questions.
Important Note: Students should refer to the information
presented in the Unit VII Study Guide and the Unit VII Syllabus
instructions to complete this section of the project. Restate each
research objective, and discuss them in the context of your
hypothesis testing results. The following are some things to
consider. What answers did the analysis provide to your research
questions? What do those answers tell you? What are the
implications of those answers? Delete these statements before
you begin.
Example:
RO1: Determine if a person’s height is related to weight.
The results of the statistical testing showed that a person’s
height is related to their weight. It is a relatively strong and
positive relationship between height and weight. We would,
therefore, expect to see in our population taller people having a
greater weight relative to those of shorter people. This
determination suggests restrictions on industrial equipment
should be stated in maximum pounds allowed rather than
maximum number of people allowed.
RO2:
RO3:
RO4:
RO5:
RO6:
Recommendations
After providing a brief introduction to this section, students
should include recommendations here in paragraph form. This
section should be your professional thoughts based upon the
results of the hypothesis testing. You are the researcher, and
Sun Coast's leadership team is relying on you to make evidence-
based recommendations. Delete these statements before you
begin.
References
Include references here using hanging indentations, and delete
these statements and example reference.
Creswell, J. W., & Creswell, J. D. (2018). Research design:
Qualitative, quantitative, and mixed methods approaches (5th
ed.). Thousand Oaks, CA: Sage.