Paper presented at the International Conference on Data Science and Analytics (ICDSA'21), organized by Rathinam College of Arts and Science, Tamil Nadu, India, on 19th February 2021
Data Lakes: A Tool for Minimizing Expenditure on Storage – A Survey
Muvvala Sai Phanindra
3rd year B.Tech, Department of Information Technology,
Hindustan Institute of Technology and Science,
#1, IT Expressway, Bay Range Campus, Padur, Chennai – 603103, Tamil Nadu, India.
18132003@student.hindustanuniv.ac.in
Dr. C.V. Suresh Babu
Professor, Department of Information Technology,
Hindustan Institute of Technology and Science,
#1, IT Expressway, Bay Range Campus, Padur, Chennai – 603103, Tamil Nadu, India.
pt.cvsuresh@hindustanuniv.ac.in
Abstract— This paper presents an intensive survey of data lakes and of how they have been used to minimize expenditure on the ever-increasing need for storage.
Keywords—Cost Benefit Analysis, Economics, Data Lakes,
Storage.
I. INTRODUCTION
System failures often disrupt the process of storing data and make the data more difficult to work with. As a solution to this problem, a data lake can be created to provide a permanent connection between the device that sends the data and the system that receives it.
II. RATIONALE AND BACKGROUND
The basic need for this study arises from the increase in expenditure on storage devices caused by rapid data growth.
III. OBJECTIVES
Primary objective: To show that expenditure on storage can be minimized using data lakes.
Secondary objective: To show that data lakes can minimize expenditure on storage devices without compromising the security of the raw data being stored.
IV. REVIEW OF LITERATURE
Although the implementation of real-time analytics tools may be expensive, they eventually save a lot of money. Tools such as Hadoop and cloud-based analytics can bring cost advantages to a business when large amounts of data must be stored, and they also help in identifying more efficient ways of doing business (Abdelrahman Elsharawy, January 1, 2019).
Modern tools are allowing analysts to analyze more data,
more quickly, which increases their personal productivity.
In addition, the insights gained from those analytics often allow organizations to increase productivity more broadly throughout the company (Amy ElMahalawy, January 1, 2019).
Data scientists and experts are among the most highly coveted and highly paid workers in the IT field. Respondents ranked skills and staff as the second-biggest challenge when creating a data lake. Hiring or training staff can increase costs considerably, and the process of acquiring skills can take considerable time (Khaled Gad, January 1, 2019).
Many of the tools in use today are incompatible with one another. Hadoop is the most commonly used tool for analytics; however, the standard version of Hadoop is not currently able to handle real-time analysis (Islam Mousa, January 1, 2019).
Many of today’s tools rely on open-source technology,
which dramatically reduces software costs, but
enterprises still face significant expenses related to
staffing, hardware, maintenance, and related services. It’s
not uncommon for big data analytics initiatives to run
significantly over budget and to take more time to deploy
than IT managers had originally anticipated (Mostafa Elshahawy, January 1, 2019).
A large part of the new data on which researchers work belongs to companies (which aggregate it from their clientele), and the benefits to these companies of drawing on researchers' knowledge of the data are not always comparable to the costs of disclosing it. The unstructured nature of the data represents a challenge in econometric terms: even just separating the dependencies between the series studied is the most important technical challenge with this type of data, and it requires the development of new regression tools (Alex Bekker, March 21, 2018).
Economists have been compelled to develop new skills, specifically in advanced software and languages (SQL, R and Xlstat) as well as machine learning algorithms, in order to combine the conceptual framework of economic research with the ability to apply ideas to massive databases. The highly publicized profession of "data scientist", which consists of analyzing data in order to find empirical models, lies exactly at the crossroads of computer science and econometric analysis. The extraction and synthesis of the various variables and the search for relations between them will therefore become important parts of the work of economists and will require new skills in computer science and databases (Alex Bekker, March 21, 2018).
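As a minimal illustration of this kind of work, the following Python sketch fits an ordinary least squares regression to search for relations between variables; the variable names and synthetic data are hypothetical stand-ins for the massive databases described above.

    # Minimal sketch: searching for relations between economic variables with OLS.
    # The data is synthetic; real work would extract variables from a database.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    income = rng.normal(50_000, 10_000, size=500)    # hypothetical explanatory variable
    ad_spend = rng.normal(200, 50, size=500)         # hypothetical explanatory variable
    spending = 0.4 * income + 30 * ad_spend + rng.normal(0, 2_000, size=500)

    X = np.column_stack([income, ad_spend])
    model = LinearRegression().fit(X, spending)
    print("coefficients:", model.coef_)              # estimated relation per variable
    print("R^2:", model.score(X, spending))          # variation explained by the model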
User-level algorithms have difficulty answering "why". Largely speaking, there are only two ways to analyze user-level data: one is to aggregate it into a "smaller" data set in some way and then apply statistical or heuristic analysis; the other is to analyze the data set directly using algorithmic methods. Both can result in predictions and recommendations (e.g. move spend from campaign A to B), but algorithmic analyses tend to have difficulty answering "why" questions (e.g. why should we move spend) in a manner comprehensible to the average marketer. Certain types of algorithms, such as neural networks, are black boxes even to the data scientists who designed them, which leads to the next limitation (Balar Khalid, T., 2017).
User data is not suited for producing learnings. This will probably strike you as counter-intuitive: big data = big insights = big learnings, right? Wrong! For example, let's say you apply big data to personalize your website, increasing overall conversion rates by 20%. While certainly a fantastic result, the only learning you get from the exercise is that you should indeed personalize your website. This result certainly raises the bar on marketing, but it does nothing to raise the bar for marketers (matthew.aslett, 2016).
Actionable learnings that require user-level data – for instance, applying a look-alike model to discover previously untapped customer segments – are relatively few and far between, and require a great deal of effort to uncover. Boring, ol' small data remains far more efficient at producing practical real-world learnings that you can apply to execution today (matthew.aslett, 2016).
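As a minimal sketch of the look-alike idea mentioned above, the following Python code finds the prospects most similar to a set of seed customers via nearest-neighbor search; the features and data are hypothetical placeholders, and production look-alike models are considerably more elaborate.

    # Minimal look-alike sketch: given seed customers, find the most similar
    # prospects by nearest-neighbor search over (hypothetical) feature vectors.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(1)
    prospects = rng.normal(size=(1_000, 4))   # hypothetical prospect features
    seeds = rng.normal(size=(20, 4))          # known good customers to "look like"

    nn = NearestNeighbors(n_neighbors=5).fit(prospects)
    _, idx = nn.kneighbors(seeds)             # the 5 closest prospects per seed
    lookalikes = np.unique(idx.ravel())       # candidate untapped segment
    print(f"{lookalikes.size} look-alike prospects found")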
Big data is the realization of competitive advantage based on the fact that it is now more economically feasible to store and process data that was previously ignored due to cost and the functional limitations of traditional data management technologies in handling its volume, velocity and variety (matthew.aslett, 2016).
Storing and analyzing the large volumes of data that are crucial to a company's work requires a vast and complex hardware infrastructure. If more, and more complex, data is stored, more hardware systems will be needed (Alexandru Adrian Tole, 2013).
A hardware system can only be reliable over a certain period of time. Intensive use and, more rarely, production faults will almost certainly result in a system malfunction eventually. Companies cannot afford to lose the data they have gathered over the years, nor to lose their clients. To avoid such catastrophic events, they use a backup system that performs the simple operation of storing all data. By doing this, companies obtain continuity even when they are temporarily set back. The challenge is to maintain the level of service they provide when, for example, a server malfunction occurs right while a client is uploading files to it (Alexandru Adrian Tole, 2013).
To achieve continuity, hardware systems are backed by software solutions that respond by redirecting traffic to another system in order to maintain fluency. When a fault occurs, a user is usually not affected and continues working without even noticing that something has happened. To obtain accurate information, the flow of data must not be interrupted. For example, Google sends a search request to multiple servers rather than to only one; this shortens the response time and also avoids inconsistency in the data that users send and receive (Alexandru Adrian Tole, 2013).
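A minimal sketch of this redundant-request idea follows: the same query is fired at several replicas and the first successful reply wins. The replica URLs are hypothetical placeholders, and real systems add retries, health checks and load balancing.

    # Minimal sketch: send one request to multiple replicas, take the first reply.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.request import urlopen

    REPLICAS = [                                 # hypothetical replica endpoints
        "http://replica-1.example.com/search?q=data+lakes",
        "http://replica-2.example.com/search?q=data+lakes",
        "http://replica-3.example.com/search?q=data+lakes",
    ]

    def fetch(url: str) -> bytes:
        with urlopen(url, timeout=2) as resp:    # slow replicas are abandoned
            return resp.read()

    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        futures = [pool.submit(fetch, url) for url in REPLICAS]
        for future in as_completed(futures):     # first successful reply wins
            try:
                print(future.result()[:100])
                break
            except OSError:
                continue                         # that replica failed; try the next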
To prevent such inconsistency, the sender must generate a "key" for any content that is transmitted. This key is then transferred to the receiver, which compares it with the key it generated over the data it received. If both keys are identical, then the "send-receive" process completed successfully. For better understanding, this solution is similar to the MD5 hash that is generated over compressed content, but in this case the keys are compared automatically (Alexandru Adrian Tole, 2013).
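A minimal Python sketch of this key-comparison idea, using an MD5 checksum as the paragraph suggests, follows; the payload is a hypothetical placeholder.

    # Minimal sketch: sender and receiver each hash the content; matching
    # digests mean the send-receive process completed successfully.
    import hashlib

    payload = b"50 years of raw company records"    # hypothetical content
    sender_key = hashlib.md5(payload).hexdigest()   # computed before sending

    received = payload                              # what actually arrived
    receiver_key = hashlib.md5(received).hexdigest()

    if sender_key == receiver_key:
        print("send-receive completed successfully")
    else:
        print("transfer corrupted: keys differ")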
Losing data is not always a hardware problem. Software can malfunction as well and cause irreparable, and more dangerous, data loss. If one hard drive fails, there is usually another one to back it up, so no harm is done to the data; but when software fails due to a programming bug or a flaw in the design, data is lost forever. To overcome this problem, programmers have developed a series of tools that reduce the impact of a software failure. A simple example is Microsoft Word, which saves a user's work from time to time in order to prevent its loss in case of hardware or software failure. This is the basic idea behind preventing complete data loss (Alexandru Adrian Tole, 2013).
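A minimal sketch of this periodic-autosave idea follows: a timer snapshots the in-memory state to a recovery file every few seconds, so a crash loses at most one interval of work. The file name, interval and document structure are hypothetical.

    # Minimal autosave sketch: periodically snapshot state to a recovery file.
    import threading

    document = {"text": ""}                 # in-memory state being edited
    AUTOSAVE_SECONDS = 5.0

    def autosave() -> None:
        with open("recovery.bak", "w") as f:
            f.write(document["text"])       # snapshot the current state
        t = threading.Timer(AUTOSAVE_SECONDS, autosave)
        t.daemon = True                     # don't keep the program alive just to save
        t.start()                           # re-arm: saving repeats every interval

    autosave()                              # start the autosave loop
    document["text"] += "edits the user keeps typing..."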
Analytics can be hard to scale as an organization and the
amount of data it collects grows. Collecting information
and creating reports becomes increasingly complex. A
system that can grow with the organization is crucial to
manage this issue. While overcoming these challenges
may take some time, the benefits of data analysis are well
worth the effort. Improve your organization today and
consider investing in a data analytics system (Rebecca Webb, November 25, 2020).
There is a skills shortage for data scientists. Closing this
gap, however, is proving to be extremely difficult. It’s not
just a matter of training people to work with big data
analytics solutions, either. “The data science field has an
experience shortage,” explains Daniel Zhao, a senior
economist at Glassdoor. “There are plenty of recent grads
who can throw a hodgepodge of models at a data set, but
there’s a serious shortage of experienced and qualified
workers who have the full combination of technical skills,
business expertise, and domain knowledge.” (Justin Reynolds, February 3, 2020)
Many organizations reduce the pain of the data science
skills gap using automated machine learning (AutoML),
which involves automating repetitive tasks. With
AutoML, data scientists can use their time to focus on
business problems instead of getting bogged down with
code. AutoML isn’t the complete answer to the data
science skills crisis. But it can help analytics teams
accomplish more when they lack experienced personnel.
(Justin Reynolds, February 3, 2020)
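AutoML frameworks vary widely, so as a hedged, minimal stand-in for the idea, the following scikit-learn sketch automates just the repetitive model-tuning loop; real AutoML tools go further and also automate feature engineering and model selection.

    # Minimal stand-in for the AutoML idea: automate repetitive tuning so the
    # analyst can focus on the business problem rather than on code.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
        cv=5,
    )
    search.fit(X, y)                         # tries every combination automatically
    print(search.best_params_, search.best_score_)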
CapGemini's report found that 37% of companies have
trouble finding skilled data analysts to make use of their
data. The best bet is to form one common data analysis team for the company, either by re-skilling current workers or by recruiting new workers specialized in big data. You need to find employees who not only understand data from a scientific perspective, but who also understand the business and its customers, and how their data findings apply directly to them (Ewout Meyns, January 31, 2020).
If you’re using multiple channels to capture data, such as
through your website, customer care centre and marketing
leads, you’re running the risk of collecting duplicate
information. There are tools to help you remove duplicate data; for instance, if you work with Google Contacts, you can merge your contacts (Ewout Meyns, January 31, 2020).
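As a minimal sketch of such deduplication, the following Python code keeps one record per email address across capture channels; the records, fields and matching rule are hypothetical, and real tools match far more fuzzily.

    # Minimal deduplication sketch: one row per identity key across channels.
    import pandas as pd

    records = pd.DataFrame({
        "email":   ["a@x.com", "a@x.com", "b@y.com"],
        "channel": ["website", "care centre", "marketing"],
    })

    deduped = records.drop_duplicates(subset="email", keep="first")
    print(deduped)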
Summary of Review of Literature
From the review of literature, we took away how data can be minimized using methods such as compression, deduplication, and tiering. It also helped us integrate our technology-related problem statement with economics and move further.
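Of the methods named above, compression is the easiest to illustrate; the following minimal Python sketch shows it on a hypothetical repetitive payload, and the savings shown depend entirely on the data.

    # Minimal compression sketch: shrink raw data before it lands in storage.
    import gzip

    raw = b"raw sensor reading," * 1_000
    compressed = gzip.compress(raw)

    print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
    print(f"saved {100 * (1 - len(compressed) / len(raw)):.1f}% of storage")
    assert gzip.decompress(compressed) == raw   # lossless: original is recoverable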
V. FUTURE SCOPE OF THE STUDY
Due to rapid data growth throughout the world, we need a more efficient, more secure, cheaper and easier way to store that huge volume of data. Further research can examine all the possible storage types and push this level of storage efficiency even further.
VI. CONCLUSION
Nowadays every company stores huge volumes of data in the cloud and processes that data using big data techniques. Buying this cloud storage costs a lot of money. One alternative to cloud storage is the data lake.
Data lakes: a data lake is a huge store that can be bought at an affordable price and used to hold raw data.
The disadvantage of a data lake is that the data stored in it cannot be processed in place; it is kept in raw form.
Solution to the data lake disadvantage, with an example: suppose a company X has existed for the past 50 years. To store its data, such a company would normally use cloud-based storage and process the data with a database or big data tools. Our solution can decrease the amount of money spent on data storage: company X can buy a data lake at a comparatively cheaper price and store its whole 50 years of data in it. If the company then needs to process some data from, say, 1998, it can import that particular year's data from the data lake to any local storage and process it with big data tools.
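A minimal sketch of that 1998 import follows, assuming (hypothetically) that the lake holds Parquet files partitioned by year and is readable through pyarrow; the bucket path and column names are placeholders.

    # Minimal sketch: pull one year's partition out of the lake for local processing.
    import pyarrow.dataset as ds

    # Hypothetical lake location with hive-style year=... partitioning.
    lake = ds.dataset("s3://company-x-lake/records/", format="parquet",
                      partitioning="hive")

    # Import only the 1998 partition, leaving the other 49 years in cheap storage.
    table_1998 = lake.to_table(filter=ds.field("year") == 1998)
    table_1998.to_pandas().to_csv("local_1998.csv", index=False)  # local copy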
ACKNOWLEDGMENT
We thank all the faculty members of our department, our classmates, and the anonymous reviewers for their valuable comments on our draft paper.
DISCLOSURE STATEMENT
No potential conflict of interest was reported by the
authors.
REFERENCES
[1] Abdelrahman Elsharawy (January 1, 2019). Advantages and disadvantages of big data. https://www.vapulus.com/en/advantages-and-disadvantages-of-big-data/
[2] Balar Khalid (August 2019). Big Data in Economic Analysis: Advantages and Challenges. https://www.researchgate.net/publication/335234998_BIG_DATA_IN_ECONOMIC_ANALYSIS_ADVANTAGES_AND_CHALLENGES
[3] Kulraj Smagh (October 7, 2017). Limitations of big data analytics. https://www.ciklum.com/blog/limitations-of-big-data-analytics/
[4] Liran Einav, Jonathan Levin (November 7, 2014). Economics in the age of big data. Science, Vol. 346, Issue 6210, 1243089. DOI: 10.1126/science.1243089. https://science.sciencemag.org/content/346/6210/1243089
[5] Mila Slesar (January 2020). Pros and cons of big data for businesses. https://onix-systems.com/blog/the-pros-and-cons-of-big-data-for-businesses