1. GADL Journal of Recent Innovations in Engineering and Technology (JRIET)
Volume 1 – Issue 1, January, 2016
Neha et al., JRIET www.gadlonline.com
A Survey on Managing Big Data Through Hadoop Ecosystems
Neha Trivedi
1
Jayoti Vidhyapeeth Women’s University, Jaipur, India
Department of Computer Science & Engineering
Abstract: Around the year 2000, a new technology arrived with a great deal of buzz: big data. It has grown rapidly in industry. This paper strives to describe the speed at which data is growing and the various attributes involved in managing this huge data. Data grows more and more swiftly every year; reports say that each year the data doubles over the previous year. Billions of people use the internet, reportedly most of the world's population has at least one Gmail account, accessed on average four times a day, which creates gigantic data to handle. So this paper basically deals with how we administer this data for storage, and with analytics on top of it for efficient and logical results. From 2009 to 2015 data grew by more than 50%. This paper also deals with the various challenges in security, speed, and volume, and covers the processing of data, the security challenges of big data, and the Hadoop framework and its working criteria for storing and analytically accessing this vast data.
Index Terms— analytics, data, security, volume
I. INTRODUCTION
Big data began with a progression. In the early years, data was generated and accumulated by workers: employees of a company entered data into computers based on relational database systems (RDBMS). Then things moved to the internet, and end users began generating their own data, e.g. on Facebook and LinkedIn, where they contribute their own content. In terms of volume this is much larger than the first stage (RDBMS); the data users generate is larger in magnitude than the data employees enter, so all of a sudden the accumulated data is much higher than in traditional systems. The third level of progression is machines accumulating their own data: monitors, electricity meters, and satellites. Satellites continuously orbit the earth taking pictures and measuring humidity and temperature 24 hours a day, so machine-generated data is greater still than end-user data.
1. Employees generate data (RDBMS).
2. Users generate data on social sites, mail, and accounts.
3. Machines generate colossal amounts of data, e.g. satellites and monitors.
In volume: machine > user > employees.
Big data is generated from:
1. Online shopping
2. Facebook
3. Airlines
4. Hospitality data
Figure 1: Big data life cycle for the accumulation of data
II. HISTORY OF BIG DATA
In 2002, Google faced the problem of stockpiling and analyzing large data, so Google went back to fundamentals and used the concept of a distributed file system to save the data. Data had been increasing exponentially since the 1990s, and handling it manually through employees was a problem, so Google wrote a white paper called the Google File System (GFS), and in 2004 the MapReduce concept followed. Doug Cutting (inventor of the Lucene search engine) worked on implementing Google's white papers. Around 2002-04 Yahoo's search engine was facing the same problem as Google, and Yahoo hired Doug Cutting for a complete implementation of the white paper. Google had solved the problem of storing bulk data by applying a distributed file system, but the problem was not yet finished: Google also wanted to do analytics on top of that data, so it wrote another white paper, the Google MapReduce algorithm, to solve the problem of data analysis.
The combination of both (the Google File System and MapReduce) is called Hadoop. Hadoop is basically a tool for solving large-data problems logically.
These are the two core components of Hadoop. Hadoop is used to:
1. Save large files and data (distributed file system).
2. Perform analytics (MapReduce algorithm).
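The two components can be illustrated with the classic MapReduce word count, sketched here in plain Python. This is a conceptual illustration of the map and reduce phases only, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# In Hadoop, the splits would live on different DataNodes and the map
# tasks would run in parallel next to the data; here we run sequentially.
splits = ["big data grows", "data grows swiftly"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 1, 'data': 2, 'grows': 2, 'swiftly': 1}
```

The same two functions scale from this toy input to terabytes because neither phase ever needs the whole dataset in one place.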
Sources of generating Big Data
An Airbus generates 10 TB every 30 minutes, and about 640 TB is generated in one flight. A smart meter that reads usage every 15 minutes records 350 billion transactions in a year. In 2009 there were 76 million smart meters; by 2017 there will be more than 250 million. More than 2 billion people use the internet; by 2014 CISCO estimated internet traffic at 4.8 ZB per year. There are 200 million blog entries on the web, and 300 billion mails are sent every day. Companies such as Facebook, Twitter, and LinkedIn generate approximately 25 TB of data daily; Twitter alone generates 12 TB daily, with 200 million users generating 230 million tweets daily and 97,000 tweets sent per second. 2.9 billion hours of video are watched per month. Big data also comes from the trading sector, which produces 1 TB per day. In 2009 total data was estimated at 1 ZB, and by 2020 it will reach 35 ZB.
Why big data matters
Right now we are living in a data world; the important questions are how to store data and how to process it. Data that is beyond our storing capacity and processing power is big data. Data is being created and gathered everywhere, with the maximum possible availability for analysis. Big data matters because analytics on the data yields proper results for future use. Big data has a different perspective for each domain, from business to healthcare to science departments to social networks. From a business perspective, many e-commerce sites collect data on what customers previously used, influencing them to buy their products. In healthcare, we can generate a record of every disease, the better to know the progress we have made in solving healthcare problems.
Anticipating and influencing
Big data plays a very important role in anticipating, influencing, and to some extent controlling human activity. Hadoop-based software tracks the information a person handles online and shows it back to the user at intervals to prompt another purchase; this is a big strategy many companies use for selling their products. For example, I search Amazon for a book on "DBMS" and select many books but purchase nothing; some time later, when I log on from another device, I find the same results shown to me again. This is how online markets sell their products. For a sense of scale: 400 terabytes would hold a digital library of all books ever written in any language, and 200 yottabytes a holographic snapshot of the earth's surface.
Processing of data
A. Traditional system
Traditional systems use RDBMS: in the early stages, data was brought over to the processor (CPU, computer chip) to be processed. Now there is so much data that we cannot process it that way, so the solution is to use multiple processors and bring the processor to the data: a whole row of servers, each holding a small component of the data, each processed in parallel. That is called parallel processing.
"Before, data was brought to the processor, but now the processor is brought to the data." That means: before, data was brought to one CPU; now processing is brought to an arbitrarily large number of CPUs where the data resides. Data grows exponentially, so we brought processing to that next level for accessing this bulky data; this is called a technological shift. The technology that allows this for big data is called HADOOP.
Hadoop (an open-source platform, free to use) is the software that allows parallel processing and MapReduce to happen.
It is very strenuous to open a file of many gigabytes on one machine; it takes a prolonged time to open. Desktop companies and hospitals face this problem when analyzing data, which takes a huge amount of time to access, and big-data files start at terabytes. So Hadoop came into the trade for managing terabytes of data.
Hadoop provides an underlying distributed file system with an analytical algorithm on top. Around 2004-05, while Google was working on its white papers and Yahoo was implementing the MapReduce algorithm, IBM also came into the frame and defined big data in four different ways:
1. Velocity (the speed at which data is generated)
2. Variety (different types of data: audio, video, text)
3. Volume (the size of the huge data)
4. Veracity (whether the generated data is useful for future use)
B. Traditional database storage of data
Traditional systems have a fixed schema and contain only some classes of functions, based on RDBMS; they have only a minimal set of functions for dealing with data. But in early 2004-05 social sites came onto the market, and the data they generate is more and more unstructured, which cannot be handled by RDBMS for two reasons:
1. They do not have enough space to store such heavily bulky data (structured data is only about 10% of the total).
2. The data is unstructured (text, images, audio, video, and graphs make up about 90%).
B(i). Limitations of existing traditional solutions
1. Tables have a fixed schema.
2. High cost.
3. Cannot save huge files (access time is higher).
4. Analytical issues.
III. HADOOP AND ITS COMPONENTS
A simple definition of Hadoop: it is a combination of a distributed file system and an analytical algorithm. The beauty of this program is that it is an open-source data management system with no cost associated with it.
A. Ecosystem of Hadoop
"Ecosystem" refers to the surrounding areas required for a particular thing. In the case of Hadoop, the ecosystem is composed of HDFS, MapReduce, and tools and software packages such as Pig and Hive.
Components of Hadoop
(i). Pig: Pig is the tool for loading files into the system, manipulating them, and storing them. The name itself explains the tool: just as a pig can eat anything, the Pig Latin language is defined so that every type of data can be used for storing and transferring. Pig consists of two things:
Pig Latin (the language)
The execution engine (runtime environment)
(ii). Hive: data warehouses deal with a small portion of data, but Hive covers the entire population of data. Hive provides its own query language (HiveQL) and is compiled down to UDFs and MapReduce.
(iii). Pig Latin data analysis: an abstraction over MapReduce and the HDFS data structures.
(iv). Mahout engine: Mahout is an artificial (machine-learning) engine that works on human data. For example, a user searches for some kind of product on an e-commerce website in a browser, then closes the browser. When the browser is opened again some time later to do something different, it again shows results for the previously searched items. Mahout, then, is an engine for understanding human needs and prompting the user to take that product; it is a strategy applied by the e-commerce industry to influence users to buy products. As a library, Mahout holds the data each client has accessed previously: it basically creates a profile for everyone, saves what the client has done, and uses that data for further recommendations. Pig, Hive, and Mahout are three things that provide a layer of abstraction over the MapReduce algorithm.
Apache Oozie maintains the schedule of all the jobs running on the system: it helps to start jobs and stop jobs in the Hadoop world, and to schedule Hive and Mahout jobs as soon as possible.
The ecosystem also contains two ingestion frameworks:
1. Sqoop (structured data, i.e. RDBMS)
2. Flume (unstructured data, i.e. Facebook, LinkedIn)
Sqoop provides an interesting and very important facility: it takes data from an RDBMS and puts it into HDFS storage, and vice versa, providing a very effortless and manageable interface between RDBMS and HDFS. Flume brings unstructured data into the system, into HDFS and from HDFS to other places.
Hadoop works on two things: the distributed file system and the MapReduce system. Hadoop consists of two coordinating parts, the NameNode and the JobTracker. The NameNode handles storing, managing, and accessing big data files and belongs to the distributed file system; the JobTracker is for analytics on the data and belongs to MapReduce.
The NameNode is the master node, and the DataNode is a slave node responsible for the physical allocation of storage, directly correlated with the NameNode. The NameNode holds all the log (metadata) files generated with the help of the DataNodes.
Similarly, the JobTracker is the master tool and the TaskTracker is the slave. The main work of the JobTracker is to give instructions to the TaskTrackers for processing the data. The JobTracker does not itself know the addresses where the data is saved, so it sends a request to the NameNode for the metadata, from which it gets the addresses needed to access the data through the TaskTrackers.
Main functions of Hadoop:
1. Helps to save entire files.
2. The MapReduce framework does analytics on top of the entire set of files.
3. Takes very little time to do the analytics.
4. Scale-out architecture: adding machines reduces the amount of time processing takes.
Hadoop works on the master-slave method. The NameNode, DataNodes, and secondary NameNode are for the storage of data, and the JobTracker and TaskTrackers are for processing the data. Hadoop is the best solution for big data: it knows very well how to store big data and how to process it.
Hadoop is:
1. Easy to use: consists of many features and tools for maintaining software and hardware.
2. Scalable: easy synthesis of large data storage and memory management.
3. Distributed: works with parallel processing over an enormous number of servers, taking as little time as possible.
Hadoop is an open-source framework for large-scale data processing, written in Java; as Java is platform independent, so is Hadoop. Its file model is "write once, read many, but not changeable". Hadoop can process data fast, so its key advantage is massive scalability.
B. HDFS Architecture
The Hadoop Distributed File System is the file system component of the Hadoop framework. HDFS is for accessing and storing large data for fast retrieval. For storage we have the NameNode, the DataNodes, and the secondary NameNode (as the name suggests, a backup of the NameNode's state); for analytics we have the JobTracker and TaskTrackers. Within HDFS we deal only with the storing of data, which is done by the NameNode and DataNodes.
The HDFS architecture works on master-slave terminology, in which one process is the master and the others are slaves: the NameNode is the master and the DataNodes are slaves. First, the NameNode.
1. NameNode: the NameNode stores the metadata (data about data) of the cluster with the help of the DataNodes: file modifications, file permissions, and access times. Through the namespace tree, the NameNode is directly connected with the DataNodes. The NameNode holds all the information about all the data saved in the system, while the actual storing of data is done by the DataNodes.
A client sends a request to save a file. A DataNode saves that file at the nearest location, and the NameNode takes all the information about the file (its type, size, address, and space) and creates the metadata. The NameNode gives instructions to the DataNodes for every access regarding storing and retrieving data. In this way the NameNode is the master and the DataNodes are slaves.
2. DataNode: the DataNode is the slave to the master node. DataNodes work at the ground level to store the data accurately; a DataNode stores data in more than one place as a backup in case of any sort of mishap or data loss. A DataNode connects to the NameNode and checks the ID and address of the process; if they do not match, the DataNode is shut down. At particular intervals of time, a DataNode sends an acknowledgment to the NameNode listing the data accumulated on that node (the block report), so the NameNode always has a view of which cluster node each piece of data is stored on. Each DataNode also sends a heartbeat to the NameNode at short intervals so the NameNode knows the DataNode is alive (operating properly); the heartbeat is a signal to inform the NameNode. If the NameNode receives no heartbeat, it assumes the DataNode is dead and creates duplicate replicas on other nodes. The NameNode receives many block reports and heartbeats every minute, and the HDFS architecture manages all of this without any conflict in the NameNode's operation. The default size of one block in HDFS is 64 MB (128 MB in later versions).
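The heartbeat mechanism described above can be sketched as a small simulation. This is a toy illustration, not Hadoop's actual implementation; the node names and the 10-second timeout are assumptions chosen for the example:

```python
DEAD_AFTER = 10  # seconds of silence before a DataNode is presumed dead (illustrative)

class NameNode:
    """Tracks the time of the last heartbeat received from each DataNode."""
    def __init__(self):
        self.last_heartbeat = {}

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def dead_datanodes(self, now):
        # Any DataNode silent longer than DEAD_AFTER is presumed dead;
        # in real HDFS its blocks would then be re-replicated elsewhere.
        return [d for d, t in self.last_heartbeat.items() if now - t > DEAD_AFTER]

nn = NameNode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_heartbeat("dn1", now=6)   # dn2 goes silent after t=0

print(nn.dead_datanodes(now=12))     # only dn2 has missed its heartbeats: ['dn2']
```

The point of the design is that liveness is decided entirely by the master from passively received signals; a DataNode never has to announce its own failure.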
Characteristics of the architecture:
1. Consists of library files and modules.
2. Stores the data on commodity machines (cheap hardware).
C. MapReduce Framework
The MapReduce algorithm works on processing the data and producing some intelligent output from it. Data is stored in clusters for parallel processing across hardware. MapReduce consists of the JobTracker and TaskTrackers; the Hadoop system is master-slave, in which:
The master node contains:
* NameNode (HDFS layer)
* JobTracker (MapReduce)
The slave nodes contain:
* DataNode (HDFS layer)
* TaskTracker (MapReduce)
So in MapReduce we talk about the JobTracker and TaskTrackers. The JobTracker schedules and arranges the jobs requested by users, and the TaskTrackers execute jobs as directed by the JobTracker (the master).
Processing of analytical data
For example: a user sends a request for analytics. The request first goes to the JobTracker, but the JobTracker does not have the addresses of the data, so it contacts the NameNode, which holds the complete metadata of where the data is stored. After the addresses are fetched, the JobTracker gives instructions to the TaskTrackers to access and analyze that data according to the user's request, and the results are then returned.
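The request flow just described (client to JobTracker, JobTracker to NameNode for metadata, then dispatch to TaskTrackers) can be sketched as follows. This is a toy model with made-up file, block, and node names, not the real Hadoop RPC protocol:

```python
# Hypothetical metadata: which DataNode holds each block of a file.
namenode_metadata = {
    "sales.log": {"block_0": "datanode_1", "block_1": "datanode_2"},
}

def namenode_lookup(filename):
    """NameNode: return the block -> DataNode address map for a file."""
    return namenode_metadata[filename]

def tasktracker_run(datanode, block):
    """TaskTracker: run the analysis task next to the data it needs."""
    return f"analyzed {block} on {datanode}"

def jobtracker_submit(filename):
    """JobTracker: fetch addresses from the NameNode, then dispatch tasks."""
    locations = namenode_lookup(filename)            # ask the NameNode for metadata
    return [tasktracker_run(dn, blk) for blk, dn in locations.items()]

print(jobtracker_submit("sales.log"))
```

Note that the JobTracker never touches the data itself; it only routes work to wherever the NameNode says the blocks live, which is exactly the data-locality idea from the paragraph above.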
IV. DEMAND OF BIG DATA
Big data exists for faster and more intelligent decisions. Big data is characterized by three Vs describing big data sets: velocity, volume, and variety; big data does not denote any specific volume of data. The three Vs are:
1. Volume: data frequently increases at an exponential rate. Volume refers to the size of the data, which the Hadoop framework underlying big data manages.
2. Velocity: velocity refers to the rate at which data arrives; in every field, data is coming in at great rates.
3. Variety: variety concerns the types of data. In today's environment, users and machines generate different types of data daily as required (posts, images, audio, video), so we cannot put it in only one category. Structured data has a proper row pattern and schema; unstructured data holds arbitrary images, text, audio, and video, for which we cannot create a schema. For example, Facebook and Twitter daily generate roughly 25 and 15 terabytes of unstructured data respectively.
Variability: alongside velocity and variety, variability refers to the uncertainty in the production of data. There is no fixed percentage of unstructured data that will be produced; it changes every day, which makes it hard to manage. For example, during seasonal periods more and more people upload images and transfer videos and text; that fluctuation is what variability captures.
Hadoop goes well beyond RDBMS in that it:
1. Accesses complex data.
2. Manages structured and unstructured data.
3. Applies distributed parallel processing for fast access.
4. Supports machine learning algorithms.
5. Is free-access software (Hadoop is open source).
6. Performs real-time analytics of data using the MapReduce algorithm.
7. Gives intelligent results to the problem.
Hadoop is based on distributed parallel processing and the MapReduce algorithm; these two are the core constituents of Hadoop. The complete framework of Hadoop depends on the distributed system for storing data, with parallel processing applied for fast retrieval, and on MapReduce for generating intelligent results by applying analytics to the data. Two or three years ago Hadoop was implemented only at big companies like Google and Yahoo, search engines storing huge data, but nowadays every field of computer science, including healthcare, neural technology, and sensors, uses the Hadoop framework for managing data.
Hadoop works with client/server applications that access data stored on servers: the client sends a request to the server and the server responds to it. But behind this client-server relationship, many small processes interact of which the non-technical user is unaware: how the data is stored, how data is retrieved so speedily, and what interesting processing runs behind the scenes to access data accurately without anything misleading. All those answers lie in the HDFS architecture of a Hadoop-based system.
V. QUANTITY OF BIG DATA
Today's world is an environment of growing data, among other things. We have discussed so far that data grows exponentially; here we estimate the data that will be stored over the next 50 years, assuming accumulated data grows as e^x.
Let us assume that data generated by employees in healthcare = l1, data generated by social sites = m1, data generated by satellites = o1, and so on for many other fields.
For the year 2010: l1 + m1 + o1 + ... = x, and since data grows exponentially, the data in 2010 = e^x.
Likewise for 2011: l2 + m2 + o2 + ... = y, so the data in 2011 = e^y.
And for 2012: l3 + m3 + o3 + ... = z, so the data in 2012 = e^z.
This continues for many years (assume the next 50); call the final year's exponent n1. The total data over the next 50 years will be
Total = e^x + e^y + e^z + ... + e^n1        (equation 1)
Expanding e^x as a series:
e^x = 1 + x/1! + x^2/2! + x^3/3! + x^4/4! + ... , with general term x^n/n!
and similarly for e^y and e^z:
e^y = 1 + y/1! + y^2/2! + y^3/3! + ... , with general term y^n/n!
e^z = 1 + z/1! + z^2/2! + z^3/3! + ... , with general term z^n/n!
Putting these values into equation 1 and collecting the dominant terms gives
Total = x^n/n! + y^n/n! + z^n/n! + ... + n1^n/n! = (1/n!)[x^n + y^n + z^n + ... + n1^n]
Every year's data is greater than the previous year's, so when the number of years taken is very high, x, y, z, ... are all very small compared with n1 and can be neglected. The result after 50 years is therefore
Total data accumulated ≈ n1^n / n!
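The claim that the final year dominates the sum can be checked numerically. The yearly exponents below are illustrative assumptions (ten years growing linearly in the exponent, i.e. exponential growth in the data itself), not figures from any real dataset:

```python
import math

# Hypothetical yearly exponents (x, y, z, ..., n1 in the paper's notation):
# each year's combined data (healthcare + social sites + satellites + ...)
# grows, so later terms of the sum dominate.
yearly_exponents = [2 * i for i in range(1, 11)]   # 2, 4, ..., 20

total = sum(math.exp(v) for v in yearly_exponents)  # Total = e^x + e^y + ... + e^n1
last_term = math.exp(yearly_exponents[-1])          # the final year's e^n1

# The last (largest) term accounts for the bulk of the total, which is
# the paper's point: earlier years' x, y, ... are negligible next to n1.
share_of_last = last_term / total
print(f"total = {total:.3e}, last-term share = {share_of_last:.2%}")
```

With these assumed exponents, the final year alone contributes well over 80% of the 10-year total, so dropping the earlier terms is a reasonable approximation.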
VI. BIG DATA SECURITY
For storing, acquiring, and analyzing data the new technology is best, but there are also certain other aspects. Search engines, social sites, and healthcare providers hold every type of data we have produced so far, so others can learn about our personal surfing habits, which comes under unauthorized access.
Particular measures are required to overcome this security issue:
1. Proper understanding and responsibility taken by companies not to share private data without the user's permission.
2. Data transport security.
3. Government policies and protocols on accessing data.
4. Full control given to the user over what to share and with whom.
5. Cryptographic techniques applied to secure the data.
6. Authentication and authorization of users.
There is also the question of how to convert volume into value, because it is not necessarily the case that all the data we store will be beneficial to our future use; much of it is simply a waste of memory. To extract efficient value from it, new software is being created for this big data problem, focusing on MapReduce and increasing data locality to increase performance.
VII. CHALLENGES OF BIG DATA
1. Data is held by companies such as Facebook, and they have privacy concerns; you cannot ask Google for information about what people mostly search for.
2. Another issue is that in the past, social science was data-poor. As the quantity and quality of data grow and analyses become more mathematical, we need more commanding data storage capacity, and it is very arduous to take care of all this data.
So why is big data driving social media so fast? The answer is that data storage costs are lower. Facebook and Google know what they are doing, but they have protocols not to share all of this with others. Social science believes in pure knowledge; it does not believe in changing people's perspectives or manipulating knowledge.
VIII. CONCLUDING REMARKS
This paper has covered big data and its core concepts: how data is accumulated and increases year by year, and the need to store this big data for the welfare of analytics and humankind. It then introduced traditional systems and their handling of data; as storage needs changed, the concept of the Hadoop system emerged, using distributed storage and the MapReduce method. The paper also dealt with the security of big data, how data travels everywhere without the permission of users, and the challenges of big data in coming years. The overall concept of storing data and performing analytics for the betterment of users, retrieving results as fast as possible, is remarkable. I have also estimated the quantity of big data expected in further years. Working on big data has been a very good experience for me, because it opens a new level of thinking about big data storage and analysis.