GADL Journal of Recent Innovations in Engineering and Technology (JRIET)
Volume 1 – Issue 1, January, 2016
A Survey on Managing of Heavily Big Data Through Hadoop Ecosystems
Neha Trivedi
Department of Computer Science & Engineering
Jayoti Vidhyapeeth Women’s University, Jaipur, India
Abstract: Around the year 2000 a new technology arrived with a great deal of buzz: big data, and it has grown rapidly in industry. This paper tries to convey the speed at which data is growing and the various attributes involved in managing such huge data. Data grows more swiftly every year; reports say that each year data roughly doubles over the previous year. Billions of people use the internet, a very large number of them have at least one Gmail account, and on average they access it about four times a day, which creates gigantic data to handle. This paper is therefore mainly concerned with how we administer this data for storage and run analytics on top of it to obtain efficient, logical results. From 2009 to 2015 data grew by more than 50%, and data increasing this fast gives enormous results. The paper also deals with the various challenges in security, speed, and volume: it covers the processing of data, the security challenges of big data, and the Hadoop framework and its working criteria for storing, accessing, and analyzing this vast data.
Index Terms— analytics, data, security, volume
I. INTRODUCTION
It starts with a progression. In the early years, data was generated and accumulated by workers: employees of a company entered data into computers, and that data was managed by an RDBMS. Then things moved to the internet, and end users began generating their own data, e.g. on Facebook and LinkedIn, where they contribute content themselves. In terms of volume this is much larger than the first stage (RDBMS): the data users generate is larger by orders of magnitude, so all of a sudden the data being accumulated is much greater than what traditional systems held. The third level of the progression is machines accumulating data on their own: monitors, electricity meters, and satellites that continuously orbit the earth, taking pictures and recording humidity and temperature 24 hours a day. Machine-generated data is therefore larger still than end-user data.
1. Employees generating data (RDBMS).
2. Users generating data from social sites, mail, and accounts.
3. Machines generating colossal amounts of data, e.g. satellites and monitors.
Machine > user > employees
Big data is generated, for example, by:
1. Online shopping
2. Facebook
3. Airlines
4. Hospitality data
Figure 1: Big data life cycle for the accumulation of data
II. HISTORY OF BIG DATA
Around 2002, Google was facing the problem of stockpiling large amounts of data and analyzing it; data had been increasing exponentially since the 1990s, and handling it manually through employees did not scale. Google therefore went back to fundamentals and used the concept of a distributed file system to save the data, describing the solution in a white paper on the Google File System (GFS). Doug Cutting (the inventor of the Lucene search engine) worked on implementing that white paper. The GFS paper appeared around 2002-04, and in 2004 the MapReduce concept followed. Yahoo's engine was facing the same problem as Google, and Yahoo hired Doug Cutting to complete the implementation of the white paper. Google had solved the problem of storing bulk data by applying a distributed file system, but the problem was not yet finished: Google also wanted to run analytics on top of that data, so it wrote another white paper, on the Google MapReduce algorithm, to solve the problem of data analysis.
The combination of the two (the Google File System and MapReduce) came to be called Hadoop. Hadoop is basically a tool for solving large-data problems logically, and those are its two core components. Hadoop is used to:
1. Save large files and data (distributed file system)
2. Perform analytics (MapReduce algorithm)
Sources of generating Big Data
An Airbus generates about 10 TB every 30 minutes, and about 640 TB is generated in one flight. A smart meter that reads usage every 15 minutes records about 350 billion transactions in a year; in 2009 there were 76 million smart meters, and by 2017 there will be more than 250 million. More than 2 billion people use the internet, and for 2014 Cisco estimates internet traffic at 4.8 ZB per year. There are 200 million blog entries on the web, and 300 billion mails are sent every day. Companies such as Facebook, Twitter, and LinkedIn generate data of approximately 25 TB daily; Twitter alone generates about 12 TB daily, with 200 million users producing 230 million tweets a day, around 97,000 tweets per second. 2.9 billion hours of video are watched per month. Big data also comes from the trading sector, which produces about 1 TB per day. In 2009 the total data was estimated at 1 ZB, and by 2020 it will reach 35 ZB.
Why big data matters
Right now we are living in a data world; we talk about data all the time, so the important questions are how to store the data and how to process it. Data that is beyond our storing capacity and processing power is big data. Data is being created and gathered everywhere, with maximum possible availability, so that analytics can be performed on it. Big data matters because analytics on the data yields proper results for future use, and it takes on a different perspective in different domains, from business to healthcare to science departments to social networks. From a business perspective, many e-commerce sites collect data on what customers previously viewed and use it to influence them to buy their products. In healthcare, we can keep a record of every disease so that we know better what progress we have made in solving healthcare problems.
Anticipating and influencing
Big data plays a very important role in anticipating and, to some extent, influencing human activity. Hadoop-based software tracks the information a person generates online and shows it back to the user at intervals to encourage a purchase; this is a key strategy many companies use to sell their products. For example, on Amazon I search for a book on “DBMS”, look at many books, and do not purchase anything; some time later, when I log in from another device, I find the same results shown to me again. This is how the online market sells its products. For a sense of scale, 400 terabytes would hold a digital library of all books ever written in any language, and 200 yottabytes would hold a holographic snapshot of the earth’s surface.
Processing of data
A. Traditional system
In the traditional system an RDBMS is used: in the early stages, data is brought over to a processor (a CPU, a computer chip) to be processed. Now that there is a bulk of data, we cannot process it in that way. The solution is to use multiple processors and bring the processor to the data: a whole row of servers, where each server holds some small component of the data and each one is processed in parallel. That is called parallel processing. “Before, data was brought to the processor; now the processor is brought to the data.” In other words, before, data was brought to one CPU; now it is spread across a very large number of CPUs. Because data grows exponentially, processing had to be taken to this next level for processing and accessing such bulky data; this is the technological shift. The framework that brings these two techniques to big data is called Hadoop.
Hadoop (open-source platform, free to use)
Hadoop is the software that makes parallel processing and MapReduce happen. It is very strenuous to open a file of many gigabytes on a single machine; it takes a prolonged time. We face this problem with desktop company data and hospital data when analyzing it, and big-data files start at terabytes, so accessing them takes a huge amount of time. Hadoop therefore came into the trade for managing terabytes of data. Hadoop provides an underlying distributed file system and an analytical algorithm on top of it. Around 2004-05, while Google was working on its white papers and Yahoo was implementing the MapReduce algorithm, IBM also came into the frame and defined the concept of big data along four dimensions:
1. Velocity (the speed at which data is generated)
2. Variety (the different types of data: audio, video, text)
3. Volume (the size of the huge data)
4. Veracity (whether the generated data is useful or not for future use)
B. Traditional database storage of data
A traditional system has a fixed schema and contains only a limited class of functions based on the RDBMS; it provides only a minimal set of operations for dealing with data. In the early 2000s (around 2004-05) social sites came onto the market, and the data they generate is more and more unstructured and cannot be handled by an RDBMS, for two reasons:
1. It does not have enough space to store such heavily bulky data (structured data is only about 10% of the total).
2. The data is unstructured, containing text, images, audio, video, and graphs (about 90% of the total).
B(i). Limitations of existing traditional solutions
1. Tables have a fixed schema.
2. High cost.
3. Cannot save very large files (access time is higher).
4. Analytical issues.
III. HADOOP AND ITS COMPONENTS
A simple definition of Hadoop is: a combination of a distributed file system and an analytical algorithm. The beauty of this program is that it is an open-source data management system with no cost associated with it.
A. Ecosystem of Hadoop
An ecosystem refers to the areas of a field that are required for a particular thing. In the case of Hadoop, the ecosystem is composed of HDFS, MapReduce, and tools and software packages such as Pig and Hive.
Components of Hadoop
(i). Pig: Pig is the tool for loading files into the system and for manipulating and storing them. The name itself explains the tool: just as a pig can eat anything, the Pig Latin language is defined so that every type of data can be used for storing and transferring. Pig consists of two things:
Pig Latin (the language)
An execution engine (the runtime environment)
(ii). Hive: a data warehouse usually deals with a small portion of the data, but Hive covers the entire population of the data. Hive provides its own query language (HiveQL) and is compiled into UDFs and MapReduce jobs; a minimal Java sketch of querying Hive appears at the end of this list of components.
(iii). Pig Latin data analysis: an abstraction over MapReduce and the HDFS data structures.
(iv). Mahout engine: an artificial-intelligence engine that works on human data. For example, a user searches for some kind of product on an e-commerce website in a browser and then closes the browser; some time later they open the browser again to do something different, and it again shows results for the previously searched items. Mahout, in this sense, is an engine for understanding human needs and prompting the user to take that product; it is a strategy applied by the e-commerce industry to influence users to buy the product. More concretely, Mahout is a library that works on the data a client has accessed previously: it keeps a record for each user of what the client did before and uses that data for further recommendations. Hive is a platform in the Hadoop ecosystem, and together with Pig and Mahout these are the three components that provide a layer of abstraction over the MapReduce algorithm.
Apache Oozie maintains the schedule of all the jobs running on the system: it helps to start jobs and stop jobs in the Hadoop world, and it schedules Hive and Mahout jobs as soon as possible. The ecosystem also contains two ingestion frameworks:
1. Sqoop (structured data, i.e. an RDBMS)
2. Flume (unstructured data, e.g. Facebook, LinkedIn)
Sqoop provides an interesting and very important facility: it takes data from an RDBMS and puts it into HDFS storage, and vice versa, giving a very effortless and manageable interface between the RDBMS and HDFS. Flume handles unstructured data coming into the system, moving it into HDFS and from HDFS to other places.
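As a brief illustration of how Hive in point (ii) above could be used from application code, the following Java sketch connects to HiveServer2 over JDBC and runs a HiveQL aggregation. It is only a sketch: it assumes the hive-jdbc driver is on the classpath, and the server address and the web_logs table are hypothetical placeholders, not anything defined in this paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port and database are placeholders.
    String url = "jdbc:hive2://hive-server.example.com:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "demo", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL; behind the scenes Hive compiles it into
      // MapReduce jobs that run over files stored in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");

      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

The point of the sketch is that the analyst writes declarative HiveQL while Hive takes care of turning it into jobs over the distributed data.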
Hadoop works on two things: the distributed file system and the MapReduce system. Hadoop has two main parts, the NameNode and the JobTracker. The NameNode handles storing, managing, and accessing the big data files and belongs to the distributed file system; the JobTracker handles the analytics on the data and belongs to MapReduce. The NameNode is the master node and the DataNode is a slave node that physically allocates the storage and is directly coordinated with the NameNode; the NameNode holds all the metadata (log files) generated with the help of the DataNodes. In the same way, the JobTracker is the master and the TaskTracker is the slave: the main work of the JobTracker is to give instructions to the TaskTrackers for processing the data. The JobTracker is not itself aware of the addresses where the data is saved, so it sends a request to the NameNode for the metadata and thereby gets the addresses needed to access the data through the TaskTrackers.
Main functions of Hadoop
1. Helps to save the entire file.
2. The MapReduce framework does analytics on top of the entire set of files.
3. Takes very little time to do the analytics.
4. Scale-out architecture: the processing time can be adjusted by adding or removing nodes.
Hadoop works on the master-slave method: the NameNode, DataNodes, and secondary NameNode are for the storage of data, and the JobTracker and TaskTrackers are for processing the data. Hadoop is the best solution for big data because it knows very well how to store big data and how to process it. Hadoop is:
1. Easy to use: it consists of many features and tools for maintaining the software and hardware.
2. Scalable: large data stores and memory management are easily brought together.
3. Distributed: it works on parallel processing across an enormous number of servers, taking as little time as possible.
Hadoop is an open-source framework for large-scale data processing, written in Java; as Java is platform independent, so is Hadoop. Its file model is “write once, read many, but not changeable”. Hadoop can process data fast, and its key advantage is its massive scalability.
B. HDFS Architecture
The Hadoop Distributed File System is the file-system component of the Hadoop framework. HDFS is for storing and accessing large data for fast retrieval. For storage we have the NameNode, the DataNodes, and the secondary NameNode (as the name suggests, a backup of the NameNode’s state); for analytics we have the JobTracker and TaskTrackers. Within the Hadoop distributed file system we deal only with the storage of data, which is handled by the NameNode and the DataNodes.
The HDFS architecture works on master-slave terminology, in which one process is the master and the other is the slave: the NameNode is the master and the DataNode is the slave. First, the NameNode.
1. NameNode: the NameNode stores the metadata (data about data) of the files held by the DataNodes: file modifications, file permissions, and access times. Through the namespace tree the NameNode is directly connected with the DataNodes; it holds all the information about all the data saved in the system, while the actual storing of the data is done by the DataNodes. A client sends a request to save a file; a DataNode saves the file at the nearest location, and the NameNode takes all the information about the file (its type, size, address, and space) and creates the metadata. The NameNode gives instructions to the DataNodes for every storage and access operation on the data. In this way the NameNode becomes the master and the DataNodes become the slaves.
2. DataNode: the DataNode is the slave to the master node. DataNodes work at ground level to store the data accurately, and a DataNode stores the data in more than one place as a backup in case of any mishap or data loss. A DataNode connects to the NameNode and checks its id and address; if they do not match, the DataNode is shut down. At particular intervals the DataNode sends the NameNode an acknowledgment (a block report) of the data accumulated on that node, so the NameNode always has a view of which part of the cluster each piece of data is stored on. The DataNode also sends a heartbeat to the NameNode at short, regular intervals so the NameNode knows the DataNode is alive (operating properly); the heartbeat is a signal to inform the NameNode. If the NameNode stops receiving heartbeats, it assumes the DataNode is dead and creates duplicate copies on other nodes. The NameNode receives many block reports and heartbeats every minute, and the HDFS architecture manages all of this without any conflict in the operation of the NameNode. The default size of one block in HDFS is 64 MB (128 MB in later versions).
Characteristics of the architecture
1. Consists of library files and modules.
2. Stores the data on commodity machines (cheap hardware).
A minimal client-side sketch of the write and read path described above is shown below.
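The following Java sketch uses the standard Hadoop FileSystem API to write a file into HDFS and read it back; the NameNode address and the file path are hypothetical placeholders, and the NameNode/DataNode interaction described above is hidden behind the API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS is normally picked up from core-site.xml on the
    // classpath; the address below is only a placeholder.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

    FileSystem fs = FileSystem.get(conf);

    // Write: the client asks the NameNode where to put the blocks,
    // then streams the bytes to the chosen DataNodes.
    Path file = new Path("/user/demo/sample.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode returns the block locations and the client
    // reads the blocks directly from the DataNodes that hold them.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}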
C. MapReduce Framework
The MapReduce algorithm processes the data and produces some intelligent output from it. The data is stored in clusters so that it can be processed in parallel across the hardware. MapReduce consists of the JobTracker and the TaskTrackers, following the Hadoop master-slave layout:
Master node:
*NameNode (HDFS layer)
*JobTracker (MapReduce)
Slave node:
*DataNode (HDFS layer)
*TaskTracker (MapReduce)
So in MapReduce we talk about the JobTracker and the TaskTracker: the JobTracker schedules and arranges the jobs requested by users, and the TaskTrackers execute the jobs as directed by the JobTracker (the master).
Processing of analytical data
For example, a user sends a request for some analytics. The request first goes to the JobTracker, but the JobTracker does not have the addresses of the data, so it contacts the NameNode, which holds the complete metadata about where the data is stored. Once the addresses have been fetched by the JobTracker, it gives instructions to the TaskTrackers to access and analyze the data according to the user’s request, and then the result is shown.
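To make the map and reduce phases concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a sketch rather than anything described in this paper: the mapper emits (word, 1) pairs from its input split, the reducer sums the counts for each word, and the input and output HDFS directories are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on every split of the input file,
  // close to where HDFS stores that block ("processor brought to data").
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reduce phase: aggregates all the counts for one word into a total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);          // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework schedules one map task per input split (typically one HDFS block) on the TaskTrackers, which is exactly the “processor brought to the data” idea described earlier.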
IV. DEMAND OF BIG DATA
Big data is about faster and more intelligent decisions. Big data is characterized by the 3 Vs of a big data set, velocity, volume, and variety; big data does not consist of any specific volume of data, the 3 Vs simply describe the data on which it is based.
1. Volume: data frequently increases at an exponential rate, and volume refers to the size of the data; this is what the Hadoop framework on which big data runs has to manage.
2. Velocity: velocity refers to the rate at which data arrives; every field is confronted by data velocity, with data coming in large numbers.
3. Variety: variety describes which types of data are present. In today’s environment, users and machines generate different types of data daily according to their requirements, such as posts, images, audio, and video, so the data cannot be put into only one category. Structured data has a proper row pattern and schema, while unstructured data contains arbitrary images, text, audio, and video for which we cannot create a schema. Facebook and Twitter, for example, generate on the order of 25 and 15 terabytes of unstructured data daily, respectively.
Variability: beyond velocity and variety, variability refers to the uncertainty in how data is produced. There is no fixed percentage of unstructured data that will be produced; it changes every day, which makes it hard to manage. For example, in seasonal periods more and more people upload images and transfer videos and text; that fluctuation is what variability captures.
Hadoop goes well beyond an RDBMS in that it can:
1. Access complex data
2. Manage structured and unstructured data
3. Apply distributed parallel processing for fast access
4. Run machine learning algorithms
5. Be used as free, open-source software
6. Perform near real-time analytics of data using the MapReduce algorithm
7. Give intelligent results for the problem
Hadoop is based on distributed parallel processing and the MapReduce algorithm; these two are the core constituents of Hadoop. The complete framework essentially depends on the distributed file system for storing data, with parallel processing applied for fast retrieval, and on MapReduce for generating intelligent results by applying analytics to the data. Two or three years ago Hadoop was implemented only at very large companies such as Google and Yahoo, search engines that store huge amounts of data; nowadays every field of computer science, including healthcare, neural technology, and sensor networks, uses the Hadoop framework for managing its data.
Hadoop works as a client/server application that accesses the data stored on servers: the client sends a request to the server and the server responds to it. Behind this client-server relationship, however, many small processes interact of which the non-technical user is not aware: how the data is stored, how it is retrieved so quickly, and what the interesting processes are that run behind the scenes to access the data accurately without anything misleading. All of those answers lie in the HDFS architecture of a Hadoop-based system.
V. QUANTITY OF BIG DATA
Today’s world is an environment of ever-growing data, and we have seen that data grows exponentially. Here we estimate the data that will be stored over the next 50 years, under the assumption that the data accumulated in a year grows as $e^x$.

Assume the data generated by employees in healthcare is $l_1$, the data generated by social sites is $m_1$, the data generated by satellites is $o_1$, and so on for the other fields. Then for the year 2010,

$l_1 + m_1 + o_1 + \dots = x$,

and since data grows exponentially, the data for 2010 is $e^x$. Likewise, for the year 2011,

$l_2 + m_2 + o_2 + \dots = y$,

so the data for 2011 is $e^y$, and for the year 2012,

$l_3 + m_3 + o_3 + \dots = z$,

so the data for 2012 is $e^z$. Continuing this for many years (assume the next 50), and letting the corresponding quantity for the final year be $n_1$, the total data over the next 50 years is

$\mathrm{Total} = e^x + e^y + e^z + \dots + e^{n_1}$.   (Equation 1)

Expanding $e^x$ as a series,

$e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots = \sum_n \frac{x^n}{n!}$,

and the same series holds for $e^y$, $e^z$, and so on. Substituting these series into Equation 1 gives

$\mathrm{Total} = \sum_n \frac{1}{n!}\left(x^n + y^n + z^n + \dots + n_1^{\,n}\right)$.

Every year’s data is greater than the previous year’s, so when the number of years is large, $x, y, z, \dots$ are all very small compared with $n_1$ and their terms can be neglected. The result after 50 years is therefore approximately

$\mathrm{Total\ data\ accumulated} \approx \sum_n \frac{n_1^{\,n}}{n!} = e^{n_1}$.
VI. BIG DATA SECURITY
Although the new technology is excellent for storing, acquiring, and analyzing data, there are also certain other aspects to consider. Search engines, social sites, and healthcare providers hold every type of
data we have ever produced, so others can know about our personal surfing habits, which comes under unauthorized access.
Particular measures are required to overcome this security issue:
1. Proper understanding and responsibility on the company’s side not to share private data without the user’s permission
2. Data transport security
3. Government policies and protocols for accessing data
4. Control given to the user over whom data is shared with
5. Cryptographic techniques applied to secure the data (a minimal sketch follows this list)
6. Authentication and authorization of users
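As a minimal, hedged sketch of point 5, the following Java snippet encrypts and decrypts a record with AES-GCM using the standard javax.crypto API. The key handling is deliberately simplified for illustration; in practice the key would come from a key-management service rather than being generated in place.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class RecordEncryptionSketch {
  public static void main(String[] args) throws Exception {
    // Generate a 256-bit AES key (illustrative only).
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(256);
    SecretKey key = keyGen.generateKey();

    byte[] plaintext = "patient record 42".getBytes(StandardCharsets.UTF_8);

    // AES-GCM needs a fresh 12-byte IV for every message.
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);

    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] ciphertext = cipher.doFinal(plaintext);

    // Decrypt with the same key and IV to verify the round trip.
    cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
    System.out.println(new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8));
  }
}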
A further question is how to convert volume into value, because not everything we store is necessarily beneficial for future use; much of the data is simply a waste of storage. To extract efficient value from it, new software is being built around big data that focuses more on MapReduce and on increasing data locality in order to increase performance.
VII. CHALLENGES OF BIG DATA
One issue is that the data is held by companies such as Facebook, and they have privacy concerns; you cannot ask Google for information about what people mostly search for. A second issue is that, in the past, social science was data-poor; as the quantity and quality of data become more mathematical, we need much more commanding storage capacity, and it is very demanding to take care of all this data. So why is big data driving social media so fast? The answer is that data storage costs are lower. Facebook and Google know what they are doing, but they have protocols not to share all of this with others; social science believes in pure knowledge and does not believe in changing people’s perspectives or manipulating knowledge.
VIII. CONCLUDING REMARKS
This paper has covered big data and its core concepts: how data is accumulated and increases year by year, and the need to store this big data for the benefit of analytics and humankind. We then introduced traditional systems and how they work on data; as things changed and storage needs grew, the concept of the Hadoop system arose, using the distributed file system and the MapReduce method. The paper also dealt with the security of big data, how data travels everywhere without the users’ permission, and the challenges of big data for the coming years. The overall idea of storing the data and performing analytics for the benefit of users, with results retrieved as fast as possible, is remarkable. I also reviewed the quantity of big data expected in the coming years. Working on big data has been a good experience for me, because it opens a new level of thinking about storing and analyzing big data.
IX. REFERENCES
1. Zulkernine, F., Martin, P., Ying Zou, Bauer, M., Gwadry-Sridhar, F., Aboulnaga, A. 2013. Towards Cloud-Based Analytics-as-a-Service (CLAaaS) for Big Data Analytics in the Cloud. Big Data (BigData Congress), IEEE International Congress.
2. Weiyi Shang, Zhen Ming Jiang, Hemmati, H., Adams, B., Hassan, A.E., Martin, P. 2013. Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds. Software Engineering (ICSE), 35th International Conference.
3. Mukherjee, A., Datta, J., Jorapur, R., Singhvi, R., Haloi, S., Akram, W. 2012. Shared disk big data analytics with Apache Hadoop. High Performance Computing (HiPC), 19th International Conference.
4. hipi.cs.virginia.edu, 2014.