Big Data & Hadoop
By: Mohit Shukla
Email: mohit.shukla@walkwel.in
Software Engineer
Big Data
❏ In today’s world we are surrounded by data.
❏ Whatever we do, we are storing data and processing data.
❏ So what is big data?
❏ Data that is beyond the storage capacity and beyond the processing
power of our systems is called big data.
❏ This data is generated from different sources such as sensors, CCTV cameras, social
networks like Facebook and Twitter, online shopping, airlines, etc.
❏ Together, these sources produce huge amounts of data.
Big Data
Of 100% of the data in the world:
~90% of it was generated in the last 2 years
~10% of it was generated from the beginning until then
Big Data
How systems have changed over the years
Yr: 1990
HDD capacity: 1 GB - 20 GB
RAM: 64 - 128 MB
Read speed: ~10 KB/s
Yr: 2017
HDD capacity: min 1 TB
RAM: 4 - 16 GB
Read speed: ~100 MB/s
Note: From 1990 to 2017, HDD capacity increased by several hundred times (1 GB to 1 TB is roughly
a 1000x jump), and the RAM and read speed of systems grew similarly.
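To get a feel for these numbers, here is a rough back-of-the-envelope calculation (the figures are the approximate ones from the slide, not measurements): even at 2017 read speeds, just scanning one full disk takes hours.

```python
# Rough, illustrative arithmetic: how long a sequential scan of an entire
# disk takes, using the approximate figures from the slide above.
def read_time_hours(disk_bytes, read_bytes_per_sec):
    """Seconds to read the whole disk once, converted to hours."""
    return disk_bytes / read_bytes_per_sec / 3600

TB = 10**12
MB = 10**6

# 2017-era machine: ~1 TB disk, read at ~100 MB/s
hours_2017 = read_time_hours(1 * TB, 100 * MB)
print(round(hours_2017, 1))  # -> 2.8 hours for a single full scan
```

So even before any processing happens, simply reading a modern disk end to end is a multi-hour job; this is the gap Hadoop's parallelism is aimed at.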
Big Data
Challenges
To understand the challenges, let's take an example.
Farmer
Farmer's field
Suppose in the 1st year he produces 10 rice packets
2nd year -> 20 rice packets
3rd year -> 1000 rice packets
Farmer's storage room (capacity: 800 rice packets)
20 rice packets
10 rice packets
Big Data
Challenges
● The problem is that the farmer has limited storage: if he produces 10 to 800 rice packets, he can keep them in his
storage room.
● What if he produces 1000 rice packets?
● Well, he can't store them in that room.
● It is similar with data: if we have enough storage capacity we can store it, but if we don't, then what?
● Then we have to store that data somewhere else.
● So in the farmer's case, if he produces 1000 rice packets he doesn't have a room big enough, so he has to go
to a godown or warehouse and store the rice packets there.
Big Data
Challenges
Farmer
Warehouse
Big Data
Challenges
A similar thing happens with data storage: if we don't have the storage capacity for huge data, we have to go to data
centers to store that data.
Data Centers
What are these data centers? They are providers with servers to store your data; they
may be IBM servers, EMC servers, or any other. We can store our data there, and whenever
we want to process it, we can fetch our data from these data centers to our local system and
then process it locally.
Big Data
Challenges
Why are we storing that data?
● Suppose I have a movie named “Titanic”; if I never want to watch that film, why should I store it on my system?
● Because maybe some other day I will watch that film.
● Likewise, if we never intend to process the data, why should we store it?
● We are storing it because maybe some other day we will process that data.
Big Data
Challenges
How can we process that data?
By writing some code in Java, MySQL, C#, or any other programming language.
Data center with 1000 TB of storage
I save 100 TB of my data there,
but I want to process only 2 TB
of that 100 TB.
To achieve this I write about
100 KB of code.
So which is the better way now:
sending this 100 KB file to the
data center and processing the
data there, or pulling 2 TB of
data down to our local system
and then processing it?
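The trade-off can be put into numbers. A minimal sketch, assuming a hypothetical 100 Mbit/s (12.5 MB/s) network link; the link speed is an assumption for illustration, not a figure from the slide:

```python
# Illustrative comparison: shipping a 100 KB program to the data center
# versus pulling 2 TB of data to the local machine, over an assumed
# 100 Mbit/s (12.5 MB/s) link.
LINK_BYTES_PER_SEC = 12.5 * 10**6  # assumed network bandwidth

def transfer_seconds(size_bytes):
    """Time to move size_bytes over the assumed link."""
    return size_bytes / LINK_BYTES_PER_SEC

code_secs = transfer_seconds(100 * 10**3)   # the 100 KB program
data_secs = transfer_seconds(2 * 10**12)    # the 2 TB of data

print(f"code: {code_secs:.3f} s")           # well under a second
print(f"data: {data_secs / 3600:.0f} h")    # around 44 hours
```

Under these assumed numbers, moving the program is millions of times cheaper than moving the data, which is exactly the intuition behind "send the computation to the data".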
Big Data
Before Hadoop
Obviously, sending the 100 KB file to the data center is better.
● Even though it is very easy to send a 100 KB file to the data center, before Hadoop we could not send our program to the data center.
● Why is that?
● Before Hadoop, the basic constraint was that computation was processor bound.
What does "computation is processor bound" mean?
● A computation is a program that you want to run on your data.
● So what exactly does it imply?
● Wherever you write the program, you have to fetch the data to that same system and process it there.
That was the only technique we had before Hadoop, which means we could not send our program to the data center to
process the data.
● All we could do was fetch the 2 TB of data to our local system and then process it with our computation program.
Big Data
Data Forms
The three Vs (Volume, Velocity, and Variety)
1. Volume: data sizes (GB, TB, petabytes) are rapidly increasing.
2. Velocity: this huge data arrives at high speed, which creates its own problems.
3. Variety: data comes in many different forms.
Note: Today we have RDBMSs for storing relational/structured data, so with that technique we can only store
and process structured data.
We can divide data into three forms:
● Structured data
● Unstructured data
● Semi-structured data
Of all the data we receive, roughly 70% - 80% is unstructured or semi-structured; the remaining 20% - 30% is
structured.
Big Data
Data Forms
Question: How are we generating this unstructured and semi-structured data? Is there an example?
Answer: Yes.
Unstructured data: Facebook videos, images, text messages, audio.
Semi-structured data: log files. Suppose I have 2-3 mail accounts (Gmail, Yahoo); most people in the world have such
accounts. Every account has its log files, stored on the Gmail/Yahoo servers, etc.
4 Gmail accounts * 5 opens a day = 20 log files generated (1 user)
2 Yahoo Mail accounts * 5 opens = 10 log files (1 user)
If I also have accounts like Facebook, Google+, and Instagram, then suppose I generate 70 log files in a day.
Now think about all the other people in the world. These log files hold a lot of data.
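As a small illustration of why log files count as semi-structured, here is a sketch that parses a made-up log line: part of each line follows a regular pattern (timestamp, level, key=value pairs) while the rest is free-form. The log format is invented for this example.

```python
# Semi-structured data in miniature: a log line with regular fields
# (timestamp, level) followed by loosely structured key=value text.
# The format is made up for illustration.
import re

line = "2017-03-01 09:15:02 INFO user=42 action=login"

# Extract the regular parts; whatever remains is treated as free text.
match = re.match(r"(\S+ \S+) (\w+) (.*)", line)
timestamp, level, rest = match.groups()

# Pull key=value pairs out of the free-text tail.
fields = dict(pair.split("=") for pair in rest.split() if "=" in pair)

print(timestamp)  # 2017-03-01 09:15:02
print(level)      # INFO
print(fields)     # {'user': '42', 'action': 'login'}
```

A relational table expects every row to fit one fixed schema; a log line like this has some structure, but not enough to load directly into an RDBMS, which is why it sits between structured and unstructured data.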
Big Data
Definition
What is big data, in other terms?
Question: Suppose we have a system with 500 GB capacity; if I want to store 300 GB, is it possible?
Answer: Yes.
Question: If I want to store 400 GB, is it possible?
Answer: Yes.
Question: If I want to store 600 GB, is it possible?
Answer: No.
If we cannot store the data, we cannot process it, so: “Data that is beyond the storage capacity and beyond the processing
power is called big data.”
Big Data
Hadoop
● So simply, suppose we have 500 GB of storage capacity, we are storing 300 GB of data, and we want to process it.
● The question is: how much time does that take?
● It will definitely take some time...
● Here Hadoop comes into the picture.
Suppose I want to build my house.
One worker
Takes 1 year to build
Big Data
Hadoop
3 workers
Take only a few months to build
So the point is: if we split the job among multiple people, we can complete the work in less time.
Big Data
Hadoop
● As data is rapidly increasing, suppose we receive 1 TB of data.
● Our storage capacity is now 2 TB.
● Then we can simply store it, but what if we want to process 700 GB of that data?
● It takes a long time, since we are using a single system.
● So what do we do?
● Instead of giving this 700 GB of data to one system,
● we simply divide the data and distribute it across different systems.
● Then every system works in parallel and the output comes back in less time.
● Sounds good.
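The divide-and-process idea above can be simulated on one machine, with worker processes standing in for separate systems. This is a minimal sketch of the principle, not how Hadoop actually distributes work:

```python
# Split a dataset into chunks and process the chunks in parallel worker
# processes -- a single-machine stand-in for "divide across systems".
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real processing: just count the records in this chunk.
    return len(chunk)

if __name__ == "__main__":
    records = list(range(700))      # pretend this is the 700 GB of data
    n_workers = 4
    size = len(records) // n_workers
    chunks = [records[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(process_chunk, chunks)  # chunks run in parallel

    print(sum(partials))  # 700 -- same answer, with the work shared out
```

Each worker handles only its own chunk, and the partial results are combined at the end; Hadoop applies the same split/combine pattern across many machines instead of many processes.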
Big Data
Hadoop (Dividing work)
We can understand it using an example.
Suppose I am in the office and my attendant has brought 100 files.
I can process 50 files in a day,
so 50 files are left pending.
The next day 100 more files arrive,
and I can still process only 50 files in a day,
so the backlog keeps growing: 150 files are now waiting.
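The arithmetic of the office example can be sketched as a tiny simulation, using the slide's own numbers (100 files arriving per day, 50 processed per day):

```python
# Office-files example as a loop: arrivals outpace one worker's capacity,
# so the pending pile grows every day instead of shrinking.
def backlog_after(days, arriving_per_day=100, capacity_per_day=50):
    pending = 0
    for _ in range(days):
        pending += arriving_per_day               # new files arrive
        pending -= min(pending, capacity_per_day) # one worker processes
    return pending

print(backlog_after(1))  # 50 files still pending after day 1
print(backlog_after(2))  # 100 pending after day 2 -- the pile keeps growing
```

With a single worker the backlog grows by 50 files every day forever; with two or more workers the daily capacity matches or exceeds the arrivals and the pile stops growing, which is the point of the next slide.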
Big Data
Hadoop (Dividing work)
So instead of allocating the work to a single person, we should allocate it to 2-3 people.
What are we trying to achieve here?
We are trying to achieve processing power: even when we have huge data, we have methods to process
that data in a feasible amount of time.
Hadoop: Hadoop knows very well how to store huge data and how to process that huge data in less time.
Hadoop
History of Hadoop
● We have all heard of Google (a search engine). What it does is store data, and when we search it gives
us the top results.
● Google worked for years on how to store huge data and how to process it,
● and in 2003 it published its ideas.
● It gave two things: one is GFS (Google File System) and the other is MapReduce.
● GFS is basically used for storing huge amounts of data.
● MapReduce is a technique by which we can process the data stored in GFS.
● But these two technologies existed only as white papers; Google did not release an implementation.
If not Google, then who?
● Yahoo!, at the time the largest web search engine after Google.
● Its engineers were also working on how to store and process huge data.
● They took Google's white papers, implemented them, and delivered:
● 2006-2007: HDFS (Hadoop Distributed File System)
● 2007-2008: MapReduce (the processing technique)
● These are the two core concepts in Hadoop (HDFS and MapReduce)
Hadoop
Definition
Who is the inventor of Hadoop? Doug Cutting.
“Hadoop is an open-source framework overseen by the Apache Software Foundation for storing huge
data sets and for processing huge data sets on a cluster of commodity hardware.”
HDFS: a technique for storing our huge data with the help of commodity hardware.
MapReduce: a technique for processing the data that is stored in HDFS.
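The MapReduce model can be illustrated with a toy word count on a single machine. Real Hadoop jobs are usually written in Java against the MapReduce API; this Python sketch only mimics the map → shuffle → reduce flow:

```python
# Toy single-machine sketch of the MapReduce model: map emits (key, value)
# pairs, shuffle groups them by key, reduce combines each group.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all the values emitted for one key.
    return word, sum(counts)

lines = ["big data big hadoop", "hadoop stores big data"]

# Shuffle: group all mapped pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In a real cluster, the map calls run on the machines where HDFS holds each block of input, and only the small (key, value) pairs travel over the network to the reducers: this is how Hadoop sends the computation to the data rather than the data to the computation.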
Thank You !
