Big Data & Hadoop
By: Mohit Shukla
Email: mohit.shukla@walkwel.in
Software Engineer
Big Data
❏ In today’s world we are surrounded by data.
❏ Whatever we do, we are storing data and processing data.
❏ So what is big data?
❏ Data that is beyond the storage capacity and beyond the processing
power of our systems is called big data.
❏ This data is generated from different sources such as sensors, CCTV cameras, social
networks like Facebook and Twitter, online shopping, airlines, etc.
❏ Together, these sources produce huge amounts of data.
Big Data
Of 100% of the data in the world:
~90% of it was generated in the last 2 years
~10% of it was generated from the beginning until then
Big Data
How systems have changed over the years
Yr: 1990
HDD capacity: 1 GB - 20 GB
RAM: 64 - 128 MB
Read speed: ~10 KB/s
Yr: 2017
HDD capacity: min 1 TB
RAM: 4 - 16 GB
Read speed: ~100 MB/s
Note: From 1990 to 2017, HDD capacity increased by several hundred times (1 GB to 1 TB is roughly
a 1000x jump), and the RAM and read speed of systems grew similarly.
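To get a feel for these numbers, here is a rough back-of-the-envelope calculation (the figures are the approximate ones from the slide, not measurements): even at 2017 read speeds, just scanning one full disk takes hours.

```python
# Rough, illustrative arithmetic: how long a sequential scan of an entire
# disk takes, using the approximate figures from the slide above.
def read_time_hours(disk_bytes, read_bytes_per_sec):
    """Seconds to read the whole disk once, converted to hours."""
    return disk_bytes / read_bytes_per_sec / 3600

TB = 10**12
MB = 10**6

# 2017-era machine: ~1 TB disk, read at ~100 MB/s
hours_2017 = read_time_hours(1 * TB, 100 * MB)
print(round(hours_2017, 1))  # -> 2.8 hours for a single full scan
```

So even before any processing happens, simply reading a modern disk end to end is a multi-hour job; this is the gap Hadoop's parallelism is aimed at.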
Big Data
Challenges
To understand the challenges, let's take an example.
Farmer
Farmer's field
Suppose in the 1st year he produces 10 rice packets
2nd year -> 20 rice packets
3rd year -> 1000 rice packets
Farmer's storage room (capacity: 800 rice packets)
20 rice packets
10 rice packets
Big Data
Challenges
● The problem is that the farmer has limited storage: if he produces 10 to 800 rice packets, he can keep them in his
storage room.
● What if he produces 1000 rice packets?
● Well, he can't store them in that room.
● It is similar with data: if we have enough storage capacity we can store it, but if we don't, then what?
● Then we have to store that data somewhere else.
● So in the farmer's case, if he produces 1000 rice packets he doesn't have a room big enough, so he has to go
to a godown or warehouse and store the rice packets there.
Big Data
Challenges
Farmer
Warehouse
Big Data
Challenges
A similar thing happens with data storage: if we don't have the storage capacity for huge data, we have to go to data
centers to store that data.
Data Centers
What are these data centers? They are providers with servers to store your data; they
may be IBM servers, EMC servers, or any other. We can store our data there, and whenever
we want to process it, we can fetch our data from these data centers to our local system and
then process it locally.
Big Data
Challenges
Why are we storing that data?
● Suppose I have a movie named “Titanic”; if I never want to watch that film, why should I store it on my system?
● Because maybe some other day I will watch that film.
● Likewise, if we never intend to process the data, why should we store it?
● We are storing it because maybe some other day we will process that data.
Big Data
Challenges
How can we process that data?
By writing some code in Java, MySQL, C#, or any other programming language.
Data center with 1000 TB of storage
I save 100 TB of my data there,
but I want to process only 2 TB
of that 100 TB.
To achieve this I write about
100 KB of code.
So which is the better way now:
sending this 100 KB file to the
data center and processing the
data there, or pulling 2 TB of
data down to our local system
and then processing it?
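The trade-off can be put into numbers. A minimal sketch, assuming a hypothetical 100 Mbit/s (12.5 MB/s) network link; the link speed is an assumption for illustration, not a figure from the slide:

```python
# Illustrative comparison: shipping a 100 KB program to the data center
# versus pulling 2 TB of data to the local machine, over an assumed
# 100 Mbit/s (12.5 MB/s) link.
LINK_BYTES_PER_SEC = 12.5 * 10**6  # assumed network bandwidth

def transfer_seconds(size_bytes):
    """Time to move size_bytes over the assumed link."""
    return size_bytes / LINK_BYTES_PER_SEC

code_secs = transfer_seconds(100 * 10**3)   # the 100 KB program
data_secs = transfer_seconds(2 * 10**12)    # the 2 TB of data

print(f"code: {code_secs:.3f} s")           # well under a second
print(f"data: {data_secs / 3600:.0f} h")    # around 44 hours
```

Under these assumed numbers, moving the program is millions of times cheaper than moving the data, which is exactly the intuition behind "send the computation to the data".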
Big Data
Before Hadoop
Obviously, sending the 100 KB file to the data center is better.
● Even though it is very easy to send a 100 KB file to the data center, before Hadoop we could not send our program to the data center.
● Why is that?
● Before Hadoop, the basic constraint was that computation was processor bound.
What does "computation is processor bound" mean?
● A computation is a program that you want to run on your data.
● So what exactly does it imply?
● Wherever you write the program, you have to fetch the data to that same system and process it there.
That was the only technique we had before Hadoop, which means we could not send our program to the data center to
process the data.
● All we could do was fetch the 2 TB of data to our local system and then process it with our computation program.
Big Data
Data Forms
The three Vs (Volume, Velocity, and Variety)
1. Volume: data sizes (GB, TB, petabytes) are rapidly increasing.
2. Velocity: this huge data arrives at high speed, which creates its own problems.
3. Variety: data comes in many different forms.
Note: Today we have RDBMSs for storing relational/structured data, so with that technique we can only store
and process structured data.
We can divide data into three forms:
● Structured data
● Unstructured data
● Semi-structured data
Of all the data we receive, roughly 70% - 80% is unstructured or semi-structured; the remaining 20% - 30% is
structured.
Big Data
Data Forms
Question: How are we generating this unstructured and semi-structured data? Is there an example?
Answer: Yes.
Unstructured data: Facebook videos, images, text messages, audio.
Semi-structured data: log files. Suppose I have 2-3 mail accounts (Gmail, Yahoo); most people in the world have such
accounts. Every account has its log files, stored on the Gmail/Yahoo servers, etc.
4 Gmail accounts * 5 opens a day = 20 log files generated (1 user)
2 Yahoo Mail accounts * 5 opens = 10 log files (1 user)
If I also have accounts like Facebook, Google+, and Instagram, then suppose I generate 70 log files in a day.
Now think about all the other people in the world. These log files hold a lot of data.
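As a small illustration of why log files count as semi-structured, here is a sketch that parses a made-up log line: part of each line follows a regular pattern (timestamp, level, key=value pairs) while the rest is free-form. The log format is invented for this example.

```python
# Semi-structured data in miniature: a log line with regular fields
# (timestamp, level) followed by loosely structured key=value text.
# The format is made up for illustration.
import re

line = "2017-03-01 09:15:02 INFO user=42 action=login"

# Extract the regular parts; whatever remains is treated as free text.
match = re.match(r"(\S+ \S+) (\w+) (.*)", line)
timestamp, level, rest = match.groups()

# Pull key=value pairs out of the free-text tail.
fields = dict(pair.split("=") for pair in rest.split() if "=" in pair)

print(timestamp)  # 2017-03-01 09:15:02
print(level)      # INFO
print(fields)     # {'user': '42', 'action': 'login'}
```

A relational table expects every row to fit one fixed schema; a log line like this has some structure, but not enough to load directly into an RDBMS, which is why it sits between structured and unstructured data.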
Big Data
Definition
What is big data, in other terms?
Question: Suppose we have a system with 500 GB capacity; if I want to store 300 GB, is it possible?
Answer: Yes.
Question: If I want to store 400 GB, is it possible?
Answer: Yes.
Question: If I want to store 600 GB, is it possible?
Answer: No.
If we cannot store the data, we cannot process it, so: “Data that is beyond the storage capacity and beyond the processing
power is called big data.”
Big Data
Hadoop
● So simply, suppose we have 500 GB of storage capacity, we are storing 300 GB of data, and we want to process it.
● The question is: how much time does that take?
● It will definitely take some time...
● Here Hadoop comes into the picture.
Suppose I want to build my house.
One worker
Takes 1 year to build
Big Data
Hadoop
3 workers
Take only a few months to build
So the point is: if we split the job among multiple people, we can complete the work in less time.
Big Data
Hadoop
● As data is rapidly increasing, suppose we receive 1 TB of data.
● Our storage capacity is now 2 TB.
● Then we can simply store it, but what if we want to process 700 GB of that data?
● It takes a long time, since we are using a single system.
● So what do we do?
● Instead of giving this 700 GB of data to one system,
● we simply divide the data and distribute it across different systems.
● Then every system works in parallel and the output comes back in less time.
● Sounds good.
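The divide-and-process idea above can be simulated on one machine, with worker processes standing in for separate systems. This is a minimal sketch of the principle, not how Hadoop actually distributes work:

```python
# Split a dataset into chunks and process the chunks in parallel worker
# processes -- a single-machine stand-in for "divide across systems".
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real processing: just count the records in this chunk.
    return len(chunk)

if __name__ == "__main__":
    records = list(range(700))      # pretend this is the 700 GB of data
    n_workers = 4
    size = len(records) // n_workers
    chunks = [records[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(process_chunk, chunks)  # chunks run in parallel

    print(sum(partials))  # 700 -- same answer, with the work shared out
```

Each worker handles only its own chunk, and the partial results are combined at the end; Hadoop applies the same split/combine pattern across many machines instead of many processes.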
Big Data
Hadoop (Dividing work)
We can understand it using an example.
Suppose I am in the office and my attendant has brought 100 files.
I can process 50 files in a day,
so 50 files are left pending.
The next day 100 more files arrive,
and I can still process only 50 files in a day,
so the backlog keeps growing: 150 files are now waiting.
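The arithmetic of the office example can be sketched as a tiny simulation, using the slide's own numbers (100 files arriving per day, 50 processed per day):

```python
# Office-files example as a loop: arrivals outpace one worker's capacity,
# so the pending pile grows every day instead of shrinking.
def backlog_after(days, arriving_per_day=100, capacity_per_day=50):
    pending = 0
    for _ in range(days):
        pending += arriving_per_day               # new files arrive
        pending -= min(pending, capacity_per_day) # one worker processes
    return pending

print(backlog_after(1))  # 50 files still pending after day 1
print(backlog_after(2))  # 100 pending after day 2 -- the pile keeps growing
```

With a single worker the backlog grows by 50 files every day forever; with two or more workers the daily capacity matches or exceeds the arrivals and the pile stops growing, which is the point of the next slide.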
Big Data
Hadoop (Dividing work)
So instead of allocating the work to a single person, we should allocate it to 2-3 people.
What are we trying to achieve here?
We are trying to achieve processing power: even when we have huge data, we have methods to process
that data in a feasible amount of time.
Hadoop: Hadoop knows very well how to store huge data and how to process that huge data in less time.
Hadoop
History of Hadoop
● We have all heard of Google (a search engine). What it does is store data, and when we search it gives
us the top results.
● Google worked for years on how to store huge data and how to process it,
● and in 2003 it published its ideas.
● It gave two things: one is GFS (Google File System) and the other is MapReduce.
● GFS is basically used for storing huge amounts of data.
● MapReduce is a technique by which we can process the data stored in GFS.
● But these two technologies existed only as white papers; Google did not release an implementation.
If not Google, then who?
● Yahoo!, at the time the largest web search engine after Google.
● Its engineers were also working on how to store and process huge data.
● They took Google's white papers, implemented them, and delivered:
● 2006-2007: HDFS (Hadoop Distributed File System)
● 2007-2008: MapReduce (the processing technique)
● These are the two core concepts in Hadoop (HDFS and MapReduce)
Hadoop
Definition
Who is the inventor of Hadoop? Doug Cutting.
“Hadoop is an open-source framework overseen by the Apache Software Foundation for storing huge
data sets and for processing huge data sets on a cluster of commodity hardware.”
HDFS: a technique for storing our huge data with the help of commodity hardware.
MapReduce: a technique for processing the data that is stored in HDFS.
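The MapReduce model can be illustrated with a toy word count on a single machine. Real Hadoop jobs are usually written in Java against the MapReduce API; this Python sketch only mimics the map → shuffle → reduce flow:

```python
# Toy single-machine sketch of the MapReduce model: map emits (key, value)
# pairs, shuffle groups them by key, reduce combines each group.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all the values emitted for one key.
    return word, sum(counts)

lines = ["big data big hadoop", "hadoop stores big data"]

# Shuffle: group all mapped pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In a real cluster, the map calls run on the machines where HDFS holds each block of input, and only the small (key, value) pairs travel over the network to the reducers: this is how Hadoop sends the computation to the data rather than the data to the computation.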
Thank You !
