What is Big Data?
Big Data basically refers to a huge volume of data that cannot be stored and processed
using the traditional approach within a given time frame.
The next big question that comes to mind is: how huge does this data need to be in
order to be classified as Big Data?
There is a lot of misconception around the term Big Data.
We usually use the term to refer to data that is in gigabytes, terabytes, petabytes,
exabytes, or anything larger than this in size. However, size alone does not define the
term Big Data completely.
Even a small amount of data can be referred to as Big Data, depending on the context
in which it is used.
Let me take an example and try to explain it to you.
For instance, if we try to attach a document that is 100 megabytes in size to an email,
we would not be able to do so, as the email system would not support an attachment of
this size.
Therefore, this 100-megabyte attachment, with respect to email, can be referred to as
Big Data.
Let me take another example and try to explain the term Big Data.
Let us say we have around 10 terabytes of image files upon which certain processing
needs to be done. For instance, we may want to resize and enhance these images
within a given time frame.
If we make use of a traditional system to perform this task, we would not be able to
accomplish it within the given time frame, as the computing resources of the traditional
system would not be sufficient to accomplish the task on time.
Therefore, these 10 terabytes of image files can be referred to as Big Data.
Now let us try to understand Big Data using some real-world examples.
I believe you are all aware of some of the popular social networking sites, such as
Facebook, Twitter, LinkedIn, Google Plus, and YouTube.
Each of these sites receives a huge volume of data on a daily basis.
It has been reported on some of the popular tech blogs that Facebook alone receives
around 100 terabytes of data each day, whereas Twitter processes around 400 million
tweets each day. As far as LinkedIn and Google Plus are concerned, each of these
sites receives tens of terabytes of data on a daily basis.
And finally, coming to YouTube, it has been reported that around 48 hours of fresh
video are uploaded to YouTube every minute.
You can just imagine how much data is being stored and processed on these sites.
But as the number of users on these sites keeps growing, storing and processing this
data becomes a challenging task.
Since this data holds a lot of valuable information, it needs to be processed within a
short span of time. By using this valuable information, companies can boost their sales
and generate more revenue.
Using a traditional computing system, we would not be able to accomplish this task
within the given time frame, as the computing resources of the traditional system would
not be sufficient for processing and storing such a huge volume of data.
Let me take another real-world example, related to the airline industry, and try to
explain the term Big Data.
While they are flying, aircraft keep transmitting data to the air traffic control located at
the airports. Air traffic control uses this data to track and monitor the status and
progress of each flight on a real-time basis.
Since multiple aircraft transmit this data simultaneously, a huge volume of data
accumulates at air traffic control within a short span of time.
Therefore, it becomes a challenging task to manage and process this huge volume of
data using the traditional approach.
Hence, we can term this huge volume of data Big Data.
How is Big Data classified?
Big Data can be classified into 3 different categories.
The first one is Structured Data. Data that has a proper format associated with it can
be referred to as Structured Data. For example, the data present within databases,
CSV files, and Excel spreadsheets can be referred to as Structured Data.
The next one is Semi-Structured Data. Data that has only a partial or loose format
associated with it can be referred to as Semi-Structured Data. For example, the data
present within emails, log files, and Word documents can be referred to as
Semi-Structured Data.
And the last one is Unstructured Data. Data that has no format associated with it can
be referred to as Unstructured Data. For example, image files, audio files, and video
files can be referred to as Unstructured Data.
This is how Big Data can be classified.
Characteristics of Big Data.
Big Data is characterized by 3 important characteristics.
The first one is Volume. Volume refers to the amount of data that is being generated.
The next one is Velocity. Velocity refers to the speed at which the data is being
generated.
And the last one is Variety. Variety refers to the different types of data that are being
generated.
These are the 3 important characteristics of Big Data.
Challenges Associated with Big Data.
There are 2 main challenges associated with Big Data.
The 1st challenge is: how do we store and manage such a huge volume of data
efficiently?
And the 2nd challenge is: how do we process and extract valuable information from this
huge volume of data within the given time frame?
These are the 2 main challenges associated with Big Data that led to the development
of the Hadoop framework.
How is Big Data stored and processed?
In the traditional approach, the data generated by organizations, by financial institutions
such as banks or stock markets, and by hospitals is given as input to an ETL system.
An ETL system would then Extract this data and Transform it, that is, convert it into a
proper format, and finally Load it into a database.
The end users can then generate reports and perform analytics by querying this data.
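To make the pattern concrete, here is a minimal extract-transform-load sketch in Java. The file name sales.csv, the sales table, and the connection details are hypothetical placeholders, and a JDBC driver for the target database is assumed to be on the classpath; this is an illustration of the ETL idea, not any particular vendor's system.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Minimal ETL sketch: extract rows from a CSV file, transform them into
// a proper format, and load them into a relational database. The file,
// table, and connection URL are hypothetical placeholders.
public class SimpleEtl {
    public static void main(String[] args) throws Exception {
        // Extract: read all records from the source file
        // (assumed to be a simple two-column CSV with no header).
        List<String> lines = Files.readAllLines(Paths.get("sales.csv"));

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/warehouse", "etl", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                 "INSERT INTO sales (region, amount) VALUES (?, ?)")) {
            for (String line : lines) {
                // Transform: normalize each record into a proper format.
                String[] fields = line.split(",");
                String region = fields[0].trim().toUpperCase();
                double amount = Double.parseDouble(fields[1].trim());
                // Load: insert the cleaned record into the database.
                stmt.setString(1, region);
                stmt.setDouble(2, amount);
                stmt.executeUpdate();
            }
        }
    }
}
```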
But as this data grows, it becomes a very challenging task to manage and process it
using this approach; this is one of the fundamental drawbacks of the traditional
approach.
Now let us try to understand some of the major drawbacks of the traditional approach.
The 1st drawback is cost: it is an expensive system that requires a lot of investment to
implement or upgrade, and it is therefore out of reach of small and mid-sized
companies.
The 2nd drawback is scalability: as the data grows, expanding the system is a
challenging task.
And the 3rd drawback is that it is time consuming: it takes a lot of time to process and
extract valuable information from the data.
I hope you have understood the traditional approach to storing and processing Big Data
and its associated drawbacks.
What is Hadoop?
Hadoop is an open-source framework, developed by Doug Cutting in 2006, and it is
managed by the Apache Software Foundation.
The project was named "Hadoop" after a yellow stuffed toy elephant that belonged to
Doug Cutting's son.
Hadoop is designed to store and process a huge volume of data efficiently.
The Hadoop framework comprises 2 main components.
The 1st component is HDFS; HDFS stands for Hadoop Distributed File System.
And the 2nd component is MapReduce.
HDFS takes care of storing and managing the data within the Hadoop Cluster, whereas
MapReduce takes care of processing and computing the data that is present within
HDFS.
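To give a feel for what MapReduce processing looks like, here is a minimal sketch of the classic word-count job, written against Hadoop's Java MapReduce API. The class names are our own choices; a complete job would also need a small driver class that submits it to the cluster, and a sketch of such a driver appears later, in the Data Locality discussion.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal word-count sketch: the map phase runs on the nodes holding
// the data blocks in HDFS; the reduce phase aggregates the counts.
public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for this word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```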
Now let us try to understand what actually makes up a Hadoop Cluster.
A Hadoop Cluster is made up of 2 main types of nodes.
The 1st one is the Master Node and the 2nd one is the Slave Node.
The Master Node is responsible for running the NameNode and JobTracker daemons.
For your information: Node is a technical term used to describe a machine or a
computer that is present within a cluster, and Daemon is a technical term used to
describe a background process running on a Linux machine.
The Slave Node, on the other hand, is responsible for running the DataNode and
TaskTracker daemons.
The NameNode and DataNode are responsible for storing and managing the data, and
they are commonly referred to as the Storage Nodes. The JobTracker and TaskTracker
are responsible for processing and computing the data, and they are commonly
referred to as the Compute Nodes.
Usually the NameNode and JobTracker are configured to run on a single machine,
whereas the DataNode and TaskTracker daemons are configured on multiple
machines, with instances running on many machines at the same time.
Apart from all this, we also have a Secondary NameNode as part of the Hadoop
Cluster, which we will discuss in a later session.
Important features of Hadoop.
In this session let us try to understand some of the important features offered by the
Hadoop framework.
The 1st important feature offered by Hadoop is that it is a cost-effective system.
What do we mean by this? Hadoop does not require any expensive or specialized
hardware in order to be implemented. In other words, it can be implemented on simple
hardware; such hardware components are technically referred to as Commodity
Hardware.
The next important feature on the list is that Hadoop supports a large cluster of nodes;
a Hadoop Cluster can be made up of hundreds or even thousands of nodes. One of the
main advantages of having a large cluster is that it offers more computing power and a
huge storage system to the clients.
The next important feature on the list is that Hadoop supports parallel processing of
data: the data can be processed simultaneously across all the nodes within the cluster,
thus saving a lot of time.
The next important feature offered by Hadoop is Distributed Data. The Hadoop
framework takes care of splitting and distributing the data across all the nodes within a
cluster. It also replicates the data over the entire cluster.
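As a rough sketch of what this looks like from a client's point of view, the snippet below copies a local file into HDFS and requests 3 replicas of its blocks via the Hadoop FileSystem Java API. The paths and the replication factor are illustrative assumptions; the actual cluster address comes from the client's Hadoop configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: store a file in HDFS and ask for 3 replicas of its blocks.
// The paths are hypothetical placeholders.
public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/images.tar");   // file on local disk
        Path remote = new Path("/data/images.tar"); // destination in HDFS

        // HDFS splits the file into blocks and distributes them
        // across the DataNodes in the cluster.
        fs.copyFromLocalFile(local, remote);

        // Each block is then replicated; here we request 3 copies.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}
```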
The next important feature on the list is Automatic Failover Management. If any node
within the cluster fails, the Hadoop framework replaces that particular machine with
another machine, and it replicates all the configuration settings and the data from the
failed machine onto the newly provisioned machine. Admins need not worry about this
once Automatic Failover Management has been properly configured on a cluster.
The next important feature on the list is Data Locality Optimization. It is one of the most
important features offered by the Hadoop framework. Let us try to understand what it
actually does. In a traditional approach, whenever a software program is executed, the
data is transferred from the data center to the machine where the program is being
executed.
For example, let us say the data required by our program is located at a data center in
the USA, and the program that requires this data is located in Singapore. Let us
assume the data required by our program is around 1 petabyte in size. Transferring
such a huge volume of data from the USA to Singapore would consume a lot of
bandwidth and time.
Hadoop eliminates this problem by transferring the code, which is only a few
megabytes in size, from Singapore to the data center located in the USA, and then
compiling and executing the code locally on the data.
Since this code is only a few megabytes in size, compared to the input data, which is 1
petabyte in size, this saves a lot of time and bandwidth.
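In MapReduce terms, this is exactly what job submission does: only the small job JAR travels to the cluster, and the framework then schedules each map task on, or close to, a node that already holds the corresponding HDFS block. The driver sketch below reuses the hypothetical WordCount classes from the earlier sketch; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: only this small JAR is shipped to the cluster; the
// input stays in HDFS, and map tasks are scheduled on the nodes that
// already hold the input blocks.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (placeholders).
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```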
The next important feature on the list is support for a Heterogeneous Cluster. This too
can be classified as one of the most important features offered by the Hadoop
framework.
We know that a Hadoop Cluster is made up of several nodes; again, Node is the
technical term used to refer to a machine within the cluster.
Let us try to understand what I mean by a Heterogeneous Cluster. A Heterogeneous
Cluster basically refers to a cluster within which each node can be from a different
vendor, and each node can be running a different version and flavor of operating
system.
Let us say our cluster is made up of 4 nodes. For instance, the 1st node is an IBM
machine running Red Hat Enterprise Linux, the 2nd node is an Intel machine running
Ubuntu, the 3rd node is an AMD machine running Fedora, and the last node is an HP
machine running CentOS.
The next important feature on the list is Scalability. Scalability refers to the ability to add
or remove nodes, or individual hardware components, to or from the cluster.
We can easily add or remove a node to or from a Hadoop Cluster without bringing down
or affecting the cluster's operation.
Even individual hardware components, such as RAM and hard drives, can be added to
or removed from a cluster on the fly.
