Madan Mohan Malaviya University of Technology, Gorakhpur
BIG DATA & HADOOP
Presented by:
Ankur Tripathi
B.Tech, 3rd Year
Computer Science and Engineering
What is BIG DATA?
Big Data:
• Big data is a term used to describe the voluminous amounts of unstructured and semi-structured data a company creates.
• Data that would take too much time, cost, and money to load into a relational database for analysis.
• Every day we create 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data.
• 90% of the data in the world today has been created in the last two years alone.
Data doubles roughly every 18 months!
Scale:
Every minute on the internet:
 204 million emails are sent!
 4 million Google searches are made!
 277,000 tweets are sent!
 2.4 million Facebook posts are created!
 And a lot more…
Source:
A flood of data is coming from:
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit-card transactions
 Social networks
BIG DATA: scale and source
BIG DATA: type
Structured:
• Mainframe
• SQL Server
• Oracle
• DB2
• Sybase
• Access, Excel, txt, etc.
• Teradata
Multi-structured:
• Emerging market data
• E-commerce
• Third-party data
• Weather
• Stock exchange
• Panel
• Syndicated data
Un-structured:
• Social media
• Chatter, txt
• Blogs
• Tweets
• Likes
• Followers
• Digital, video
• Audio
• Geo-spatial
BIG DATA: advantages
 Facebook: advertisements and suggestions
 Amazon: recommendations
 Google: analytics, search, etc.
 Twitter: trending topics
 Business risk analysis
 Improved business decisions
 Improved marketing strategy and targeting, e.g. that of Ford
 And a lot more…
Challenges created by BIG DATA?
Disk speed:
Traditional hard drive: 60-100 MB/s
Solid-state disk (SSD): 250-500 MB/s
Processing time, for reading a 1 TB file sequentially:
Traditional hard disk: 10,000 seconds ≈ 167 minutes (~3 hrs)
Solid-state disk: 2,000 seconds ≈ 33 mins
So the main problem (storage and analysis): disk speed is increasing almost linearly, whereas BIG DATA is growing exponentially!
Other problems:
Risk of machine failure, backup problems, expense.
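The read-time figures above follow directly from size ÷ throughput; a quick sketch (using decimal units, 1 TB = 1,000,000 MB, for round numbers):

```python
def read_time_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to stream a file sequentially at a given disk throughput."""
    return size_mb / throughput_mb_per_s

one_tb_mb = 1_000_000  # 1 TB in MB (decimal units)

hdd = read_time_seconds(one_tb_mb, 100)  # traditional drive at ~100 MB/s
ssd = read_time_seconds(one_tb_mb, 500)  # SSD at ~500 MB/s

print(hdd, hdd / 60)  # 10,000 s, roughly 167 minutes
print(ssd, ssd / 60)  # 2,000 s, roughly 33 minutes
```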
So What Do We Do?
Process data in parallel?
An idea: parallelism (multiple processors or CPUs in a single machine)
 Process (read/write) data in parallel instead of sequentially.
 Processing speed increases.
Problems:
 Processing speed grows faster than seek speed
 Deadlock
 Synchronisation
 Limited bandwidth
 Drive failure
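The idea of splitting work across CPUs can be sketched in a few lines; this is a toy single-machine example (a thread pool stands in for multiple processors, and all names here are illustrative, not from any Hadoop API):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> int:
    # The per-worker task: process one slice of the input.
    return len(chunk.split())

def parallel_word_count(text: str, workers: int = 4) -> int:
    # Split on line boundaries so no word straddles two chunks.
    lines = text.splitlines()
    step = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    # Fan the chunks out to the pool, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))
```

Note that the combine step (`sum`) is exactly the synchronisation point the slide warns about: every worker's result must arrive before the answer exists.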
What next: Distributed Computing?
Distributed Computing
Yes, we have distributed computing, but it also comes with a few challenges!
Hardware failure:
As soon as we start using many pieces of hardware, the chance that one will fail is fairly high.
 Combining the data after analysis:
Data read from one disk may need to be combined with data from any of the other 99 disks for analysis.
Getting the data to the processors becomes the bottleneck.
Requirements of a new approach
 The system should manage and heal itself
 Automatically and transparently route around failures
 Speculatively execute redundant copies of tasks if certain nodes are detected to be slow
 Data recoverability:
If a component of the system fails, its workload should be assumed by still-functioning units in the system.
 Consistency:
Component failures during execution of a job should not affect the outcome of the job.
 Scalability:
Increasing resources should support a proportional increase in load capacity.
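Speculative execution, one of the requirements above, is simple to sketch: run identical copies of a task and keep whichever finishes first, abandoning the stragglers. This is an illustrative thread-based sketch, not how Hadoop actually schedules speculative attempts:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_speculatively(task, replicas: int = 2):
    """Launch identical copies of a task; return the first result to finish."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task) for _ in range(replicas)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for straggler in not_done:
            straggler.cancel()  # best effort: abandon copies not yet started
        return next(iter(done)).result()
```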
To address these issues, Hadoop comes in!
To The Rescue!
Apache Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
A common way of avoiding data loss is through replication; the Hadoop Distributed Filesystem (HDFS) takes care of this problem.
The problem of combining data after analysis is solved by a simple programming model: MapReduce.
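Replication in HDFS is a per-block setting; the cluster-wide default lives in `hdfs-site.xml` via the `dfs.replication` property (3 copies is the stock default):

```xml
<!-- hdfs-site.xml: each block is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```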
Hadoop is a framework of tools that supports running applications on BIG DATA.
History of Hadoop!
In 1995: search engines such as Excite, AltaVista and Lycos dominated the web.
In 2000: why did Google succeed?
In 2003: Google released its paper on the Google File System (GFS).
In 2004: Google released its paper on MapReduce.
In 2005: Doug Cutting and Mike Cafarella, using the ideas from the Google papers, developed Hadoop; Cutting named it after his son's toy elephant.
In 2006: Yahoo donated the project to Apache.
HDFS: Hadoop Distributed Filesystem
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information.
GFS-like master-slave design:
 Master: a single NameNode managing filesystem metadata
 Slaves: multiple DataNodes storing the data
 One more: a Secondary NameNode for checkpointing
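The split of responsibilities can be modelled in a few lines: the NameNode holds only metadata (which blocks make up a file, and where they live), while block bytes sit on DataNodes. This is a toy model with made-up names, not HDFS's actual API:

```python
class DataNode:
    """Stores raw block bytes; knows nothing about files."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Holds only metadata: which blocks form a file, and on which node."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}  # path -> list of (block_id, datanode index)

    def write(self, path, data, block_size=2):
        self.metadata[path] = []
        for i in range(0, len(data), block_size):
            block_id = f"{path}#{i}"
            node = (i // block_size) % len(self.datanodes)  # trivial placement
            self.datanodes[node].blocks[block_id] = data[i:i + block_size]
            self.metadata[path].append((block_id, node))

    def read(self, path):
        # Look up the block map, then fetch each block from its DataNode.
        return b"".join(self.datanodes[n].blocks[b]
                        for b, n in self.metadata[path])
```

Real HDFS adds replication, heartbeats and checkpointing on top of exactly this division of labour.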
HDFS Architecture (diagram: NameNode, Secondary NameNode, DataNodes)
MapReduce
 MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
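The canonical illustration is word count. Here is a single-process sketch of the three phases - map, shuffle, reduce - that a real Hadoop job distributes across the cluster:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input record.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the counts that were shuffled to this key.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle phase: group every mapper's output by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())
```

Because each mapper sees only its own records and each reducer only its own key, both phases parallelise across machines with no shared state.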
Combining the Two
(diagram) The master node runs the NameNode and the JobTracker, backed up by the Secondary NameNode; each of the 1000s of slave nodes runs a DataNode and a TaskTracker.
Hadoop Ecosystem
Who Uses Hadoop?
The Aadhaar project by the Govt. of India uses Hadoop!