Madan Mohan Malaviya University of Technology, Gorakhpur
BIG DATA & HADOOP
Presented by:
Ankur Tripathi
B.Tech, 3rd Year
Computer Science and Engineering
What is BIG DATA?
Big Data:
• Big data is a term used to describe the voluminous amounts of unstructured and semi-structured data a company creates.
• Data that would take too much time, cost, and money to load into a relational database for analysis.
• Every day we create 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data.
• 90% of the data in the world today has been created in the last two years alone.
Data doubles roughly every 18 months!
Scale:
Every minute on the internet:
 204 million emails are sent!
 4 million Google searches are made!
 277,000 tweets are sent!
 2.4 million Facebook posts are created!
 And a lot more…
Source:
A flood of data is coming from:
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit-card transactions
 Social networks
BIG DATA: scale and source
BIG DATA: type
Structured:
• Mainframe
• SQL Server
• Oracle
• DB2
• Sybase
• Access, Excel, txt, etc.
• Teradata
Multi-structured:
• Emerging market data
• E-commerce
• Third-party data
• Weather
• Stock exchange
• Panel
• Syndicated data
Un-structured:
• Social media
• Chatter, txt
• Blogs
• Tweets
• Likes
• Followers
• Digital, video
• Audio
• Geo-spatial
BIG DATA: advantages
 Facebook: advertisements and suggestions
 Amazon: recommendations
 Google: analytics, search, etc.
 Twitter: trending topics
 Business risk analysis
 Improved business decisions
 Improved marketing strategy and targeting, e.g. that of Ford
 And a lot more…
Challenges created by BIG DATA?
Disk speed:
Traditional hard drive: 60-100 MB/s
Solid-state disk (SSD): 250-500 MB/s
Processing time, for reading a 1 TB file sequentially:
Traditional hard disk: 10,000 seconds ≈ 167 minutes (~3 hrs)
Solid-state disk: 2,000 seconds ≈ 33 mins
So the main problem (storage and analysis): disk speed is increasing almost linearly, whereas BIG DATA is growing exponentially!
Other problems:
Risk of machine failure, backup problems, expense.
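The read-time figures above follow directly from size ÷ throughput; a quick sketch (using decimal units, 1 TB = 1,000,000 MB, for round numbers):

```python
def read_time_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to stream a file sequentially at a given disk throughput."""
    return size_mb / throughput_mb_per_s

one_tb_mb = 1_000_000  # 1 TB in MB (decimal units)

hdd = read_time_seconds(one_tb_mb, 100)  # traditional drive at ~100 MB/s
ssd = read_time_seconds(one_tb_mb, 500)  # SSD at ~500 MB/s

print(hdd, hdd / 60)  # 10,000 s, roughly 167 minutes
print(ssd, ssd / 60)  # 2,000 s, roughly 33 minutes
```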
So What Do We Do?
Process data in parallel?
An idea: parallelism (multiple processors or CPUs in a single machine)
 Process (read/write) data in parallel instead of sequentially.
 Processing speed increases.
Problems:
 Processing speed grows faster than seek speed
 Deadlock
 Synchronisation
 Limited bandwidth
 Drive failure
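The idea of splitting work across CPUs can be sketched in a few lines; this is a toy single-machine example (a thread pool stands in for multiple processors, and all names here are illustrative, not from any Hadoop API):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> int:
    # The per-worker task: process one slice of the input.
    return len(chunk.split())

def parallel_word_count(text: str, workers: int = 4) -> int:
    # Split on line boundaries so no word straddles two chunks.
    lines = text.splitlines()
    step = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    # Fan the chunks out to the pool, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))
```

Note that the combine step (`sum`) is exactly the synchronisation point the slide warns about: every worker's result must arrive before the answer exists.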
What next: Distributed Computing?
Distributed Computing
Yes, we have distributed computing, but it also comes with a few challenges!
Hardware failure:
As soon as we start using many pieces of hardware, the chance that one will fail is fairly high.
 Combining the data after analysis:
Data read from one disk may need to be combined with data from any of the other 99 disks for analysis.
Getting the data to the processors becomes the bottleneck.
Requirements of a new approach
 The system should manage and heal itself
 Automatically and transparently route around failures
 Speculatively execute redundant copies of tasks if certain nodes are detected to be slow
 Data recoverability:
If a component of the system fails, its workload should be assumed by still-functioning units in the system.
 Consistency:
Component failures during execution of a job should not affect the outcome of the job.
 Scalability:
Increasing resources should support a proportional increase in load capacity.
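Speculative execution, one of the requirements above, is simple to sketch: run identical copies of a task and keep whichever finishes first, abandoning the stragglers. This is an illustrative thread-based sketch, not how Hadoop actually schedules speculative attempts:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_speculatively(task, replicas: int = 2):
    """Launch identical copies of a task; return the first result to finish."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task) for _ in range(replicas)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for straggler in not_done:
            straggler.cancel()  # best effort: abandon copies not yet started
        return next(iter(done)).result()
```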
To address these issues, Hadoop comes in!
To The Rescue!
Apache Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
A common way of avoiding data loss is through replication; the Hadoop Distributed Filesystem (HDFS) takes care of this problem.
The problem of combining data after analysis is solved by a simple programming model: MapReduce.
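Replication in HDFS is a per-block setting; the cluster-wide default lives in `hdfs-site.xml` via the `dfs.replication` property (3 copies is the stock default):

```xml
<!-- hdfs-site.xml: each block is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```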
Hadoop is a framework of tools that supports running applications on BIG DATA.
History of Hadoop!
In 1995: search engines such as Excite, AltaVista and Lycos dominated the web.
In 2000: why did Google succeed?
In 2003: Google released its paper on the Google File System (GFS).
In 2004: Google released its paper on MapReduce.
In 2005: Doug Cutting and Mike Cafarella, using the ideas from the Google papers, developed Hadoop; Cutting named it after his son's toy elephant.
In 2006: Yahoo donated the project to Apache.
HDFS: Hadoop Distributed Filesystem
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information.
GFS-like master-slave design:
 Master: a single NameNode managing filesystem metadata
 Slaves: multiple DataNodes storing the data
 One more: a Secondary NameNode for checkpointing
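The split of responsibilities can be modelled in a few lines: the NameNode holds only metadata (which blocks make up a file, and where they live), while block bytes sit on DataNodes. This is a toy model with made-up names, not HDFS's actual API:

```python
class DataNode:
    """Stores raw block bytes; knows nothing about files."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Holds only metadata: which blocks form a file, and on which node."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}  # path -> list of (block_id, datanode index)

    def write(self, path, data, block_size=2):
        self.metadata[path] = []
        for i in range(0, len(data), block_size):
            block_id = f"{path}#{i}"
            node = (i // block_size) % len(self.datanodes)  # trivial placement
            self.datanodes[node].blocks[block_id] = data[i:i + block_size]
            self.metadata[path].append((block_id, node))

    def read(self, path):
        # Look up the block map, then fetch each block from its DataNode.
        return b"".join(self.datanodes[n].blocks[b]
                        for b, n in self.metadata[path])
```

Real HDFS adds replication, heartbeats and checkpointing on top of exactly this division of labour.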
HDFS Architecture (diagram: NameNode, Secondary NameNode, DataNodes)
MapReduce
 MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
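The canonical illustration is word count. Here is a single-process sketch of the three phases - map, shuffle, reduce - that a real Hadoop job distributes across the cluster:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input record.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the counts that were shuffled to this key.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle phase: group every mapper's output by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())
```

Because each mapper sees only its own records and each reducer only its own key, both phases parallelise across machines with no shared state.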
Combining the Two
(diagram) The master node runs the NameNode and the JobTracker, backed up by the Secondary NameNode; each of the 1000s of slave nodes runs a DataNode and a TaskTracker.
Hadoop Ecosystem
Who Uses Hadoop?
The Aadhaar project by the Govt. of India uses Hadoop!