A Presentation on Big Data & Hadoop
Submitted To:-
Mrs. Sonika Narang
Mrs. Poonam Beri
Submitted By:-
Ms. Shabnam
34633
Big data is a collection of data so large and complex that it becomes
difficult to process using traditional data processing applications.
Black Box Data
Social Media Data
Stock Exchange Data
Power Grid Data
Transport Data
The 3Vs: Characterizing BIG DATA
Volume
Variety
Velocity
TYPES OF BIG DATA
 Structured Data:- Relational Data
 Semi-Structured Data:- XML Data
 Unstructured Data:- PDF, Word, Text, Media Logs, etc.
 Daily, 0.5 PB of data is added on FACEBOOK, including 40 million PHOTOS.
 Daily, enough video is uploaded to YOUTUBE to be watched continuously for 1 year.
 Big data also affects INTERNET SEARCH, FINANCE & BUSINESS INFORMATION.
 Challenges include the CAPTURE, SEARCHING, SHARING, ANALYSIS, STORAGE & VISUALIZATION of data.
LIMITATION
Can’t Deal With Huge Amounts of Data
SO THE TRADITIONAL APPROACH FAILS
Then the ACTUAL SOLUTION for BIG DATA is named HADOOP
 A software framework for distributed processing of large datasets
across large clusters of computers
 Large datasets → terabytes or petabytes of data
 Large clusters → hundreds or thousands of nodes
 An open-source implementation of Google’s MAPREDUCE
 Based on a simple data model: any data will fit
 2005: Doug Cutting and Michael J. Cafarella and team developed Hadoop
to support distribution for the Nutch search engine project.
 Doug named it after his son's toy elephant
 The project was funded by YAHOO
 2006: Yahoo gave the project to APACHE SOFTWARE FOUNDATION.
WHO USES HADOOP?
Architecture of Hadoop
MapReduce
HDFS
Hadoop Common
 A software framework for distributed computation over huge
amounts of data.
 Consists of two main phases
◦ Map
◦ Reduce
 The Map Task: breaks the input down into individual (key, value)
elements.
 The Reduce Task: takes the output of the map tasks as input and
combines it into a smaller set of results.
How MapReduce Works??
Input:
    We Love India
    We Play Cricket
MAP output:
    We 1, Love 1, India 1
    We 1, Play 1, Cricket 1
REDUCE output:
    We 2, Love 1, India 1, Play 1, Cricket 1
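The word-count flow above can be sketched in Python (a minimal, framework-free sketch of the Map and Reduce phases; a real Hadoop job would implement Mapper and Reducer classes in Java or use Hadoop Streaming):

```python
from collections import defaultdict

def map_phase(lines):
    # The Map task: break each input line into (word, 1) pairs.
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    # The Reduce task: combine the counts for each word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["We Love India", "We Play Cricket"]
print(reduce_phase(map_phase(lines)))
# {'We': 2, 'Love': 1, 'India': 1, 'Play': 1, 'Cricket': 1}
```

In Hadoop the (word, 1) pairs would be shuffled across the cluster so that all pairs with the same key reach the same reducer; here the grouping happens in one dictionary.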
HDFS
The distributed file system used by Hadoop is HDFS.
It is based on the Google File System (GFS).
Designed to run on clusters of thousands of small
computers.
HDFS uses a MASTER-SLAVE ARCHITECTURE
 The master node is called the NameNode.
 The slave node is called the DataNode.
 The master (NameNode) manages the file system metadata.
 The slaves (DataNodes) store the actual data.
 A file in HDFS is split into several blocks.
 Blocks are stored in a set of DataNodes.
 The NameNode maps blocks to the DataNodes.
 The DataNodes take care of read, write, creation and deletion
operations based on instructions given by the NameNode.
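To make the block splitting concrete, here is a small Python sketch. The 128 MB block size is an assumption (a common HDFS default; the actual value is configurable per cluster via `dfs.blocksize`):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB: a common HDFS default, configurable

def split_into_blocks(file_size_bytes):
    # The NameNode records how many blocks a file occupies and on which
    # DataNodes they live; here we only compute the block count and the
    # size of the final (possibly partial) block.
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last_block = file_size_bytes - (num_blocks - 1) * BLOCK_SIZE
    return num_blocks, last_block

# A 300 MB file splits into three blocks: 128 MB + 128 MB + 44 MB.
num_blocks, last_block = split_into_blocks(300 * 1024 * 1024)
print(num_blocks, last_block // (1024 * 1024))  # 3 44
```

Unlike a general-purpose file system's small pages, these large blocks keep the NameNode's metadata small and make sequential reads efficient.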
HADOOP COMMON
Provides access to HDFS.
Contains Java libraries and utilities.
Contains the necessary Java files &
scripts to start HADOOP.
ADVANTAGES OF HADOOP
• Designed to detect & handle
failures.
• Automatic distribution of data across
the machines.
• Doesn’t rely on hardware for fault
tolerance.
• Servers can be added or removed
dynamically.
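The fault-tolerance advantage comes from replicating each block on several DataNodes. The sketch below illustrates the idea (a toy model assuming HDFS's default replication factor of 3; real HDFS placement is also rack-aware, which this ignores):

```python
import random

REPLICATION = 3  # HDFS's default replication factor (dfs.replication)

def place_replicas(num_blocks, nodes):
    # Copy each block onto REPLICATION distinct DataNodes.
    return {b: random.sample(nodes, REPLICATION) for b in range(num_blocks)}

def survives_failure(placement, failed_node):
    # A block stays readable if at least one replica is on a healthy node.
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

nodes = [f"datanode-{i}" for i in range(10)]
placement = place_replicas(20, nodes)
# With 3 distinct replicas per block, no single DataNode failure loses data.
print(all(survives_failure(placement, n) for n in nodes))  # True
```

When a DataNode actually fails, the NameNode notices the missing heartbeats and schedules re-replication of the under-replicated blocks onto the remaining nodes.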
ANY QUERIES????