SlideShare a Scribd company logo
1 of 29
Download to read offline
Hadoop Distributed File System

                Dhruba Borthakur
   Apache Hadoop Project Management Committee
              dhruba@apache.org
             dhruba@facebook.com
Hadoop, Why?
•  Need to process huge datasets on large clusters
   of computers
•  Nodes fail every day
   – Failure is expected, rather than exceptional.
   – The number of nodes in a cluster is not constant.
•  Very expensive to build reliability into each
   application.
•  Need common infrastructure
   – Efficient, reliable, easy to use
   – Open Source, Apache License
Hadoop History
•    Dec 2004 – Google paper published
•    July 2005 – Nutch uses new MapReduce implementation
•    Jan 2006 – Doug Cutting joins Yahoo!
•    Feb 2006 – Hadoop becomes a new Lucene subproject
•    Apr 2007 – Yahoo! running Hadoop on 1000-node cluster
•    Jan 2008 – An Apache Top Level Project
•    Feb 2008 – Yahoo! production search index with Hadoop
•    July 2008 – First reports of a 4000-node cluster
Who uses Hadoop?
•    Amazon/A9
•    Facebook
•    Google
•    IBM
•    Joost
•    Last.fm
•    New York Times
•    PowerSet
•    Veoh
•    Yahoo!
What is Hadoop used for?
•  Search
   –  Yahoo, Amazon, Zvents,
•  Log processing
   –  Facebook, Yahoo, ContextWeb. Joost, Last.fm
•  Recommendation Systems
   –  Facebook
•  Data Warehouse
   –  Facebook, AOL
•  Video and Image Analysis
   –  New York Times, Eyealike
Public Hadoop Clouds
•  Hadoop Map-reduce on Amazon EC2 instances
    –  http://wiki.apache.org/hadoop/AmazonEC2
•  IBM Blue Cloud
   –  Partnering with Google to offer web-scale infrastructure
•  Global Cloud Computing Testbed
   –  Joint effort by Yahoo, HP and Intel
Commodity Hardware



Typically in 2 level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Goals of HDFS
•  Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
•  Assumes Commodity Hardware
   – Files are replicated to handle hardware failure
   – Detect failures and recovers from them
•  Optimized for Batch Processing
   – Data locations exposed so that computations can
   move to where data resides
   – Provides very high aggregate bandwidth
Distributed File System
•  Single Namespace for entire cluster
•  Data Coherency
   – Write-once-read-many access model
   – Client can only append to existing files
•  Files are broken up into blocks
   – Typically 128 MB block size
   – Each block replicated on multiple DataNodes
•  Intelligent Client
   – Client can find location of blocks
   – Client accesses data directly from DataNode
Functions of a NameNode
•  Manages File System Namespace
   – Maps a file name to a set of blocks
   – Maps a block to the DataNodes where
   it resides
•  Cluster Configuration Management
•  Replication Engine for Blocks
NameNode Metadata
•  Meta-data in Memory
   – The entire metadata is in main memory
   – No demand paging of FS meta-data
•  Types of Metadata
   – List of files
   – List of Blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g access time, replication factor
•  A Transaction Log
   – Records file creations, file deletions. etc
DataNode
•  A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores meta-data of a block (e.g. CRC)
   – Serves data and meta-data to Clients
•  Block Report
   – Periodically sends a report of all existing blocks to
   the NameNode
•  Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes
Block Placement
•  Current Strategy
   -- One replica on random node on local rack
   -- Second replica on a random remote rack
   -- Third replica on same remote rack
   -- Additional replicas are randomly placed
•  Clients read from nearest replica
•  Would like to make this policy pluggable
Replication Engine
•  NameNode detects DataNode failures
   – Chooses new DataNodes for new
   replicas
   – Balances disk usage
   – Balances communication traffic to
   DataNodes
Data Correctness
•  Use Checksums to validate data
   – Use CRC32
•  File Creation
   – Client computes checksum per 512 byte
   – DataNode stores the checksum
•  File access
   – Client retrieves the data and checksum
   from DataNode
   – If Validation fails, Client tries other replicas
Namenode Failure
•  A single point of failure
•  Transaction Log stored in multiple
   directories
   – A directory on the local file system
   – A directory on a remote file system
   (NFS/CIFS)
•  Need to develop a real HA solution
Data Pipelining
•  Client retrieves a list of DataNodes on
   which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to
   the next DataNode in the Pipeline
•  When all replicas are written, the Client
   moves on to the next block in file
Secondary NameNode
•  Copies FsImage and Transaction Log from
   NameNode to a temporary directory
•  Merges FSImage and Transaction Log into a
   new FSImage in temporary directory
•  Uploads new FSImage to the NameNode
   – Transaction Log on NameNode is purged
User Interface
•  Command for HDFS User:
   – hadoop dfs -mkdir /foodir
   – hadoop dfs -cat /foodir/myfile.txt
   – hadoop dfs -rm /foodir myfile.txt
•  Command for HDFS Administrator
   – hadoop dfsadmin -report
   – hadoop dfsadmin -decommission datanodename
•  Web Interface
   – http://host:port/dfshealth.jsp
Hadoop Map/Reduce
•  Implementation of the Map-Reduce programming model
   – Framework for distributed processing of large data sets
   – Data handled as collections of key-value pairs
   – Pluggable user code runs in generic framework
•  Very common design pattern in data processing
   – Demonstrated by a unix pipeline example:
   cat * | grep | sort | unique -c | cat > file
   input | map | shuffle | reduce | output
•  Natural for:
    – Log processing
    – Web search indexing
    – Ad-hoc queries
Hadoop Subprojects
•  Hive
  –  A Data Warehouse with SQL support
•  HBase
  – table storage for semi-structured data
•  Zookeeper
  – coordinating distributed applications
•  Mahout
  –  Machine learning
Hadoop at Facebook
•  Hardware
  –  4800 cores, 600 machines, 16GB/8GB per
     machine – Nov 2008
  –  4 SATA disks of 1 TB each per machine
  –  2 level network hierarchy, 40 machines per
     rack
Hadoop at Facebook
•  Single HDFS cluster across all cores
   –  2 PB raw capacity
   –  Ingest rate is 2 TB compressed per day
          – 10 TB uncompressed
   –  12 Million files
Hadoop Growth at Facebook
Data Flow at Facebook



Web
Servers
           Scribe
Servers



                                                  Network

                                                  Storage





 Oracle
RAC
   Hadoop
Cluster
           MySQL

Hadoop Usage at Facebook
•  Statistics per day:
   –  55TB of compressed data scanned per day
   –  3200+ jobs on production cluster per day
   –  80M compute minutes per day
•  Barrier to entry is significantly reduced:
   –  SQL like language called Hive
   –  http://hadoop.apache.org/hive/
Useful Links
•  HDFS Design:
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html

•  Hadoop API:
   – http://lucene.apache.org/hadoop/api/
•  Hadoop Wiki
   –  http://wiki.apache.org/hadoop/

More Related Content

What's hot

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architectureBishal Khanal
 
Database replication
Database replicationDatabase replication
Database replicationArslan111
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Differencejeetendra mandal
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptxSurya937648
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitEric Wendelin
 

What's hot (20)

Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Parallel Database
Parallel DatabaseParallel Database
Parallel Database
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Data models in NoSQL
Data models in NoSQLData models in NoSQL
Data models in NoSQL
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Database replication
Database replicationDatabase replication
Database replication
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptx
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Mongodb
MongodbMongodb
Mongodb
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
 

Viewers also liked

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3Sung-jae Park
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfsdatabloginfo
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapakapa rohit
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_SystemBrett Keim
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesNitin Khattar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewelleryChitradevi
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems Ioanna Tsalouchidou
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger SulbhaSulbha Bakshi
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureAngelo Gino Varrati
 

Viewers also liked (20)

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_System
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenes
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewellery
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger Sulbha
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azure
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
 

Similar to Hadoop Distributed File System

Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-researchsaintdevil163
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopMedia Gorod
 

Similar to Hadoop Distributed File System (20)

Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-research
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Big data and hadoop anupama
Big data and hadoop anupamaBig data and hadoop anupama
Big data and hadoop anupama
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 

More from elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 

More from elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Hadoop Distributed File System

  • 1. Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
  • 2. Hadoop, Why? •  Need to process huge datasets on large clusters of computers •  Nodes fail every day – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant. •  Very expensive to build reliability into each application. •  Need common infrastructure – Efficient, reliable, easy to use – Open Source, Apache License
  • 3. Hadoop History •  Dec 2004 – Google paper published •  July 2005 – Nutch uses new MapReduce implementation •  Jan 2006 – Doug Cutting joins Yahoo! •  Feb 2006 – Hadoop becomes a new Lucene subproject •  Apr 2007 – Yahoo! running Hadoop on 1000-node cluster •  Jan 2008 – An Apache Top Level Project •  Feb 2008 – Yahoo! production search index with Hadoop •  July 2008 – First reports of a 4000-node cluster
  • 4. Who uses Hadoop? •  Amazon/A9 •  Facebook •  Google •  IBM •  Joost •  Last.fm •  New York Times •  PowerSet •  Veoh •  Yahoo!
  • 5. What is Hadoop used for? •  Search –  Yahoo, Amazon, Zvents, •  Log processing –  Facebook, Yahoo, ContextWeb. Joost, Last.fm •  Recommendation Systems –  Facebook •  Data Warehouse –  Facebook, AOL •  Video and Image Analysis –  New York Times, Eyealike
  • 6. Public Hadoop Clouds •  Hadoop Map-reduce on Amazon EC2 instances –  http://wiki.apache.org/hadoop/AmazonEC2 •  IBM Blue Cloud –  Partnering with Google to offer web-scale infrastructure •  Global Cloud Computing Testbed –  Joint effort by Yahoo, HP and Intel
  • 7. Commodity Hardware Typically in 2 level architecture – Nodes are commodity PCs – 30-40 nodes/rack – Uplink from rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 8. Goals of HDFS •  Very Large Distributed File System – 10K nodes, 100 million files, 10 PB •  Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recovers from them •  Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth
  • 9. Distributed File System •  Single Namespace for entire cluster •  Data Coherency – Write-once-read-many access model – Client can only append to existing files •  Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes •  Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
  • 10.
  • 11. Functions of a NameNode •  Manages File System Namespace – Maps a file name to a set of blocks – Maps a block to the DataNodes where it resides •  Cluster Configuration Management •  Replication Engine for Blocks
  • 12. NameNode Metadata •  Meta-data in Memory – The entire metadata is in main memory – No demand paging of FS meta-data •  Types of Metadata – List of files – List of Blocks for each file – List of DataNodes for each block – File attributes, e.g access time, replication factor •  A Transaction Log – Records file creations, file deletions. etc
  • 13. DataNode •  A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients •  Block Report – Periodically sends a report of all existing blocks to the NameNode •  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 14. Block Placement •  Current Strategy -- One replica on random node on local rack -- Second replica on a random remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed •  Clients read from nearest replica •  Would like to make this policy pluggable
  • 15. Replication Engine •  NameNode detects DataNode failures – Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
  • 16. Data Correctness •  Use Checksums to validate data – Use CRC32 •  File Creation – Client computes checksum per 512 byte – DataNode stores the checksum •  File access – Client retrieves the data and checksum from DataNode – If Validation fails, Client tries other replicas
  • 17. Namenode Failure •  A single point of failure •  Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) •  Need to develop a real HA solution
  • 18. Data Pipelining •  Client retrieves a list of DataNodes on which to place replicas of a block •  Client writes block to the first DataNode •  The first DataNode forwards the data to the next DataNode in the Pipeline •  When all replicas are written, the Client moves on to the next block in file
  • 19. Secondary NameNode •  Copies FsImage and Transaction Log from NameNode to a temporary directory •  Merges FSImage and Transaction Log into a new FSImage in temporary directory •  Uploads new FSImage to the NameNode – Transaction Log on NameNode is purged
  • 20. User Interface •  Command for HDFS User: – hadoop dfs -mkdir /foodir – hadoop dfs -cat /foodir/myfile.txt – hadoop dfs -rm /foodir myfile.txt •  Command for HDFS Administrator – hadoop dfsadmin -report – hadoop dfsadmin -decommission datanodename •  Web Interface – http://host:port/dfshealth.jsp
  • 21. Hadoop Map/Reduce •  Implementation of the Map-Reduce programming model – Framework for distributed processing of large data sets – Data handled as collections of key-value pairs – Pluggable user code runs in generic framework •  Very common design pattern in data processing – Demonstrated by a unix pipeline example: cat * | grep | sort | unique -c | cat > file input | map | shuffle | reduce | output •  Natural for: – Log processing – Web search indexing – Ad-hoc queries
  • 22.
  • 23. Hadoop Subprojects •  Hive –  A Data Warehouse with SQL support •  HBase – table storage for semi-structured data •  Zookeeper – coordinating distributed applications •  Mahout –  Machine learning
  • 24. Hadoop at Facebook •  Hardware –  4800 cores, 600 machines, 16GB/8GB per machine – Nov 2008 –  4 SATA disks of 1 TB each per machine –  2 level network hierarchy, 40 machines per rack
  • 25. Hadoop at Facebook •  Single HDFS cluster across all cores –  2 PB raw capacity –  Ingest rate is 2 TB compressed per day – 10 TB uncompressed –  12 Million files
  • 26. Hadoop Growth at Facebook
  • 27. Data Flow at Facebook Web
Servers
 Scribe
Servers
 Network
 Storage
 Oracle
RAC
 Hadoop
Cluster
 MySQL

  • 28. Hadoop Usage at Facebook •  Statistics per day: –  55TB of compressed data scanned per day –  3200+ jobs on production cluster per day –  80M compute minutes per day •  Barrier to entry is significantly reduced: –  SQL like language called Hive –  http://hadoop.apache.org/hive/
  • 29. Useful Links •  HDFS Design: – http://hadoop.apache.org/core/docs/current/hdfs_design.html •  Hadoop API: – http://lucene.apache.org/hadoop/api/ •  Hadoop Wiki –  http://wiki.apache.org/hadoop/