The Family of Hadoop
            Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com




Transcript of "The Family of Hadoop"

  1. The Family of Hadoop
     Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com
     Barcamp Saigon, December 13, 2009
  2. Content
     • History
     • Sub-projects
     • HDFS
     • MapReduce
     • HBase
     • Hive
  3. History
     • Hadoop was created by Doug Cutting, the creator of Lucene.
     • Lucene: an open source indexing and search library.
     • Nutch: a Lucene-based web crawler.
     • Jun 2003: a 100-million-page Nutch demo system ran successfully.
     • Nutch's problem: its architecture could not scale to billions of pages.
  4. History
     • Oct 2003: Google published the paper “The Google File System”.
     • 2004: the Nutch team wrote an open source implementation of GFS, called the Nutch Distributed File System (NDFS).
     • Dec 2004: Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.
     • 2005: the Nutch team implemented MapReduce in Nutch.
     • Mid 2005: all the major Nutch algorithms had been ported to run on MapReduce and NDFS.
  5. History
     • Feb 2006: Nutch's NDFS and its MapReduce implementation formed the Hadoop project.
     • Doug Cutting joined Yahoo!.
     • Jan 2008: Hadoop became an Apache top-level project.
     • Feb 2008: Yahoo!'s production search index was generated by a 10,000-core Hadoop cluster.
  6. History
     Source: http://wiki.apache.org/hadoop/PoweredBy
  7. Sub-projects
  8. Architecture
  9. Data Model
     • Files are stored as blocks (default size: 64 MB).
     • Reliability through replication
       – Each block is replicated to several datanodes.
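The block arithmetic behind this slide can be sketched in a few lines of Python. The 64 MB block size comes from the slide; the replication factor of 3 is an assumption (the slide only says "several"), and `block_layout` is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size, from the slide
REPLICATION = 3                 # assumed value; the slide only says "several"


def block_layout(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, size of the last block, raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial block still counts as one block.
    num_blocks = (file_size + block_size - 1) // block_size
    last_block = file_size - (num_blocks - 1) * block_size
    # Every byte of the file is stored once per replica.
    raw_bytes = file_size * replication
    return num_blocks, last_block, raw_bytes


if __name__ == "__main__":
    # A 200 MB file: three full blocks plus one 8 MB block.
    print(block_layout(200 * 1024 * 1024))
```

Under these assumptions a 200 MB file occupies four blocks and 600 MB of raw disk space across the cluster.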
  10. Namenode & Datanodes
     • Namenode (master)
       – manages the filesystem namespace
       – maintains the filesystem tree and the metadata for all files and directories in the tree
     • Datanodes (slaves)
       – store data in the local file system
       – periodically report back to the namenode with lists of all the blocks they hold
     • Clients communicate with both the namenode and the datanodes.
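The division of labour above can be illustrated with a toy model: the namenode holds only metadata and learns block locations from datanode reports. `ToyNamenode` and all of its methods are hypothetical names for this sketch; it is not how HDFS is implemented.

```python
from collections import defaultdict


class ToyNamenode:
    """Keeps only metadata: file -> block ids, and block id -> datanodes."""

    def __init__(self):
        self.file_blocks = {}                     # path -> ordered list of block ids
        self.block_locations = defaultdict(set)   # block id -> datanode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """A datanode reports every block it currently holds."""
        # Forget what this datanode reported before, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """What a client asks the namenode: where are the blocks of this file?"""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.add_file("/barcamp/talk.txt", ["blk_1", "blk_2"])
    nn.block_report("dn1", ["blk_1", "blk_2"])
    nn.block_report("dn2", ["blk_1"])
    print(nn.locate("/barcamp/talk.txt"))
```

A client would then read each block directly from one of the listed datanodes, which is why it talks to both kinds of node.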
  11. Data Flow
  12. Data Flow
  13. Accessibility
     • FileSystem Java API
       – org.apache.hadoop.fs.*
     • Web Interface
     • Commands for HDFS users
       $ hadoop dfs -mkdir /barcamp
       $ hadoop dfs -ls /barcamp
     • Commands for HDFS admins
       $ hadoop dfsadmin -report
       $ hadoop dfsadmin -refreshNodes
  14. Programming Model
  15. Programming Model
     • Data is a stream of keys and values.
     • Map
       – Input: <key1, value1> pairs from the data source
       – Output: intermediate <key2, value2> pairs
     • Reduce
       – Called once per key, in sorted key order
       – Input: <key2, list of value2>
       – Output: <key3, value3> pairs
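The model on this slide fits in a small in-memory function. This is a single-process sketch for illustration only; `run_mapreduce` and the example mapper and reducer are hypothetical names, and nothing here is distributed.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """records: iterable of (key1, value1); returns a list of (key3, value3)."""
    # Map phase: each input pair may emit any number of intermediate pairs.
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)
    # Reduce phase: one call per intermediate key, in sorted key order.
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


# Example job: the maximum value seen for each key.
def max_mapper(key, value):
    yield key, value


def max_reducer(key, values):
    yield key, max(values)


if __name__ == "__main__":
    print(run_mapreduce([("a", 3), ("b", 1), ("a", 7)], max_mapper, max_reducer))
```

The grouping step between the two phases is what guarantees that a reducer sees every value for its key in one call.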
  16. Data Flow
  17. WordCount Example
     Input files
       File01: Hello Barcamp Hello Everyone
       File02: Hello Hadoop Hello Everyone
     Map input
       File01: <_, Hello Barcamp Hello Everyone>
       File02: <_, Hello Hadoop Hello Everyone>
     Map output
       File01: <Hello, 2> <Barcamp, 1> <Everyone, 1>
       File02: <Hello, 2> <Hadoop, 1> <Everyone, 1>
     Reduce input
       <Barcamp, [1]> <Hadoop, [1]> <Hello, [2,2]> <Everyone, [1,1]>
     Reduce output
       <Barcamp, 1> <Hadoop, 1> <Hello, 4> <Everyone, 2>
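The WordCount numbers on this slide can be reproduced with a short Python sketch. The slide's map output already shows <Hello, 2> per file, so the sketch assumes counts are summed locally per file before the shuffle (a combiner, in Hadoop terms). All function names are hypothetical.

```python
from collections import Counter, defaultdict


def map_with_combiner(line):
    """Map one line to <word, count> pairs, pre-summed within the file."""
    return sorted(Counter(line.split()).items())


def shuffle(map_outputs):
    """Group the partial counts of every word across all map outputs."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """Sum the partial counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}
map_outputs = [map_with_combiner(text) for text in files.values()]
grouped = shuffle(map_outputs)      # e.g. grouped["Hello"] is [2, 2]
totals = reduce_counts(grouped)     # e.g. totals["Hello"] is 4
```

Without the local summing step, the reducer for `Hello` would receive `[1, 1, 1, 1]` instead of `[2, 2]`; the final totals would be the same.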
  18. MapReduce in Hadoop
     • JobTracker (master)
       – handles all jobs
       – schedules tasks on the slaves
       – monitors tasks and re-executes failed ones
     • TaskTrackers (slaves)
       – execute the tasks
     • Task
       – runs an individual map or reduce
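The master/slave split above can be sketched as a toy scheduling loop. Round-robin assignment and the limit of three attempts are assumptions made for this sketch, `run_job` is a hypothetical name, and the ordering of map tasks before reduce tasks is not modelled.

```python
from collections import deque


def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign tasks round-robin to trackers and re-queue a task when it fails.

    run_task(tracker, task) is supplied by the caller and returns True on success.
    Returns {task: tracker that completed it}.
    """
    pending = deque((task, 1) for task in tasks)
    completed = {}
    turn = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[turn % len(trackers)]
        turn += 1
        if run_task(tracker, task):
            completed[task] = tracker
        elif attempt < max_attempts:
            # Re-execute later; the next turn usually lands on another tracker.
            pending.append((task, attempt + 1))
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return completed


if __name__ == "__main__":
    # Pretend tracker tt1 always fails on task map-0.
    def flaky(tracker, task):
        return not (tracker == "tt1" and task == "map-0")

    print(run_job(["map-0", "map-1", "reduce-0"], ["tt1", "tt2"], flaky))
```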
  19. MapReduce in Hadoop
  20. Introduction
     • Nov 2006: Google released the paper “Bigtable: A Distributed Storage System for Structured Data”.
     • Bigtable: a distributed, column-oriented store built on top of the Google File System.
     • HBase: an open source implementation of Bigtable, built on top of HDFS.
  21. Data Model
     • Data are stored in tables of rows and columns.
     • Cells are “versioned”.
       → Data are addressed by a row/column/version key.
     • Table rows are sorted by row key, the table's primary key.
     • Columns are grouped into column families.
       → A column name has the form “<family>:<label>”.
     • Tables are stored in regions.
     • Region: a row range [start-key : end-key)
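The addressing scheme on this slide (row, "family:label" column, version) can be modelled with nested dictionaries. `ToyTable` is a hypothetical in-memory sketch, not the HBase client API; using integer timestamps as versions and returning the newest version at or before a requested timestamp are assumptions of the sketch.

```python
class ToyTable:
    """Cells addressed by (row key, "family:label", version timestamp)."""

    def __init__(self, families):
        self.families = set(families)
        self.cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self.cells.get((row, column), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self):
        """Return the row keys in sorted order, mirroring sorted row storage."""
        return sorted({row for row, _ in self.cells})


if __name__ == "__main__":
    table = ToyTable(["info"])
    table.put("row2", "info:title", 1, "old")
    table.put("row2", "info:title", 5, "new")
    table.put("row1", "info:title", 3, "x")
    print(table.get("row2", "info:title"), table.get("row2", "info:title", 4))
    print(table.scan())
```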
  22. Data Model
  23. Architecture
  24. Architecture
     • Master Server
       – assigns regions to regionservers
       – monitors the health of regionservers
       – handles administrative functions
     • RegionServers
       – contain regions and handle client read/write requests
     • Catalog Tables (ROOT and META)
       – maintain the current list, state, recent history, and location of all regions
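Finding the regionserver for a row key amounts to a search over sorted region start keys. The sketch below is in the spirit of the catalog tables, not their actual layout: `ToyCatalog` is a hypothetical class, and it assumes the first region starts at the empty key and each region ends where the next begins.

```python
import bisect


class ToyCatalog:
    """Maps row keys to regionservers by searching sorted region start keys."""

    def __init__(self, regions):
        # regions: list of (start_key, regionserver). Assumed: the first start
        # key is "" and each region ends where the next one starts.
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]

    def locate(self, row_key):
        """Return (start_key, end_key, server) of the region holding row_key."""
        # Rightmost region whose start key is <= row_key (start is inclusive).
        i = bisect.bisect_right(self.start_keys, row_key) - 1
        start, server = self.regions[i]
        # The last region has no end key; None stands for "open-ended" here.
        end = self.start_keys[i + 1] if i + 1 < len(self.start_keys) else None
        return start, end, server


if __name__ == "__main__":
    catalog = ToyCatalog([("", "rs1"), ("g", "rs2"), ("p", "rs1")])
    print(catalog.locate("apple"), catalog.locate("g"), catalog.locate("zebra"))
```

Because a region's start key is inclusive and its end key exclusive, the key `"g"` belongs to the region that starts at `"g"`, not the one that ends there.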
  25. Accessibility
     • Client API
       – org.apache.hadoop.hbase.client.*
     • HBase Shell
       $ bin/hbase shell
       hbase>
     • Web Interface
  26. Introduction
     • Hive started at Facebook.
     • an open source data warehousing solution built on top of Hadoop
     • for managing and querying structured data
     • Hive QL: a SQL-like query language
       – compiled into map-reduce jobs
     • Uses: log processing, data mining, ...
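To give a feel for what "compiled into map-reduce jobs" means, here is how a simple GROUP BY count could be expressed as a map step and a reduce step. This is only a conceptual sketch, not the plan Hive actually generates; the query, the table name `access_log` and the column `status` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical query (table and column names are invented for this sketch):
#   SELECT status, COUNT(*) FROM access_log GROUP BY status;


def map_row(row):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    yield row["status"], 1


def reduce_group(status, ones):
    """Reduce: COUNT(*) is the number of values collected for the key."""
    return status, sum(ones)


def run_query(rows):
    """Run the map step, group by key, then reduce each group in key order."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return [reduce_group(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    print(run_query([{"status": 200}, {"status": 404}, {"status": 200}]))
```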
  27. Data Model
     • Tables
       – analogous to tables in an RDBMS
       – rows are organized into typed columns
       – all the data of a table is stored in a directory in HDFS
     • Partitions
       – determine the distribution of data within sub-directories of the table directory
     • Buckets
       – based on the hash of a column in the table
       – each bucket is stored as a file in the partition directory
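The table / partition / bucket hierarchy maps onto a directory path, which the following sketch makes concrete. The `column=value` sub-directory naming follows Hive's convention; the table directory, the `bucket_NNNNN` file name, and the use of CRC32 as the hash are assumptions chosen for the sketch (Hive uses its own hash function and file names).

```python
import zlib


def bucket_for(value, num_buckets):
    """Pick a bucket from a hash of the bucketing column's value."""
    # crc32 is a stand-in so the result is stable across runs;
    # it is not the hash function Hive itself uses.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def file_path(table_dir, partition, bucket):
    """Table directory / one sub-directory per partition column / bucket file."""
    parts = [f"{column}={value}" for column, value in partition]
    # The bucket file name is hypothetical.
    return "/".join([table_dir, *parts, f"bucket_{bucket:05d}"])


if __name__ == "__main__":
    bucket = bucket_for("user42", 32)
    print(file_path("/warehouse/page_views", [("ds", "2009-12-13")], bucket))
```

A query that filters on a partition column only needs to read the matching sub-directories, which is the main payoff of this layout.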
  28. Architecture
  29. Architecture
     • Metastore
       – contains metadata about the data stored in Hive
       – stored in any SQL backend or in an embedded Derby database
       – Database: a namespace for tables
       – Table metadata: column types, physical layout, ...
       – Partition metadata
     • Compiler
     • Execution Engine
     • Shell
  30. Hive Query Language
     • Data Definition (DDL) statements
       – CREATE/DROP/ALTER TABLE
       – SHOW TABLES/PARTITIONS
     • Data Manipulation (DML) statements
       – LOAD DATA
       – INSERT
       – SELECT
     • User-defined functions: UDF/UDAF
  31. Hive @ Facebook
  32. The End
     Thank you!