Hadoop was created by Doug Cutting, who also created:
– Lucene: open source indexing and search library.
– Nutch: Lucene-based web crawler.
Jun 2003, a 100-million-page Nutch demo system
ran successfully.
Nutch problem: its architecture could not
scale to billions of pages.
Oct 2003, Google published the paper
“The Google File System”.
In 2004, the Nutch team wrote an open source implementation
of GFS, called the Nutch Distributed File System (NDFS).
Dec 2004, Google published the paper “MapReduce:
Simplified Data Processing on Large Clusters”.
In 2005, the Nutch team implemented MapReduce in Nutch.
Mid 2005, all the major Nutch algorithms had been ported
to run using MapReduce and NDFS.
Feb 2006, Nutch's NDFS and MapReduce
implementations formed the Hadoop project.
Doug Cutting joined Yahoo!.
Jan 2008, Hadoop became an Apache top-level project.
Feb 2008, Yahoo!'s production search index
was generated by a 10,000-core Hadoop cluster.
HDFS stores files as blocks (default size: 64 MB)
Reliability through replication
– Each block is replicated to several datanodes
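The block and replication ideas above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names are made up for illustration, the replication factor of 3 is an assumed setting, and the round-robin placement is a stand-in for HDFS's real placement policy, which also considers racks.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file occupies."""
    full, rest = divmod(file_size, block_size)
    # The last block only holds the remainder; it is not padded to 64 MB.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(200 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print([b // MB for b in blocks])  # [64, 64, 64, 8]
print(placement[3])               # ['dn4', 'dn1', 'dn2']
```

Losing any single datanode in this sketch still leaves two copies of every block, which is the point of replication.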
Namenode & Datanodes
Namenode
– manages the filesystem namespace
– maintains the filesystem tree and metadata for all the
files and directories in the tree.
Datanodes
– store data in the local file system
– periodically report back to the namenode with lists of all
the blocks they are storing
Clients communicate with both namenode and datanodes.
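A minimal sketch of that division of labour, assuming a toy in-memory model: the namenode keeps only metadata, datanodes keep the bytes and send block reports, and the client asks the namenode where blocks live before reading them from datanodes. All class and method names here are hypothetical and do not correspond to the Hadoop API.

```python
class ToyNamenode:
    """Holds metadata only: file -> block ids, block id -> datanodes."""

    def __init__(self):
        self.files = {}            # path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget this datanode's previous report, then record the new one.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, path):
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path]]


class ToyDatanode:
    """Stores block contents and reports which blocks it holds."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def report_to(self, namenode):
        namenode.receive_block_report(self.name, list(self.blocks))


def read_file(namenode, datanodes, path):
    """Client side: metadata from the namenode, bytes from the datanodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        if not holders:
            raise IOError(f"no live replica for block {block_id}")
        data += datanodes[holders[0]].blocks[block_id]
    return data


namenode = ToyNamenode()
dn1, dn2 = ToyDatanode("dn1"), ToyDatanode("dn2")
dn1.store("blk_1", b"hello ")
dn2.store("blk_1", b"hello ")
dn2.store("blk_2", b"world")
namenode.add_file("/demo.txt", ["blk_1", "blk_2"])
for dn in (dn1, dn2):
    dn.report_to(namenode)
print(read_file(namenode, {"dn1": dn1, "dn2": dn2}, "/demo.txt"))  # b'hello world'
```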
Data is a stream of keys and values
Map
– Input: <key1,value1> pairs from data source
– Output: intermediate <key2,value2> pairs
Reduce
– Called once per key, in sorted key order
– Input: <key2, list of value2>
– Output: <key3,value3> pairs
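The key/value flow above can be illustrated with the classic word count, run in a single Python process. This is a sketch of the programming model only, not of Hadoop: `run_mapreduce` is a hypothetical helper that sorts and groups the intermediate pairs in memory, whereas a real cluster shuffles them between machines.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    """<key1, value1> = (line number, line text) -> <word, 1> pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """<key2, list of value2> = (word, [1, 1, ...]) -> <word, count>."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input pair may emit any number of intermediate pairs.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # "Shuffle": sort by key so equal keys are adjacent, then group them.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        # Reduce phase: called once per key, in sorted key order.
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output


lines = ["the quick brown fox", "the lazy dog", "The fox"]
result = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(result)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```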
MapReduce in Hadoop
JobTracker (master)
– handles all jobs.
– schedules tasks on the slaves.
– monitors tasks and re-executes failed ones.
TaskTrackers (slaves)
– execute the tasks.
Task
– runs an individual map or reduce.
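In Hadoop's original MapReduce runtime the master is called the JobTracker and the slaves are TaskTrackers. The scheduling and re-execution behaviour can be sketched as below, under strong simplifying assumptions: tasks are handed out round-robin and retried on the next tracker after a failure. The function names and the retry limit are hypothetical; the real JobTracker is driven by heartbeats and prefers trackers close to the data.

```python
def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign tasks round-robin; re-execute a failed task on the next tracker."""
    results = {}
    attempts_log = []  # (task, tracker, attempt number), kept for monitoring
    slot = 0
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            tracker = trackers[slot % len(trackers)]
            slot += 1
            attempts_log.append((task, tracker, attempt))
            try:
                results[task] = run_task(task, tracker)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # try again on the next tracker
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return results, attempts_log


def flaky_run_task(task, tracker):
    """Stand-in for real task execution: tracker 'tt2' always fails."""
    if tracker == "tt2":
        raise RuntimeError("tasktracker tt2 is down")
    return f"{task} done on {tracker}"


results, log = run_job(["map-0", "map-1", "reduce-0"],
                       ["tt1", "tt2", "tt3"], flaky_run_task)
print(results["map-1"])  # map-1 done on tt3
print(len(log))          # 4
```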
Nov 2006, Google released the paper “Bigtable: A
Distributed Storage System for Structured Data”
Bigtable: distributed, column-oriented store, built on top of
the Google File System.
HBase: open source implementation of Bigtable, built on
top of HDFS.
Data are stored in tables of rows and columns.
Cells are “versioned”
→ Data are addressed by row/column/version key.
Table rows are sorted by row key, the table's primary key.
Columns are grouped into column families.
→ A column name has the form “<family>:<label>”
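The data model above maps naturally onto nested dictionaries. The following is a toy in-memory sketch, not the HBase client API: `ToyTable` and its methods are hypothetical, and integer timestamps stand in for cell versions.

```python
class ToyTable:
    """Cells addressed by (row key, "<family>:<label>", version)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, timestamp):
        family, _, _label = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions[timestamp]

    def scan(self):
        # Rows come back sorted by row key, the table's primary key.
        for row in sorted(self.rows):
            yield row, self.rows[row]


table = ToyTable(["info"])
table.put("row2", "info:name", "b", timestamp=1)
table.put("row1", "info:name", "a", timestamp=1)
table.put("row1", "info:name", "a2", timestamp=2)
print(table.get("row1", "info:name"))               # a2
print(table.get("row1", "info:name", timestamp=1))  # a
print([row for row, _ in table.scan()])             # ['row1', 'row2']
```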
Tables are stored in regions.
Region: a row range [start-key : end-key)
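Because rows are sorted and each region covers a half-open range [start-key : end-key), finding the region for a row key is a binary search over the region boundaries. The sketch below assumes a hypothetical `RegionMap` class; the empty string for "start of table" and `None` for "end of table" are conventions chosen for this example.

```python
import bisect


class RegionMap:
    """Maps row keys to regions defined by sorted split keys."""

    def __init__(self, split_keys, servers):
        # N split keys (given in sorted order) create N + 1 regions.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (start_key, end_key, server) of the region holding row_key."""
        # bisect_right puts a key equal to a split key into the region that
        # starts at that key, matching the inclusive start of [start : end).
        idx = bisect.bisect_right(self.split_keys, row_key)
        start = self.split_keys[idx - 1] if idx > 0 else ""
        end = self.split_keys[idx] if idx < len(self.split_keys) else None
        return start, end, self.servers[idx]


regions = RegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("apple"))  # ('', 'g', 'rs1')
print(regions.locate("g"))      # ('g', 'p', 'rs2')
print(regions.locate("zebra"))  # ('p', None, 'rs3')
```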
Master
– assigns regions to regionservers
– monitors the health of regionservers
– handles administrative functions
Regionservers
– contain regions and handle client read/write requests
Catalog Tables (ROOT and META)
– maintain the current list, state, recent history, and
location of all regions.
$ bin/hbase shell
Hive: started at Facebook
an open source data warehousing solution
built on top of Hadoop
for managing and querying structured data
Hive QL: SQL-like query language
– compiled into map-reduce jobs
Used for log processing, data mining, ...
Tables
– analogous to tables in RDBMS
– rows are organized into typed columns
– all the data is stored in a directory in HDFS
Partitions
– determine the distribution of data within sub-directories
of the table directory
Buckets
– based on the hash of a column in the table
– each bucket is stored as a file in the partition directory
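The directory layout described above can be sketched as a function that maps a row to the file it would land in. This only illustrates the idea: the helper names, the warehouse path, and the bucket file naming are hypothetical, and `zlib.crc32` is a stand-in for Hive's own hash function.

```python
import zlib
from posixpath import join


def bucket_for(value, num_buckets):
    """Pick a bucket from a stable hash of the bucketing column's value."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def data_path(warehouse, table, partition, bucket_value, num_buckets):
    """Table directory / one sub-directory per partition column / bucket file."""
    part_dirs = [f"{col}={val}" for col, val in partition]
    bucket = bucket_for(bucket_value, num_buckets)
    return join(warehouse, table, *part_dirs, f"bucket_{bucket:05d}")


path = data_path(
    "/warehouse",                                  # assumed warehouse root
    "page_views",                                  # hypothetical table name
    [("dt", "2009-01-01"), ("country", "US")],     # partition columns
    bucket_value=42,                               # value of the bucketing column
    num_buckets=32,
)
print(path)
```

A query that filters on `dt` and `country` only needs to read one sub-directory, and rows with the same bucketing value always land in the same file.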
Metastore
– contains metadata about data stored in Hive.
– stored in any SQL backend or an embedded Derby database.
– Database: a namespace for tables
– Table metadata: column types, physical layout,...
– Partition metadata
Hive Query Language
Data Definition (DDL) statements
– CREATE/DROP/ALTER TABLE
– SHOW TABLES/PARTITIONS
Data Manipulation (DML) statements
– LOAD DATA
User-defined functions: UDF/UDAF