The Family of Hadoop
 


Usage Rights

© All Rights Reserved

    The Family of Hadoop: Presentation Transcript

    • The Family of Hadoop
        Nham Xuan Nam, nhamxuannam [at] gmail.com
        http://namnham.blogspot.com
        Barcamp Saigon, December 13, 2009
    • Content
        ◦ History
        ◦ Sub-projects
        ◦ HDFS
        ◦ MapReduce
        ◦ HBase
        ◦ Hive
    • History  created by Doug Cutting, the creator of Lucene.  Lucene: open source index & search library.  Nutch: Lucene-based web crawler.  Jun 2003, there was a successful 100 million page Nutch demo system.  Nutch problem: its architecture could not scale to the billions of pages.
    • History  Oct 2003, Google published the paper “The Google File System”.  In 2004, Nutch team wrote an open source implementation of GFS, called Nutch Distributed File System (NDFS).  Dec 2004, Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.  In 2005, Nutch team implemented MapReduce in Nutch.  Mid 2005, all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
    • History  Feb 2006, Nutch's NDFS and the MapReduce implementation formed Hadoop project.  Doug Cutting joined Yahoo!.  Jan 2008, Hadoop became Apache top-level project.  Feb 2008, Yahoo! production search index was generated by a 10,000-core Hadoop cluster.
    • History Source: http://wiki.apache.org/hadoop/PoweredBy
    • Sub-projects
    • Architecture
    • Data Model  File stored as blocks (default size: 64M)  Reliability through replication – Each block is replicated to several datanodes
    • Namenode & Datanodes
        ◦ Namenode (master)
            – manages the filesystem namespace
            – maintains the filesystem tree and the metadata for all files and directories in the tree
        ◦ Datanodes (slaves)
            – store data in the local file system
            – periodically report back to the namenode with lists of all the blocks they hold
        ◦ Clients communicate with both the namenode and the datanodes.
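To make the division of labour concrete, here is a toy model of a namenode that learns block locations from datanode block reports. The class, method names, paths, and block ids are hypothetical and chosen for the example; the real namenode is far more involved.

```python
from collections import defaultdict


class Namenode:
    """Toy namenode: keeps the namespace and learns block locations from reports."""

    def __init__(self):
        self.namespace = {}                      # file path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace everything previously known about `datanode` with its new report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return, per block of the file, the sorted datanodes a client can read from."""
        return [sorted(self.block_locations[b]) for b in self.namespace[path]]


nn = Namenode()
nn.add_file("/barcamp/slides.pdf", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_block_report("dn1", ["blk_2"])  # dn1 no longer reports blk_1
locations = nn.locate("/barcamp/slides.pdf")
```

The client asks the namenode only for locations, then reads the block data directly from the datanodes it was given.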
    • Data Flow
    • Data Flow
    • Accessibility  FileSystem Java API – org.apache.hadoop.fs.*  Web Interface  Commands for HDFS users $ hadoop dfs ­mkdir /barcamp $ hadoop dfs ­ls /barcamp  Commands for HDFS admins $ hadoop dfsadmin ­report $ hadoop dfsadmin ­refreshNodes
    • Programming Model
    • Programming Model  Data is a stream of keys and values  Map – Input: <key1,value1> pairs from data source – Output: immediate <key2,value2> pairs  Reduce – Called once per a key, in sorted order  Input: <key2, list of value2>  Output: <key3,value3> pairs
    • Data Flow
    • WordCount Example
        ◦ Input files
            – File01: Hello Barcamp Hello Everyone
            – File02: Hello Hadoop Hello Everyone
        ◦ Map input
            – <_, Hello Barcamp Hello Everyone>
            – <_, Hello Hadoop Hello Everyone>
        ◦ Map output (counts already summed per file)
            – File01: <Hello, 2> <Barcamp, 1> <Everyone, 1>
            – File02: <Hello, 2> <Hadoop, 1> <Everyone, 1>
        ◦ Reduce input
            – <Barcamp, [1]> <Hadoop, [1]> <Hello, [2,2]> <Everyone, [1,1]>
        ◦ Reduce output
            – <Barcamp, 1> <Hadoop, 1> <Hello, 4> <Everyone, 2>
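The WordCount numbers on this slide can be reproduced with a short in-memory sketch. It assumes that counts are pre-summed per input file before the shuffle (which is what a map output of <Hello, 2> implies); the function names are made up for the example and are not Hadoop classes.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_with_combiner(text):
    """Emit <word, count> pairs for one input, with counts pre-summed locally."""
    return list(Counter(text.split()).items())


def shuffle(map_outputs):
    """Group the values of all map outputs by word."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """Sum the list of partial counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


map_outputs = [map_with_combiner(text) for text in files.values()]
grouped = shuffle(map_outputs)
totals = reduce_counts(grouped)
```

Summing locally before the shuffle means each file sends one pair per distinct word instead of one pair per occurrence.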
    • MapReduce in Hadoop  JobTracker (master) – handling all jobs. – scheduling tasks on the slaves. – monitoring & re-executing tasks.  TaskTrackers (slaves) – execute the tasks.  Task – run an individual map or reduce.
    • MapReduce in Hadoop
    • Introduction  Nov 2006, Google released the paper “Bigtable: A Distributed Storage System for Structured Data”  BigTable: distributed, column-oriented store, built on top of Google File System.  HBase: open source implementation of BigTable, built on top of HDFS.
    • Data Model  Data are stored in tables of rows and columns.  Cells are ”versioned” → Data are addressed by row/column/version key.  Table rows are sorted by row key, the table's primary key.  Columns are grouped into column families. → A column name has the form “<family>:<label>”  Tables are stored in regions.  Region: a row range [start-key : end-key)
    • Data Model
    • Architecture
    • Architecture  Master Server – assigns regions to regionservers – monitors the health of regionservers – handles administrative funtions  RegionServers – contain regions and handle client read/write requests  Catalog Tables (ROOT and META) – maintain the current list, state, recent history, and location of all regions.
    • Accessibility  Client API org.apache.hadoop.hbase .client.*  HBase Shell $ bin/hbase shell hbase>   Web Interface
    • Introduction  started at Facebook  an open source data warehousing solution built on top of Hadoop  for managing and querying structured data  Hive QL: SQL-like query language – compiled into map-reduce jobs  log processing, data mining,...
    • Data Model  Tables – analogous to tables in RDBMS – rows are organized into typed columns – all the data is stored in a directory in HDFS  Partitions – determine the distribution of data within sub-directories of the table directory  Buckets – based on the hash of a column in the table – Each bucket is stored as a file in the partition directory
    • Architecture
    • Architecture  Metastore – contains metadata about data stored in Hive. – stored in any SQL backend or an embedded Derby. – Database: a namespace for tables – Table metadata: column types, physical layout,... – Partition metadata  Compiler  Excution Engine  Shell
    • Hive Query Language
        ◦ Data Definition (DDL) statements
            – CREATE/DROP/ALTER TABLE
            – SHOW TABLES/PARTITIONS
        ◦ Data Manipulation (DML) statements
            – LOAD DATA
            – INSERT
            – SELECT
        ◦ User-defined functions: UDF/UDAF
    • Hive @ Facebook
    • The End Thank you!