The Family of Hadoop

Transcript

  • 1. The Family of Hadoop
      - Nham Xuan Nam
      - nhamxuannam [at] gmail.com
      - http://namnham.blogspot.com
      - Barcamp Saigon, December 13, 2009
  • 2. Content
      - History
      - Sub-projects
      - HDFS
      - Map Reduce
      - HBase
      - Hive
  • 3. History
      - Created by Doug Cutting, the creator of Lucene.
      - Lucene: open source index & search library.
      - Nutch: Lucene-based web crawler.
      - Jun 2003: a 100 million page Nutch demo system ran successfully.
      - Nutch problem: its architecture could not scale to billions of pages.
  • 4. History
      - Oct 2003: Google published the paper “The Google File System”.
      - 2004: the Nutch team wrote an open source implementation of GFS, called the Nutch Distributed File System (NDFS).
      - Dec 2004: Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.
      - 2005: the Nutch team implemented MapReduce in Nutch.
      - Mid 2005: all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
  • 5. History
      - Feb 2006: Nutch's NDFS and MapReduce implementation were moved into the new Hadoop project.
      - Doug Cutting joined Yahoo!.
      - Jan 2008: Hadoop became an Apache top-level project.
      - Feb 2008: the Yahoo! production search index was generated by a 10,000-core Hadoop cluster.
  • 6. History Source: http://wiki.apache.org/hadoop/PoweredBy
  • 7. Sub-projects
  • 8. Architecture
  • 9. Data Model
      - Files are stored as blocks (default size: 64 MB)
      - Reliability through replication
          - Each block is replicated to several datanodes
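The block arithmetic on this slide can be sketched in a few lines. This is only an illustration: the replication factor of 3 and the round-robin placement are assumptions (the slide says only "several datanodes"), and the function names and datanode names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated on the slide
REPLICATION = 3                # assumed replication factor; the slide only says "several"


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes (toy round-robin policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
print(len(blocks), [size // (1024 * 1024) for size in blocks])
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

A 200 MB file ends up as three full 64 MB blocks plus one 8 MB block, and each block is listed on three different datanodes.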
  • 10. Namenode & Datanodes
      - Namenode (master)
          - manages the filesystem namespace
          - maintains the filesystem tree and metadata for all the files and directories in the tree
      - Datanodes (slaves)
          - store data in the local file system
          - periodically report back to the namenode with lists of all existing blocks
      - Clients communicate with both the namenode and the datanodes.
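To make the division of labour concrete, here is a toy model of a namenode that learns block locations from datanode block reports. It is a sketch of the idea only, not Hadoop's implementation; the class, method names, block ids, and paths are all hypothetical.

```python
from collections import defaultdict


class ToyNamenode:
    """Keeps namespace metadata in memory; it never stores file contents."""

    def __init__(self):
        self.file_to_blocks = {}                 # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        self.file_to_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace what is known about one datanode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted datanodes) pairs a client would read from."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_to_blocks[path]]


nn = ToyNamenode()
nn.add_file("/barcamp/slides.txt", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
print(nn.locate("/barcamp/slides.txt"))
```

The client asks the namenode where the blocks live and then reads the data from the datanodes directly, which is why it talks to both.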
  • 11. Data Flow
  • 12. Data Flow
  • 13. Accessibility
      - FileSystem Java API
          - org.apache.hadoop.fs.*
      - Web Interface
      - Commands for HDFS users
          $ hadoop dfs -mkdir /barcamp
          $ hadoop dfs -ls /barcamp
      - Commands for HDFS admins
          $ hadoop dfsadmin -report
          $ hadoop dfsadmin -refreshNodes
  • 14. Programming Model
  • 15. Programming Model
      - Data is a stream of keys and values
      - Map
          - Input: <key1,value1> pairs from the data source
          - Output: intermediate <key2,value2> pairs
      - Reduce
          - Called once per key, in sorted key order
          - Input: <key2, list of value2>
          - Output: <key3,value3> pairs
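The model above fits in a short single-process sketch. It is not how Hadoop executes jobs (there is no distribution or fault tolerance here); `map_reduce` and the sensor example are hypothetical names chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Run mapper over (key1, value1) records, group by key2, then reduce."""
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))  # mapper yields (key2, value2) pairs

    intermediate.sort(key=itemgetter(0))  # reducers see keys in sorted order

    output = []
    for key2, group in groupby(intermediate, key=itemgetter(0)):
        values = [value2 for _, value2 in group]
        output.extend(reducer(key2, values))  # called once per key
    return output


# Hypothetical job: highest reading per sensor.
def max_mapper(_line_no, line):
    sensor, reading = line.split(",")
    yield sensor, int(reading)


def max_reducer(sensor, readings):
    yield sensor, max(readings)


records = enumerate(["b,7", "a,3", "b,2", "a,9"])
print(map_reduce(records, max_mapper, max_reducer))  # [('a', 9), ('b', 7)]
```

Only the mapper and reducer are job-specific; the sort-and-group step in the middle is the same for every job.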
  • 16. Data Flow
  • 17. WordCount Example
      - Input files
          - File01: Hello Barcamp Hello Everyone
          - File02: Hello Hadoop Hello Everyone
      - Map input
          - <_, Hello Barcamp Hello Everyone>
          - <_, Hello Hadoop Hello Everyone>
      - Map output
          - File01: <Hello, 2> <Barcamp, 1> <Everyone, 1>
          - File02: <Hello, 2> <Hadoop, 1> <Everyone, 1>
      - Reduce input
          - <Barcamp, [1]> <Hadoop, [1]> <Hello, [2,2]> <Everyone, [1,1]>
      - Reduce output
          - <Barcamp, 1> <Hadoop, 1> <Hello, 4> <Everyone, 2>
  • 18. MapReduce in Hadoop
      - JobTracker (master)
          - handles all jobs
          - schedules tasks on the slaves
          - monitors and re-executes tasks
      - TaskTrackers (slaves)
          - execute the tasks
      - Task
          - runs an individual map or reduce
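The numbers on this slide can be reproduced with a small in-memory sketch. Because the slide shows `<Hello, 2>` coming out of each map, the sketch assumes counts are combined per input file before the reduce step; the function names are hypothetical and nothing here uses the Hadoop API.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_with_combiner(text):
    """Emit one (word, count) pair per distinct word in a single input file."""
    return list(Counter(text.split()).items())


def shuffle(map_outputs):
    """Group the values of all map outputs by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """Sum the per-file counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


map_outputs = [map_with_combiner(text) for text in files.values()]
grouped = shuffle(map_outputs)
totals = reduce_counts(grouped)
print(grouped)
print(totals)
```

The grouped values match the slide's reduce input (`Hello` maps to `[2, 2]`) and the totals match its final output (`Hello` is 4, `Everyone` is 2).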
  • 19. MapReduce in Hadoop
  • 20. Introduction
      - Nov 2006: Google released the paper “Bigtable: A Distributed Storage System for Structured Data”
      - BigTable: distributed, column-oriented store, built on top of the Google File System.
      - HBase: open source implementation of BigTable, built on top of HDFS.
  • 21. Data Model
      - Data are stored in tables of rows and columns.
      - Cells are “versioned”
          → Data are addressed by row/column/version key.
      - Table rows are sorted by row key, the table's primary key.
      - Columns are grouped into column families.
          → A column name has the form “<family>:<label>”
      - Tables are stored in regions.
      - Region: a row range [start-key : end-key)
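A toy in-memory table makes the row/column/version addressing easier to see. This is a sketch of the data model only, assuming integer timestamps as version keys; `ToyTable` and its methods are hypothetical and unrelated to the real HBase client API.

```python
import time


class ToyTable:
    """Cells are addressed by (row key, "family:label", version timestamp)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, timestamp=None):
        family, _, _label = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if timestamp is None else timestamp
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self.rows[row][column]
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            raise KeyError("no version at or before the requested timestamp")
        return versions[max(candidates)]

    def scan(self, start_key, end_key):
        """Yield rows in sorted key order within [start_key, end_key)."""
        for row in sorted(self.rows):
            if start_key <= row < end_key:
                yield row, self.rows[row]


table = ToyTable(["info"])
table.put("row2", "info:city", "Saigon", timestamp=1)
table.put("row1", "info:city", "Hanoi", timestamp=1)
table.put("row1", "info:city", "Hue", timestamp=2)
print(table.get("row1", "info:city"))               # Hue
print(table.get("row1", "info:city", timestamp=1))  # Hanoi
print([row for row, _ in table.scan("row1", "row2")])  # ['row1']
```

The scan uses the same half-open range as a region, so the end key itself is excluded.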
  • 22. Data Model
  • 23. Architecture
  • 24. Architecture
      - Master Server
          - assigns regions to regionservers
          - monitors the health of regionservers
          - handles administrative functions
      - RegionServers
          - contain regions and handle client read/write requests
      - Catalog Tables (ROOT and META)
          - maintain the current list, state, recent history, and location of all regions
  • 25. Accessibility
      - Client API
          - org.apache.hadoop.hbase.client.*
      - HBase Shell
          $ bin/hbase shell
          hbase>
      - Web Interface
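Finding the regionserver for a row key amounts to a search over sorted region start keys. The sketch below assumes the catalog can be flattened into a sorted list; the catalog contents, server names, and `locate_region` are hypothetical, and the real ROOT/META lookup is more involved.

```python
from bisect import bisect_right

# Hypothetical catalog: (start key, regionserver), sorted by start key.
# The first region starts at "" and each region ends where the next one begins.
CATALOG = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]


def locate_region(row_key, catalog=CATALOG):
    """Return (start_key, end_key, server) of the region containing row_key."""
    start_keys = [start for start, _ in catalog]
    index = bisect_right(start_keys, row_key) - 1  # last region starting at or before row_key
    start, server = catalog[index]
    end = catalog[index + 1][0] if index + 1 < len(catalog) else None  # None = open-ended
    return start, end, server


print(locate_region("hadoop"))  # ('g', 'p', 'regionserver-2')
print(locate_region("g"))       # ('g', 'p', 'regionserver-2'), start key is inclusive
print(locate_region("zebra"))   # ('p', None, 'regionserver-3')
```

Because rows are kept sorted by key, one binary search is enough to route a read or write to the right region.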
  • 26. Introduction
      - Started at Facebook
      - An open source data warehousing solution built on top of Hadoop
      - For managing and querying structured data
      - Hive QL: SQL-like query language
          - compiled into map-reduce jobs
      - Uses: log processing, data mining, ...
  • 27. Data Model
      - Tables
          - analogous to tables in an RDBMS
          - rows are organized into typed columns
          - all the data is stored in a directory in HDFS
      - Partitions
          - determine the distribution of data within sub-directories of the table directory
      - Buckets
          - based on the hash of a column in the table
          - each bucket is stored as a file in the partition directory
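The table, partition, and bucket layout maps naturally onto a path. The sketch below is illustrative only: the warehouse root, the `col=value` directory naming, the `bucket_NNNNN` file name, and the use of CRC32 as the hash are all assumptions rather than Hive's actual conventions, and the table and column names are hypothetical.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse root, for illustration only


def bucket_for(value, num_buckets):
    """Pick a bucket from a hash of the bucketing column (CRC32 stands in for Hive's hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def storage_path(table, partition, bucket_value, num_buckets):
    """Build table directory / partition sub-directories / bucket file for one row."""
    partition_dirs = "/".join(f"{col}={val}" for col, val in partition.items())
    bucket_file = f"bucket_{bucket_for(bucket_value, num_buckets):05d}"  # hypothetical file name
    return f"{WAREHOUSE}/{table}/{partition_dirs}/{bucket_file}"


path = storage_path("page_views", {"dt": "2009-12-13", "country": "VN"}, "user42", 32)
print(path)
```

A query that filters on a partition column only needs to read the matching sub-directories, and rows with the same bucketing value always land in the same file.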
  • 28. Architecture
  • 29. Architecture
      - Metastore
          - contains metadata about data stored in Hive
          - stored in any SQL backend or an embedded Derby
          - Database: a namespace for tables
          - Table metadata: column types, physical layout, ...
          - Partition metadata
      - Compiler
      - Execution Engine
      - Shell
  • 30. Hive Query Language
      - Data Definition (DDL) statements
          - CREATE/DROP/ALTER TABLE
          - SHOW TABLES/PARTITIONS
      - Data Manipulation (DML) statements
          - LOAD DATA
          - INSERT
          - SELECT
      - User-defined functions: UDF/UDAF
  • 31. Hive @ Facebook
  • 32. The End Thank you!