Hadoop Introduction
   Background && Installation && Hello world && related
Outline

•   Background
•   Hello world
•   Installation
•   Related




12/20/12           2
Background
• Why Hadoop?
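   • Accessible: runs on commodity clusters or cloud services such as AWS
   • Robust: handles most hardware failures gracefully
   • Scalable: scales linearly as nodes are added
   • Simple: 1 == 10,000 (the same code runs on 1 node or 10,000)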
• Key Points:
   • Scale-out
   • Moving code to data

Background: History
• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
   • Yahoo (10,000)
   • Facebook (Hive, HBase, …)
   • Hulu (HBase)
   • Baidu (3,000 TB, one week)
   • Twitter (tweet data)


Background
• Comparing SQL databases and Hadoop (Hadoop's approach on the left)
   • Structure:
      • SQL (structured data, fixed schema)
      • Hadoop (key-value pairs; handles unstructured data such as text and pictures)
   • Scale-out <- scale-up
   • Key-value pairs <- relational tables
   • Functional programming (MapReduce) <- declarative queries (SQL)
   • Offline batch processing (write once, read many times)
     <- online transactions
Background – Understanding
• Word Count
     • As file size grows, an in-memory word table runs out of memory
     • Disk-based hash table (more complex)
     • Distributed:
         • Phase 1: Each machine processes its part of the input
         • Phase 2: Merge the results
            • Shuffle the partitions to the appropriate machines (e.g., split alphabetically)

     • At this point, we have already built a minimal Hadoop.
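
The two phases above can be sketched in a few lines of Python. This is only an illustration that simulates the machines inside one process; the function names (`count_words`, `partition_for`, `shuffle`, `merge`, `word_count`) and the first-letter partitioning rule are assumptions made for this sketch, not Hadoop APIs.

```python
from collections import Counter


def count_words(chunk):
    """Phase 1: each machine counts the words in its own part of the input."""
    return Counter(chunk.lower().split())


def partition_for(word, num_machines):
    """Pick the machine that owns a word, here by its first letter.

    Assumption: a simple alphabetical split for illustration only.
    """
    return ord(word[0]) % num_machines


def shuffle(partial_counts, num_machines):
    """Send every (word, count) pair to the machine that owns that word."""
    inboxes = [[] for _ in range(num_machines)]
    for counts in partial_counts:
        for word, n in counts.items():
            inboxes[partition_for(word, num_machines)].append((word, n))
    return inboxes


def merge(inbox):
    """Phase 2: one machine adds up the partial counts it received."""
    total = Counter()
    for word, n in inbox:
        total[word] += n
    return total


def word_count(chunks, num_machines=2):
    """Run both phases over a list of text chunks (one chunk per machine)."""
    partial = [count_words(chunk) for chunk in chunks]
    merged = [merge(inbox) for inbox in shuffle(partial, num_machines)]
    result = {}
    for counts in merged:
        result.update(counts)  # each word lives on exactly one machine
    return result
```

Because every word is routed to exactly one machine, the merged results never overlap, so combining them at the end is a plain dictionary update.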



Hello World: Word Count
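• Two Phases:
     • Mapping: take the input data and load it into the mappers
     • Reducing: process all output from the mappers and produce the final result

•   1.1    Map input: list(filename, file content)
•   1.2    Map output: list(word, 1)
•   2.1    Reduce input: list(word, list(1))
•   2.2    Reduce output: list(word, count)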



Hello World
• mapper.py
• reducer.py
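
The slide names the two scripts but does not show their contents. A minimal sketch of what such a mapper and reducer might look like for word count follows; it assumes the Hadoop Streaming convention of tab-separated `key<TAB>value` lines, and the function names `map_lines` and `reduce_lines` are hypothetical. In a real Streaming job each function would live in its own script that reads `sys.stdin` and prints its output lines.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts for each run of identical words.

    Assumes the input is already sorted by key, which is what the
    framework does between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

The pipeline can be tried locally without a cluster by sorting the mapper output before handing it to the reducer, for example `list(reduce_lines(sorted(map_lines(["hello world", "hello hadoop"]))))`.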




Installation
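• Modes:
   • Standalone (local) mode (default)
   • Pseudo-distributed mode (recommended for development and debugging)
   • Fully distributed mode
• Configuration:
   • Basic configuration
   • SSH configuration
   • Ubuntu configuration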

Hadoop Framework
• HDFS:
   • NameNode: tracking, coordination, and bookkeeping
   • DataNode: low-level I/O operations
   • Secondary NameNode
• MapReduce:
   • JobTracker
   • TaskTracker


Related
• Programming:
   • Java
   • Python
      • Jython (compiles Python to Java bytecode)
      • Hadoop Streaming (communicates via stdin and stdout)
      • Dumbo
      • Happy


Related
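•   Pig: high-level data-flow language
•   Hive: SQL-style data warehouse
•   HBase: column-oriented database modeled on Google Bigtable
•   ZooKeeper: coordination system for shared state
•   Chukwa: data collection system
•   Mahout: data mining and machine learning
•   Hama: matrix computation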


Resource
• Books:
   • Hadoop in Action
   • Hadoop 实战 (2nd edition)
• Video && Google Course
• URL:
   • Resource bookmarks




Thanks




