Hadoop introduction
Presentation Transcript

  • Hadoop Introduction: Background && Installation && Hello World && Related
  • Outline
    • Background
    • Hello world
    • Installation
    • Related
    12/20/12
  • Background
    • Why Hadoop?
      • Accessible: runs on cloud services such as AWS
      • Robust: handles most hardware failures
      • Scalable: scales linearly
      • Simple: 1 == 1w (w = 10,000)
    • Key points:
      • Scale-out
      • Moving code to data
  • Background: History
    • Apache top-level project: Doug Cutting
    • Lucene -> Nutch -> Hadoop (2004)
      • Yahoo (1w)
      • Facebook (Hive, HBase, …)
      • Hulu (HBase)
      • Baidu (3000 TB, one week)
      • Twitter (tweet data)
  • Background
    • Comparing SQL databases and Hadoop
      • Structure:
        • SQL: structured data with a fixed schema
        • Hadoop: key-value pairs, suited to less structured data such as text and pictures
      • Scale-out instead of scale-up
      • Key-value pairs instead of relational tables
      • Functional programming instead of declarative queries
      • Offline batch processing instead of online transactions (write once, read many times)
  • Background: Understanding
    • Word count
      • As file size grows, an in-memory word table runs out of memory
      • A disk-based hash table works but is more complex
    • Distributed:
      • Phase 1: process each part of the input separately
      • Phase 2: merge the partial results
      • Shuffle the partitions to the appropriate machines (for example, by alphabet)
    • At this point we have already built a minimal Hadoop.
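The two phases and the shuffle step on this slide can be sketched in a single process. This is only an illustration, not Hadoop code: the function names (`count_partition`, `shuffle`, `merge`, `word_count`) and the rule that routes a word by its first character are assumptions made for the example.

```python
from collections import Counter


def count_partition(lines):
    """Phase 1: count the words in one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def shuffle(partial_counts, num_reducers):
    """Send every (word, count) pair to one bucket per word.

    Assumed routing rule: the first character of the word picks the bucket,
    so all partial counts for the same word land on the same "machine".
    """
    buckets = [[] for _ in range(num_reducers)]
    for counts in partial_counts:
        for word, n in counts.items():
            buckets[ord(word[0]) % num_reducers].append((word, n))
    return buckets


def merge(bucket):
    """Phase 2: sum the partial counts that reached one bucket."""
    total = Counter()
    for word, n in bucket:
        total[word] += n
    return total


def word_count(partitions, num_reducers=2):
    """Run phase 1 on every partition, shuffle, then merge each bucket."""
    partial = [count_partition(p) for p in partitions]
    result = {}
    for bucket in shuffle(partial, num_reducers):
        result.update(merge(bucket))
    return result
```

Calling `word_count([["a b a"], ["b c"]])` treats each inner list as the lines held by one machine. Because a word always maps to a single bucket, the merged buckets never overlap and can simply be combined.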
  • Hello World: Word Count
    • Two phases:
      • Mapping: take the input data and load it into the mapper
      • Reducing: process all output from the mappers and produce the final result
    • 1.1 list(filename, file content)
    • 1.2 list(word, 1)
    • 2.1 list(word, list(1))
    • 2.2 list(word, count)
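The four numbered stages can be traced with plain Python lists. This is a minimal sketch of the data flow only; the names `map_phase`, `group_by_key`, `reduce_phase`, and `run` are hypothetical, and in a real cluster the grouping step is performed by the framework between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(filename, content):
    """Stage 1.1 -> 1.2: turn one (filename, content) pair into (word, 1) pairs."""
    return [(word, 1) for word in content.split()]


def group_by_key(pairs):
    """Stage 1.2 -> 2.1: collect the values of each word into one list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (word, [value for _, value in group])
        for word, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(word, values):
    """Stage 2.1 -> 2.2: collapse the list of values into a single count."""
    return (word, sum(values))


def run(files):
    """Apply the map, group, and reduce steps to a list of (filename, content)."""
    pairs = []
    for filename, content in files:
        pairs.extend(map_phase(filename, content))
    return [reduce_phase(word, values) for word, values in group_by_key(pairs)]
```

For example, `run([("a.txt", "x y x"), ("b.txt", "y z")])` passes through `[("x", [1, 1]), ("y", [1, 1]), ("z", [1])]` at stage 2.1 before the counts are summed.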
  • Hello World
    • mapper.py
    • reducer.py
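The contents of the two scripts did not survive in the transcript. The following is a sketch of what a word-count mapper and reducer for Hadoop Streaming commonly look like, not the presenter's code. It assumes the usual Streaming convention of tab-separated `key<TAB>value` lines, with the reducer receiving its input sorted by key. Both steps are written as functions over an iterable of lines so that they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts of consecutive lines that share the same word.

    Assumes the input is already sorted by word, as the framework
    delivers it to a streaming reducer.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

To use these as `mapper.py` and `reducer.py`, each script would loop over `sys.stdin` and print every yielded line. Locally, `sorted(mapper(lines))` can stand in for the sort that happens between the two steps.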
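  • Installation
    • Mode:
      • Standalone mode (default)
      • Pseudo-distributed mode: recommended for development and debugging
      • Fully distributed mode
    • Configuration:
      • Basic configuration
      • SSH configuration
      • Ubuntu configuration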
  • Hadoop Framework
    • HDFS:
      • NameNode: tracks, directs, and records
      • DataNode: low-level I/O operations
      • Secondary NameNode
    • MapReduce:
      • JobTracker
      • TaskTracker
  • Related
    • Programming:
      • Java
      • Python
      • Jython (translates Python to Java bytecode)
      • Hadoop Streaming (stdin, stdout)
      • Dumbo
      • Happy
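  • Related
    • Pig: high-level dataflow language
    • Hive: SQL data warehouse
    • HBase: column-oriented database modeled on Google BigTable
    • ZooKeeper: coordination system for shared state
    • Chukwa: data collection system
    • Mahout: data mining and machine learning
    • Hama: matrix computation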
  • Resources
    • Books:
      • Hadoop in Action
      • Hadoop 实战 (2nd edition)
    • Video && Google Course
    • URL:
      • Resource bookmarks
  • Thanks