Hadoop introduction

Background && Installation && Hello world && related

Outline

• Background
• Hello world
• Installation
• Related

12/20/12 2

Background
• Why Hadoop?
• Accessible: runs on commodity clusters or on cloud services such as AWS
• Robust: gracefully handles most hardware failures
• Scalable: scales linearly as nodes are added
• Simple: 1 == 10,000 (the same code runs on one machine or on ten thousand)
• Key Points:
• Scale-out
• Moving code to data


Background: History
• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
• Yahoo (10,000)
• Facebook (Hive, HBase, …)
• Hulu (HBase)
• Baidu (3,000 TB in one week)
• Twitter (tweet data)


Background
• Comparing SQL databases and Hadoop
• Structure:
• SQL (structured data with a defined schema)
• Hadoop (key-value pairs; suits less-structured data such as text and pictures)
• Scale-out instead of scale-up
• Key-value pairs instead of relational tables
• Functional programming instead of declarative queries
• Offline batch processing (write once, read many times) instead of online transactions

Background – Understanding
• Word Count
• As the file size grows, the in-memory table runs out of memory
• Disk-based hash table (more complex)
• Distributed:
• Phase 1: each machine processes its own part
• Phase 2: merge the results
• Shuffle the partitions to the appropriate machines (e.g. partitioned alphabetically)

• We have now built a minimal Hadoop.
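
The two phases and the shuffle step above can be tried out on a single machine. The following is only a rough sketch, not how Hadoop itself is implemented: each "machine" is simulated by a plain Python list, and the function names (`count_part`, `partition_for`, `shuffle`, `merge`, `distributed_word_count`) are made up for illustration. Words are assigned to machines by their first letter, and anything that does not start with a-z is assumed to go to machine 0.

```python
from collections import Counter
import string


def count_part(text):
    """Phase 1: one machine counts the words in its own part of the input."""
    return Counter(text.lower().split())


def partition_for(word, num_machines):
    """Pick the machine that owns a word, based on its first letter.

    Assumption for this sketch: words not starting with a-z go to machine 0.
    """
    index = string.ascii_lowercase.find(word[0])  # -1 when not a letter
    return max(index, 0) % num_machines


def shuffle(partial_counts, num_machines):
    """Send every (word, count) pair to the machine that owns the word."""
    inboxes = [[] for _ in range(num_machines)]
    for counts in partial_counts:
        for word, n in counts.items():
            inboxes[partition_for(word, num_machines)].append((word, n))
    return inboxes


def merge(inbox):
    """Phase 2: one machine adds up the partial counts it received."""
    total = Counter()
    for word, n in inbox:
        total[word] += n
    return total


def distributed_word_count(parts, num_machines=26):
    """Run phase 1, the shuffle, and phase 2 over a list of text parts."""
    partial_counts = [count_part(part) for part in parts]
    inboxes = shuffle(partial_counts, num_machines)
    result = {}
    for inbox in inboxes:
        result.update(merge(inbox))  # each word lives in exactly one inbox
    return result
```

Calling `distributed_word_count(["do as I say", "not as I do"])` counts each part separately and then merges the counts. Because every word is routed to exactly one machine, the per-machine results can be combined without any further summing.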


Hello World: Word Count
• Two phases:
• Mapping: takes the input data and loads it into the mappers
• Reducing: processes all output from the mappers and produces the final result

• 1.1 list(filename, file content)
• 1.2 list(word, 1)
• 2.1 list(word, list(1))
• 2.2 list(word, count)
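
The four stages listed above can be traced with a small in-memory sketch. It is an illustration under stated assumptions, not Hadoop's API: the function names are invented, and stage 2.1 is modeled as each word paired with the list of 1s emitted for it. The grouping step uses a sort followed by `itertools.groupby`, standing in for the sort-and-group that the framework does between map and reduce.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(documents):
    """1.1 -> 1.2: turn (filename, content) pairs into (word, 1) pairs."""
    for _filename, content in documents:
        for word in content.split():
            yield (word, 1)


def group_by_key(pairs):
    """1.2 -> 2.1: collect all values emitted for the same word."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, [value for _, value in group])


def reduce_phase(grouped):
    """2.1 -> 2.2: sum each word's list of values into a count."""
    for word, values in grouped:
        yield (word, sum(values))


def word_count(documents):
    """Chain the stages: map, group by key, reduce."""
    return list(reduce_phase(group_by_key(map_phase(documents))))
```

For example, `word_count([("a.txt", "foo bar foo"), ("b.txt", "bar baz")])` passes through `("foo", [1, 1])` at stage 2.1 before the reducer turns it into `("foo", 2)`.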


Hello World
• mapper.py
• Reducer.py
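
The contents of the two scripts are not reproduced in these slides. As a stand-in, here is a minimal sketch of what a word-count mapper and reducer for Hadoop Streaming might look like, assuming the usual Streaming convention of tab-separated `key<TAB>value` lines and reducer input that arrives sorted by key. The function names `map_lines` and `reduce_lines` are hypothetical; in real scripts each would be fed `sys.stdin` and its output printed line by line.

```python
def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a word.

    Assumes the lines are already sorted by word, so all lines for one
    word are adjacent.
    """
    current, total = None, 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

Locally, the sort between the two steps can be imitated with `sorted(...)`, for example `list(reduce_lines(sorted(map_lines(open_file))))`, where `open_file` is any iterable of text lines.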


Installation
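• Modes:
• Standalone mode (default)
• Pseudo-distributed mode (recommended for development and debugging)
• Fully distributed mode
• Configuration:
• Basic configuration
• SSH configuration
• Ubuntu configuration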


Hadoop Framework
• HDFS:
• NameNode: tracking, directing, bookkeeping
• DataNode: low-level I/O operations
• Secondary NameNode
• MapReduce:
• JobTracker
• TaskTracker


Related
• Programming:
• Java
• Python
• Jython (compiles Python to Java bytecode)
• Hadoop Streaming (stdin, stdout)
• Dumbo
• Happy


Related
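• Pig: high-level data-flow language
• Hive: SQL-like data warehouse
• HBase: column-oriented database modeled on Google Bigtable
• ZooKeeper: coordination system for shared state
• Chukwa: data collection system
• Mahout: data mining and machine learning
• Hama: matrix computation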


Resource
• Book:
• Hadoop in Action
• Hadoop 实战 (2nd edition)
• Video && Google Course
• URL:
• Resource bookmarks
