MapReduce: A Brief Introduction and Exercises

MAP REDUCE
https://goo.gl/dSsBqp
May 21, 2015

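Big Data
■ Google processes the Web
■ 20+ billion web pages x 20 KB = 400+ TB
■ A single computer reads from disk at 30-35 MB/sec, so reading the whole web would take about 4 months
■ Storing it takes 200 x 2 TB hard disks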
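The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes) and uses the upper end of the quoted read speed; the variable names are illustrative.

```python
# Back-of-the-envelope check of the numbers on the slide.
# Assumption: decimal units, i.e. 1 KB = 10**3 bytes and 1 TB = 10**12 bytes.
PAGES = 20e9         # 20+ billion web pages
PAGE_SIZE = 20e3     # 20 KB per page, in bytes
READ_SPEED = 35e6    # 35 MB/sec, the upper end of the quoted range
DISK_SIZE = 2e12     # one 2 TB hard disk

total_bytes = PAGES * PAGE_SIZE
total_tb = total_bytes / 1e12
read_days = total_bytes / READ_SPEED / 86400   # 86400 seconds per day
disks = total_bytes / DISK_SIZE

print('total size: %.0f TB' % total_tb)
print('read time on one machine: %.0f days' % read_days)
print('2 TB disks needed: %.0f' % disks)
```

At 35 MB/sec the read time comes out at roughly 132 days, a little over four months, which agrees with the estimate above.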

Big Data
■ Distributed computation, distributed storage
■ In 2011, Google had 1M machines

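MapReduce
■ Challenges
■ How do we distribute the computation and the data?
■ Isn't writing distributed/parallel programs hard?
■ MapReduce addresses these problems
■ A computation and data-processing model designed by Google
■ It applies simply and elegantly to large-scale data processing, so writing parallel programs is no longer difficult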

Example: Word Count
    $ cat data | tr -sc 'A-Za-z' '\n' | sort | uniq -c
    ...
    645 and
      2 animation
      3 annotations
      3 answers
      1 anticipated
     19 any
      2 anymore
      2 anything
     39 apos
    ...

Example: Word Count
    $ cat data
    ...
    On January 2, 1985, Zaman Akil sent the Academy of Scie
    At his request, the Perpetual Secretary of the Academy,
    I was the only one who agreed to discuss it with the au
    ...

Example: Word Count
    $ cat data | tr -sc 'A-Za-z' '\n'
    ...
    Germain
    sent
    the
    letter
    to
    several
    members
    of
    ...

Example: Word Count
    $ cat data | tr -sc 'A-Za-z' '\n' | sort
    ...
    Also
    Also
    Also
    Although
    Although
    America
    American
    American
    Among
    An
    An

Example: Word Count
    $ cat data | tr -sc 'A-Za-z' '\n' | sort | uniq -c
    ...
    645 and
      2 animation
      3 annotations
      3 answers
      1 anticipated
     19 any
      2 anymore
      2 anything
     39 apos
    ...
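For readers more at home in Python than in the shell, the pipeline above can be approximated as follows. This is a rough sketch, not an exact equivalent: `word_count` is a hypothetical helper name, and `sort` orders words according to the shell locale while Python's `sorted` uses code-point order.

```python
import re
from collections import Counter


def word_count(text):
    # tr -sc 'A-Za-z' '\n': treat every run of non-letters as one separator
    words = re.findall(r'[A-Za-z]+', text)
    # sort | uniq -c: count how many times each distinct word occurs
    return Counter(words)


counts = word_count('the cat and the hat, and THE end')
for word in sorted(counts):
    print(counts[word], word)
```

Like the shell version, this sketch is case-sensitive, so "the" and "THE" are counted separately.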

WordCount Functions
    # emit() is supplied by the MapReduce framework
    def map(key, value):
        # key: NA; value: a line of input text
        for word in value.split():
            emit(word, 1)

    def reduce(key, values):
        # key: word; values: an iterator over counts
        result = 0
        for count in values:
            result += count
        emit(key, result)
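To see how the framework ties the two functions together, the map, shuffle, and reduce phases can be imitated in memory. The sketch below is a toy, single-process stand-in for a real framework: `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative names, and generators that `yield` pairs take the place of `emit`.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: unused; value: a line of input text
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    # key: a word; values: an iterator over its counts
    yield key, sum(values)


def run_mapreduce(lines, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record
    pairs = [kv for line in lines for kv in map_fn(None, line)]
    # Shuffle phase: sorting brings identical keys next to each other
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call reduce_fn once per distinct key
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results


result = run_mapreduce(['a b a', 'b c'], map_fn, reduce_fn)
print(result)
```

A real framework runs the map and reduce calls on many machines and performs the shuffle over the network; only the data flow is the same here.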

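MapReduce Implementations
■ Hadoop
■ 1999: Doug Cutting develops the open-source search engine software Apache Lucene and Nutch
■ 2004: Google reveals how its search engine works: MapReduce + the Google File System
■ 2004: Doug Cutting develops Hadoop to work with Lucene and Nutch (sponsored by Yahoo)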

MapReduce Implementations
■ Spark
■ Designed for performance
■ APIs for Scala, Java, and Python
■ Disco: MapReduce framework for Python
■ MapReduce-MPI: for distributed-memory parallel
machines on top of standard MPI message passing
■ Meguro: a simple JavaScript Map/Reduce framework
■ bashreduce: MapReduce in a bash script
■ localmapreduce

Hadoop Distributions & Services
Apache page of Distributions
■ Amazon Web Services
■ Apache Bigtop
■ Cascading
■ Cloudera
■ Datameer
■ Hortonworks
■ IBM InfoSphere BigInsights
■ MapR Technologies
■ Syncsort
■ Tresata
■ ...

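Hadoop Streaming
■ Data is passed through stdin and stdout, much like a shell pipeline.
■ Any command that runs in a shell can serve as the mapper or reducer, so a MapReduce job can be written in any language.
Mapper
■ Input: each whole line of the file is the value (default)
■ Output: everything before the first tab is the key; the rest is the value
Reducer
■ Input: everything before the first tab is the key; the rest is the value. Lines with the same key appear consecutively.
■ Output: each line is printed as is (default)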
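The line format described above can be handled with a small amount of Python. The sketch below is illustrative only: `parse_line` and `group_by_key` are hypothetical helper names, and the grouping relies on lines with equal keys arriving consecutively, as stated for the reducer input.

```python
from itertools import groupby


def parse_line(line):
    # Everything before the first tab is the key; the rest is the value.
    # A line without a tab yields an empty value in this sketch.
    key, _, value = line.rstrip('\n').partition('\t')
    return key, value


def group_by_key(lines):
    # Relies on lines with the same key arriving consecutively.
    pairs = (parse_line(line) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


reducer_input = ['and\t1\n', 'and\t1\n', 'any\t1\n']
grouped = list(group_by_key(reducer_input))
print(grouped)
```

In an actual streaming reducer the lines would come from `sys.stdin` rather than from a list.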

Hands-on (lmr)
Downloads
■ ngramcount.py: https://goo.gl/wZ41MH
■ localmapreduce: https://goo.gl/8UlChs
■ citeseerx.40000: https://goo.gl/RmbfYm
Mac setup
■ Install Homebrew: http://brew.sh/
■ Install GNU parallel
    brew install parallel

Hands-on (lmr)
■ Test ngramcount.py with a pipe
    $ head -100 citeseerx.40000 | ./ngramcount.py -m |
    > sort -k1,1 -t$'\t' | ./ngramcount.py -r | less
■ Distribute the run with lmr (localmapreduce)
    $ pv citeseerx.40000 |
    > lmr 300k 20 './ngramcount.py -m' './ngramcount.py -r' out
    $ ls out
    reducer-00 reducer-02 reducer-04 reducer-06 reducer-08 reducer-10 re
    reducer-01 reducer-03 reducer-05 reducer-07 reducer-09 reducer-11 re
    $ less out/*

Ngram Count I
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals, print_function


    def ngrams(words):
        # yield every 1-gram up to 5-gram as a tuple of words
        for length in range(1, 5 + 1):
            for ngram in zip(*(words[i:] for i in range(length))):
                yield ngram


    def mapper(line):
        # from nltk.tokenize import word_tokenize
        # words = word_tokenize(line.lower())
        import re
        words = re.findall(r'[a-z]+', line.lower())
        for ngram in ngrams(words):

Ngram Count II
            # (continued) body of the loop in mapper()
            yield ' '.join(ngram), 1


    def reducer(key, values):
        count = sum(int(v) for v in values)
        yield key, count


    def do_mapper(files):
        import fileinput
        for line in fileinput.input(files):
            for key, value in mapper(line):
                print('{}\t{}'.format(key, value))


    def line_to_keyvalue(line):
        # split at the first tab only: key before it, value after it
        key, value = line.decode('utf8').split('\t', 1)
        return key, value

Ngram Count III
    def do_reducer(files):
        import fileinput
        # itertools.imap exists only in Python 2
        from itertools import groupby, imap
        keyvalues = imap(line_to_keyvalue, fileinput.input(files))
        for key, grouped_keyvalues in groupby(keyvalues,
                                              key=lambda x: x[0]):
            values = (v for k, v in grouped_keyvalues)
            for key, value in reducer(key, values):
                print('{}\t{}'.format(key, value))


    def argparser():
        import argparse
        parser = argparse.ArgumentParser(description='N-gram counter')
        mode_group = parser.add_mutually_exclusive_group(required=True)
        mode_group.add_argument(
            '-r', '--reducer', action='store_true', help='reducer mode')
        mode_group.add_argument(

Ngram Count IV
            '-m', '--mapper', action='store_true', help='mapper mode')
        parser.add_argument('files', metavar='FILE', type=str, nargs='*',
                            help='input files')
        return parser.parse_args()

    if __name__ == '__main__':

        args = argparser()
        if args.mapper:
            do_mapper(args.files)
        elif args.reducer:
            do_reducer(args.files)
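The zip-based `ngrams` generator in the listing is compact, so it may help to run it on a tiny input. The sketch below repeats the same idea; the `max_length` parameter is an addition for illustration and is not part of ngramcount.py.

```python
def ngrams(words, max_length=5):
    # Zip the word list against shifted copies of itself:
    # length 2 pairs words[0:] with words[1:], length 3 adds words[2:], etc.
    for length in range(1, max_length + 1):
        for ngram in zip(*(words[i:] for i in range(length))):
            yield ngram


words = ['a', 'sandy', 'beach']
result = [' '.join(g) for g in ngrams(words)]
print(result)
```

Because `zip` stops at the shortest input, lengths longer than the sentence simply produce nothing, so no explicit bounds check is needed.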

Modification Exercise
Modify ngramcount.py
■ Produce only 1-grams and 2-grams
■ Keep only n-grams with count > 5
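One possible way to approach the two changes is sketched below. It is not the reference solution: the constants `MAX_N` and `MIN_COUNT` are hypothetical names introduced here, and the functions mirror the `mapper` and `reducer` of ngramcount.py without its command-line handling.

```python
import re

MAX_N = 2       # hypothetical constant: produce only 1-grams and 2-grams
MIN_COUNT = 5   # hypothetical constant: keep n-grams whose count exceeds this


def ngrams(words, max_n=MAX_N):
    for length in range(1, max_n + 1):
        for ngram in zip(*(words[i:] for i in range(length))):
            yield ngram


def mapper(line):
    words = re.findall(r'[a-z]+', line.lower())
    for ngram in ngrams(words):
        yield ' '.join(ngram), 1


def reducer(key, values):
    count = sum(int(v) for v in values)
    # Filtering belongs in the reducer: only there is the full count known.
    if count > MIN_COUNT:
        yield key, count
```

The threshold cannot be applied in the mapper, because a single mapper sees only part of the occurrences of any n-gram.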

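Exercise: Linguistic Search Engine
■ Build a search engine for looking up English word usage.
■ It can support English learning and writing.
Search examples
■ adj. beach: find the adjectives that appear before "beach".
■ play * role: find the word sequences that most often appear between "play" and "role".
■ go ?to home: check whether "to" belongs between "go" and "home".
■ go * movie: find the word sequences that most often appear between "go" and "movie".
■ kill the _: find the things that most often get "killed".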

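Syntax Design
Syntax                   Description
_                        any single word
*                        zero or more arbitrary words
?term                    term is optional
term1 | term2            term1 or term2
adj. det. n. v. prep.    adjective, determiner (article), noun, verb, preposition
Search examples
■ adj. beach: find the adjectives that appear before "beach".
■ play * role: find the word sequences that most often appear between "play" and "role".
■ go ?to home: check whether "to" belongs between "go" and "home".
■ go * movie: find the word sequences that most often appear between "go" and "movie".
■ kill the _: find the things that most often get "killed".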
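To make the semantics of the operators concrete, a query can be translated into a regular expression and tested against an n-gram. This is only an illustrative sketch under several assumptions: `query_to_regex` and `matches` are hypothetical names, alternatives are written without spaces (`a|the`), words in an n-gram are separated by single spaces, and the part-of-speech tags (adj., det., ...) are not handled because they would need a tagger.

```python
import re


def query_to_regex(query):
    # Every word in the pattern is followed by one space, so the n-gram
    # is matched with a trailing space appended (see matches()).
    parts = []
    for token in query.split():
        if token == '_':                      # any single word
            parts.append(r'(?:\S+ )')
        elif token == '*':                    # zero or more words
            parts.append(r'(?:\S+ )*')
        elif token.startswith('?'):           # optional term
            parts.append('(?:' + re.escape(token[1:]) + ' )?')
        elif '|' in token:                    # term1|term2
            alts = '|'.join(re.escape(t) for t in token.split('|'))
            parts.append('(?:(?:' + alts + ') )')
        else:                                 # literal word
            parts.append(re.escape(token) + ' ')
    return re.compile('^' + ''.join(parts) + '$')


def matches(query, ngram):
    return query_to_regex(query).match(ngram + ' ') is not None


print(matches('play * role', 'play an important role'))
print(matches('go ?to home', 'go home'))
print(matches('kill the _', 'kill the'))
```

Matching every stored n-gram against a regular expression at query time would be slow on a large corpus, which is one reason to precompute the results per query with MapReduce instead.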

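Linguistic Search Engine
■ Goal: implement the first item of the syntax, _
■ _ may be placed at any position
■ N-grams of at most 4 words
Query examples
■ play _ _ role
■ kill the _
■ a _ beach
■ Input data: many sentences from citeseerx
■ Output:
■ key: every query that has results
■ value: the top 100 n-grams matching the query, with their counts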

Linguistic Search Engine - Output
■ key: every query that has results
■ value: the top 100 n-grams matching the query, with their counts
Output example
Key           Ngram                Count
a _ beach     a sandy beach        486
              a private beach      416
              a beautiful beach    314
              a small beach        175
              ...
kill the _    kill the people      189
              kill the other       174
              kill the process     163
              kill the enemy       160
              ...

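In-Class Exercise
Goal
■ Following the MapReduce architecture, design the input and output of the mapper and reducer at each stage to complete Lab 12
■ Simple key-value input and output examples written on paper are enough to convey the idea
Hints
■ The design may chain one or more map/reduce passes
■ Consider how splitting the mapper's input data affects the result
■ Mapper input is a value or a key-value pair; its output is key-value pairs
■ Reducer input is grouped key-values; its output is key-value pairs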

Bi-gram Count
Bi-gram Count Mapper example
Input (value)    Output (key => value)
C D C D          C D => 2
                 D C => 1
B C D A          B C => 1
                 C D => 1
                 D A => 1
C D A B          C D => 1
                 D A => 1
                 A B => 1
Reducer example
Input (key => value)    Output (key => value)
A B => 1                A B => 1
B C => 1                B C => 1
C D => 2                C D => 4
C D => 1
C D => 1
D A => 1                D A => 2
D A => 1
D C => 1                D C => 1
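The mapper in this example counts repeated bi-grams within a line before emitting them, which is why "C D C D" produces a count of 2 for "C D". A minimal sketch of that behaviour follows; the names `bigram_mapper` and `bigram_reducer` are illustrative and do not come from the slides.

```python
from collections import Counter


def bigram_mapper(line):
    words = line.split()
    # Count each adjacent pair within the line before emitting it.
    counts = Counter(zip(words, words[1:]))
    for (first, second), count in counts.items():
        yield first + ' ' + second, count


def bigram_reducer(key, values):
    # Sum the partial counts that the mappers produced for this bi-gram.
    yield key, sum(values)


out = dict(bigram_mapper('C D C D'))
print(out)
print(list(bigram_reducer('C D', [2, 1, 1])))
```

Aggregating inside the mapper reduces the number of pairs that have to be shuffled to the reducers, at the cost of a little memory per line.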

Linguistic Search Engine
Mapper example
Input (value)    Output (key => value)
A B C 200        A B C => A B C 200
                 _ B C => A B C 200
                 A _ C => A B C 200
                 A B _ => A B C 200
                 _ _ C => A B C 200
                 _ B _ => A B C 200
                 A _ _ => A B C 200
A D C 300        _ D C => A D C 300
                 A _ C => A D C 300
                 ...
A E C 100        _ E C => A E C 100
                 A _ C => A E C 100
                 ...
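The mapper in this example can be sketched as a function that replaces every subset of positions with `_`, skipping the query made of wildcards only, which matches the seven outputs listed for "A B C". The sketch assumes each input line holds an n-gram and its count separated by a tab, as ngramcount.py prints them; `wildcard_queries` is a hypothetical helper name.

```python
from itertools import product


def wildcard_queries(words):
    # Try every keep/hide choice per position; skip the all-wildcard query.
    for mask in product([False, True], repeat=len(words)):
        if all(mask):
            continue
        yield ' '.join('_' if hide else w for w, hide in zip(words, mask))


def mapper(line):
    # Assumed input format: "<ngram>\t<count>", e.g. "A B C\t200"
    ngram, count = line.rstrip('\n').split('\t')
    for query in wildcard_queries(ngram.split()):
        yield query, (ngram, int(count))


pairs = list(mapper('A B C\t200'))
for query, value in pairs:
    print(query, '=>', value)
```

An n-gram of n words yields 2^n - 1 queries, so capping n-grams at 4 words keeps the mapper output at no more than 15 pairs per input line.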

Linguistic Search Engine
Reducer example
Input (key => value)     Output (key => value)
A _ C => A B C 200       A _ C => A D C 300,
A _ C => A D C 300                A B C 200,
A _ C => A E C 100                A E C 100
A B _ => A B C 200       A B _ => A B C 200
A D _ => A D C 300       A D _ => A D C 300
A E _ => A E C 100       A E _ => A E C 100
A _ _ => A B C 200       A _ _ => A D C 300,
A _ _ => A D C 300                A B C 200,
A _ _ => A E C 100                A E C 100
_ B C => A B C 200       _ B C => A B C 200
_ D C => A D C 300       _ D C => A D C 300
_ E C => A E C 100       _ E C => A E C 100
...                      ...
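The reducer in this example collects the n-grams that share a query and lists them by descending count. A minimal sketch using `heapq.nlargest` from the standard library is shown below; `TOP_N` and the `(ngram, count)` tuple layout are assumptions made for illustration.

```python
import heapq

TOP_N = 100  # assumed constant: keep the top 100 n-grams per query


def reducer(query, values, top_n=TOP_N):
    # values: an iterable of (ngram, count) pairs that share the same query
    best = heapq.nlargest(top_n, values, key=lambda pair: pair[1])
    yield query, best


values = [('A B C', 200), ('A D C', 300), ('A E C', 100)]
result = list(reducer('A _ C', values))
print(result)
```

`heapq.nlargest` keeps at most `top_n` items in memory at a time, which matters for very frequent queries such as "the _" that match a large number of n-grams.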

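Homework
Six programs to complete
■ Mapper and reducer that produce the n-gram counts
■ Mapper and reducer that produce the query results
■ A program that converts the query results into a database
(try Python's built-in shelve or sqlite3 modules)
■ A database front end that lets a user enter a query and get the result immediately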

Python shelve
    import shelve
    d = shelve.open('data.shelve')
    d['odds'] = [1, 3, 5, 7, 9]
    print(d['odds'])
    d['evens'] = [2, 4, 6, 8, 10]
    d['hello'] = 'world'
    del d['hello']
    d['zipcodes'] = {'hsinchu': 300, 'zhongli': 320}
    print(list(d.keys()))
    d.close()
Search for "python shelve" to find the official documentation
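The homework also mentions sqlite3 as an alternative to shelve. The following sketch shows one way the query results could be stored and looked up with it. The table layout, the JSON encoding of the n-gram list, and the function names `build_db` and `lookup` are all assumptions made for illustration.

```python
import json
import sqlite3


def build_db(path, query_results):
    # query_results: iterable of (query, [[ngram, count], ...]) pairs
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS results '
                 '(query TEXT PRIMARY KEY, ngrams TEXT)')
    rows = ((q, json.dumps(ngrams)) for q, ngrams in query_results)
    conn.executemany('INSERT OR REPLACE INTO results VALUES (?, ?)', rows)
    conn.commit()
    return conn


def lookup(conn, query):
    row = conn.execute('SELECT ngrams FROM results WHERE query = ?',
                       (query,)).fetchone()
    # An unknown query returns an empty list rather than raising an error.
    return json.loads(row[0]) if row else []


conn = build_db(':memory:', [('a _ beach', [['a sandy beach', 486],
                                            ['a private beach', 416]])])
print(lookup(conn, 'a _ beach'))
```

Here `':memory:'` keeps the database in RAM for the demonstration; passing a file name instead would persist it on disk for the interface program to open later.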
