5. print 'Hello BC'
hello.py
Typing the interpreter every time is tedious.
Make the Python script itself an executable!
#!/home/hadoop/python/py
print 'Hello BC'
hello.py
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Running Python
Running the Hello BC example
#! (shebang)
6. #!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
mapper.py
Python MAP
Mapper example over standard I/O
7. [hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted on the first field
Python MAP
Sorting the mapper output
8. import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '{0}\t{1}'.format(current_word, current_count)
reducer.py
If it matches the current word, add to the count
If current_word is not None
Print the M/R result
Set the new current word
Handles the last line
Python REDUCE
Reducer example over standard I/O
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
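The shell pipeline above mirrors Hadoop Streaming's map → shuffle/sort → reduce flow. As a rough sketch (the function names here are illustrative, not from the slides), the same aggregation can be reproduced in plain Python:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for each whitespace-separated token, like mapper.py.
    return [(word, 1) for word in line.strip().split()]

def reducer(pairs):
    # Sort by key, then sum per key -- what `sort -k 1 | ./reducer.py` does.
    return {word: sum(count for _, count in group)
            for word, group in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(reducer(mapper("bc bc bc card bc card it")))
# {'bc': 4, 'card': 2, 'it': 1}
```

The `sort` step matters: the streaming reducer only keeps one running counter, so it relies on all identical keys arriving adjacently, exactly as Hadoop's shuffle guarantees.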
9. Python ♥ Hadoop
Hadoop Streaming
Requirements
1. In Hadoop Streaming, the mapper and reducer must be given as executable commands.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the PATH so the Python scripts are reachable from any node.
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
What you get if you don't:
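A sketch of the usual fix, assuming the scripts sit in the submitter's working directory: make them executable, verify the shebang points at an interpreter that exists on every node, and ship them with -file so "mapper.py" resolves locally on each compute node:

```shell
# Make the scripts executable and check the shebang (first line).
chmod 755 mapper.py reducer.py
head -1 mapper.py

# -file copies each script into the job jar, so every task
# can launch "mapper.py" from its own local directory.
hadoop jar hadoop-streaming*.jar \
    -input myInput -output myOutput \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
```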
11. Python ♥ Hadoop
Hadoop Streaming command options
Parameter                                    Optional/Required  Description
-input directoryname or filename             Required           Input location for mapper
-output directoryname                        Required           Output location for reducer
-mapper executable or JavaClassName          Required           Mapper executable
-reducer executable or JavaClassName         Required           Reducer executable
-file filename                               Optional           Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName                   Optional           Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName                  Optional           Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName                   Optional           Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName  Optional           Combiner executable for map output
-cmdenv name=value                           Optional           Pass environment variable to streaming commands
-inputreader                                 Optional           For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose                                     Optional           Verbose output
-lazyOutput                                  Optional           Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks                              Optional           Specify the number of reducers
-mapdebug                                    Optional           Script to call when map task fails
-reducedebug                                 Optional           Script to call when reduce task fails
hadoop command [genericOptions] [streamingOptions]
12. Python ♥ Hadoop
Hadoop Streaming generic options
Parameter                 Optional/Required  Description
-conf configuration_file  Optional           Specify an application configuration file
-D property=value         Optional           Use value for given property
-fs host:port or local    Optional           Specify a namenode
-files                    Optional           Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars                  Optional           Specify comma-separated jar files to include in the classpath
-archives                 Optional           Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions]
Usage example
hadoop jar hadoop-streaming-2.5.1.jar \
    -D mapreduce.job.reduces=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
13. Python ♥ Hadoop
Hadoop Streaming run: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters
…..
17. Python ♥ Hadoop
Hadoop Streaming example: refining WordCount
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub(r'[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
mapper.py modified
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice2 \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
Regular expression: strip special characters
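The character class can be checked locally before resubmitting the job; a minimal sketch (the sample sentence is made up):

```python
import re

def clean(line):
    # Same substitution as the revised mapper.py:
    # drop the characters = . # / ? : $ ' ! , " }
    return re.sub(r'[=.#/?:$\'!,"}]', '', line)

print(clean('Alice said: "bc, bc! card."'))
# Alice said bc bc card
```

Note the quote inside the class must be escaped (or the pattern double-quoted); the unescaped `'` in the original slide would end the string literal early and raise a SyntaxError.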