5. print 'Hello BC'
hello.py
Typing the interpreter every time is tedious.
Make the Python script itself an executable!
#!/home/hadoop/python/py
print 'Hello BC'
hello.py
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Running Python
Running the Hello BC example
#! (shebang)
6. #!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
mapper.py
Python MAP
Mapper example over standard I/O
7. [hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted on the first field
Python MAP
Sorting the mapper output
8. import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '{0}\t{1}'.format(current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '{0}\t{1}'.format(current_word, current_count)
reducer.py
If it matches the current word, add to the count
If current_word is not None
Print the M/R result
Set the new current word
Handles the last line
Python REDUCE
Reducer example over standard I/O
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
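The shell pipeline above mirrors Hadoop Streaming's map → shuffle/sort → reduce flow. As a rough sketch (the function names here are illustrative, not from the slides), the same aggregation can be reproduced in plain Python:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for each whitespace-separated token, like mapper.py.
    return [(word, 1) for word in line.strip().split()]

def reducer(pairs):
    # Sort by key, then sum per key -- what `sort -k 1 | ./reducer.py` does.
    return {word: sum(count for _, count in group)
            for word, group in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(reducer(mapper("bc bc bc card bc card it")))
# {'bc': 4, 'card': 2, 'it': 1}
```

The `sort` step matters: the streaming reducer only keeps one running counter, so it relies on all identical keys arriving adjacently, exactly as Hadoop's shuffle guarantees.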
9. Python ♥ Hadoop
Hadoop Streaming
Requirements
1. In Hadoop Streaming, the mapper and reducer must be given as executable commands.
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the PATH so the Python scripts are reachable from any node.
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
What you get if you don't:
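A sketch of the usual fix, assuming the scripts sit in the submitter's working directory: make them executable, verify the shebang points at an interpreter that exists on every node, and ship them with -file so "mapper.py" resolves locally on each compute node:

```shell
# Make the scripts executable and check the shebang (first line).
chmod 755 mapper.py reducer.py
head -1 mapper.py

# -file copies each script into the job jar, so every task
# can launch "mapper.py" from its own local directory.
hadoop jar hadoop-streaming*.jar \
    -input myInput -output myOutput \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
```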
11. Python ♥ Hadoop
Hadoop Streaming command options
Parameter                                    Optional/Required  Description
-input directoryname or filename             Required           Input location for mapper
-output directoryname                        Required           Output location for reducer
-mapper executable or JavaClassName          Required           Mapper executable
-reducer executable or JavaClassName         Required           Reducer executable
-file filename                               Optional           Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName                   Optional           Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName                  Optional           Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName                   Optional           Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName  Optional           Combiner executable for map output
-cmdenv name=value                           Optional           Pass environment variable to streaming commands
-inputreader                                 Optional           For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose                                     Optional           Verbose output
-lazyOutput                                  Optional           Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks                              Optional           Specify the number of reducers
-mapdebug                                    Optional           Script to call when map task fails
-reducedebug                                 Optional           Script to call when reduce task fails
hadoop command [genericOptions] [streamingOptions]
12. Python ♥ Hadoop
Hadoop Streaming generic options
Parameter                 Optional/Required  Description
-conf configuration_file  Optional           Specify an application configuration file
-D property=value         Optional           Use value for given property
-fs host:port or local    Optional           Specify a namenode
-files                    Optional           Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars                  Optional           Specify comma-separated jar files to include in the classpath
-archives                 Optional           Specify comma-separated archives to be unarchived on the compute machines
hadoop command [genericOptions] [streamingOptions]
Usage example
hadoop jar hadoop-streaming-2.5.1.jar \
    -D mapreduce.job.reduces=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
13. Python ♥ Hadoop
Hadoop Streaming run: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters
…..
17. Python ♥ Hadoop
Hadoop Streaming example: refining WordCount
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub(r'[=.#/?:$\'!,"}]', '', line)
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)
mapper.py modified
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
    -input alice -output wc_alice2 \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
Regular expression: strip special characters
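The character class can be checked locally before resubmitting the job; a minimal sketch (the sample sentence is made up):

```python
import re

def clean(line):
    # Same substitution as the revised mapper.py:
    # drop the characters = . # / ? : $ ' ! , " }
    return re.sub(r'[=.#/?:$\'!,"}]', '', line)

print(clean('Alice said: "bc, bc! card."'))
# Alice said bc bc card
```

Note the quote inside the class must be escaped (or the pattern double-quoted); the unescaped `'` in the original slide would end the string literal early and raise a SyntaxError.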