Using Hadoop at COOKPAD
http://cookpad.com/
@sasata299
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
COOKPAD: about 8.96 million users; one in three women in their thirties uses it
GROUP BY aggregation on MySQL
(about 3.5 …)




7000 …
What is Hadoop?
• An open-source implementation of Google's MapReduce
[Diagram] A master node coordinates the slave nodes.
Mapper output is partitioned by key: records with the same key all go to the same Reducer.
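The key-based routing above can be sketched in plain Ruby as a word-count toy. The method names here are illustrative, not Hadoop API; on a real cluster Hadoop performs the shuffle itself.

```ruby
# Toy model of Map -> shuffle-by-key -> Reduce (word count).
# Illustrative names only; Hadoop handles the shuffle between phases.

def map_phase(lines)
  # Mapper: emit a (key, value) pair per word
  lines.flat_map { |line| line.split.map { |word| [word, 1] } }
end

def shuffle(pairs)
  # Hadoop groups Mapper output by key, so all values for one key
  # arrive at the same Reducer
  pairs.group_by { |key, _| key }.transform_values { |kv| kv.map { |_, v| v } }
end

def reduce_phase(grouped)
  # Reducer: fold the values collected for each key
  grouped.map { |key, values| [key, values.sum] }.to_h
end

counts = reduce_phase(shuffle(map_phase(["hadoop streaming", "hadoop"])))
# counts => {"hadoop"=>2, "streaming"=>1}
```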
(as of 2009/10)
• … MapReduce … etc.
Hadoop at COOKPAD
• Runs on EC2, with data on S3
  (Hadoop ver. 0.18.3)
• Cloudera distribution & Hadoop Streaming

S3 Native FileSystem
  • Stores files as ordinary S3 objects, so tools other than Hadoop can read them directly
  • File size is limited to 5GB
  • URI scheme: s3n:// ← note the "n"



S3 Block FileSystem
  • Stores files as blocks, HDFS-style, so they can only be read through Hadoop
  • No 5GB file size limit
  • URI scheme: s3://
Down to 1/223: from 7000 to about 30
[Chart] ←Hadoop vs. ↓MySQL
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
Hadoop Streaming mappers and reducers can be tested locally with a plain pipe:

cat hoge.csv | ruby mapper.rb | ruby reducer.rb



[Diagram] Between Mapper and Reducer, output is grouped by key: every record with the same key reaches the same Reducer
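A minimal sketch of what mapper.rb and reducer.rb might look like, assuming a hypothetical CSV whose second column is a record type. Both are written as methods in one file here so the local pipe can be simulated; on the cluster they would be separate scripts reading STDIN.

```ruby
# Hypothetical mapper.rb: emit "key<TAB>1" per CSV line (column 2 = type)
def mapper(lines)
  lines.map { |line| "#{line.chomp.split(",")[1]}\t1" }
end

# Hypothetical reducer.rb: sum the values arriving for each key
def reducer(lines)
  counts = Hash.new(0)
  lines.each do |line|
    key, value = line.chomp.split("\t")
    counts[key] += value.to_i
  end
  counts.map { |key, total| "#{key}\t#{total}" }
end

# Locally this mimics: cat hoge.csv | ruby mapper.rb | sort | ruby reducer.rb
# (Hadoop sorts by key between the Map and Reduce phases, hence the sort)
sample = ["1,recipe,foo", "2,search,bar", "3,recipe,baz"]
output = reducer(mapper(sample).sort)
# output => ["recipe\t2", "search\t1"]
```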
Distributing files from the master

1) Specify them with -file
   Files passed with -file are copied from the master to every slave (like scp)

      hadoop jar xxx.jar
       -mapper hoge.rb -reducer fuga.rb
       -file hoge.rb -file fuga.rb
       -file …

2) In mapper.rb / reducer.rb, open them by filename
      File.open('…') {|f| ... }
Distributing files from S3

1) Specify them with -cacheFile
   A file on S3 is distributed to every slave

      hadoop jar xxx.jar
       -mapper hoge.rb -reducer fuga.rb
       -file hoge.rb -file fuga.rb
       -cacheFile s3n://path/to/… #…

2) In mapper.rb / reducer.rb, open it by filename
      File.open('…') {|f| ... }
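Step 2) can be sketched as below. A Tempfile stands in for the file that -file / -cacheFile drops into the task's working directory; the id values and filename are made up for illustration.

```ruby
require "tempfile"

# Stand-in for the file that -file / -cacheFile places next to the task;
# on the cluster you would File.open the distributed file's local name.
side_file = Tempfile.new("target_ids")
side_file.puts "13930"
side_file.puts "29011"
side_file.close

# Inside mapper.rb / reducer.rb: load the id list once, before reading logs
target_ids = File.open(side_file.path) { |f| f.readlines.map { |l| l.chomp.to_i } }
# target_ids => [13930, 29011]
```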
p target_ids.size # => 50000

ARGF.each do |log|
  log.chomp!
  id, type, *rest = log.split(/,/)
  next if target_ids.include?(id.to_i)
end

With 50,000 entries in target_ids, Array#include? scans the whole array for every log line, which is slow…
target_ids = [13930, 29011, 39291, ...] # 50,000 entries

Idea: split them into about 1,000 buckets keyed by the first three digits, roughly 50 ids each:

{
  '139' => [13930, 13989, 13991, ...], # ~50 ids
  '290' => [29011, 29098, 29076, ...], # ~50 ids
  '392' => [39291, 39244, 39251, ...], # ~50 ids
}

hash = Hash.new {|h, k| h[k] = [] }
target_ids.each do |id|
  hash[ id.to_s[0, 3] ] << id
end

ARGF.each do |log|
  log.chomp!
  id, type, *rest = log.split(/,/)
  next if hash[ id[0, 3] ].include?(id.to_i)
end
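The bucketing trick is runnable end to end with stand-in ids (the real list came from the distributed file). The first three digits pick a small bucket, so each lookup scans on the order of a hundred elements instead of 50,000.

```ruby
# 50,000 stand-in ids (5 digits each, so first-3-digit buckets hold ~100 ids)
target_ids = (10_000...60_000).to_a

# Bucket by the first three digits; include? then scans one tiny bucket
# instead of the whole 50,000-element array
hash = Hash.new { |h, k| h[k] = [] }
target_ids.each { |id| hash[id.to_s[0, 3]] << id }

# Hypothetical helper mirroring the filter inside the mapper loop
def target?(hash, id_str)
  hash[id_str[0, 3]].include?(id_str.to_i)
end

# target?(hash, "13930") => true
# target?(hash, "99999") => false
```

In modern Ruby, `require 'set'; targets = Set.new(target_ids)` gives constant-time `include?` without manual bucketing; the hash-of-arrays above preserves the slide's approach.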
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
Pitfall: with the S3 Native FileSystem we hit
java.net.SocketTimeoutException: Read timed out
(details: http://ow.ly/2bdW1)

→ Hadoop 0.21 / Amazon Elastic MapReduce