Using Hadoop at COOKPAD
http://cookpad.com/
@sasata299
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
COOKPAD: about 8.96 million users; one in three women in their thirties uses it
GROUP BY aggregation on MySQL
(about 3.5 …)




7000 …
What is Hadoop?
• An open-source implementation of Google's MapReduce
[Diagram] A master node coordinates the slave nodes.
Mapper output is partitioned by key: records with the same key all go to the same Reducer.
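The key-based routing above can be sketched in plain Ruby as a word-count toy. The method names here are illustrative, not Hadoop API; on a real cluster Hadoop performs the shuffle itself.

```ruby
# Toy model of Map -> shuffle-by-key -> Reduce (word count).
# Illustrative names only; Hadoop handles the shuffle between phases.

def map_phase(lines)
  # Mapper: emit a (key, value) pair per word
  lines.flat_map { |line| line.split.map { |word| [word, 1] } }
end

def shuffle(pairs)
  # Hadoop groups Mapper output by key, so all values for one key
  # arrive at the same Reducer
  pairs.group_by { |key, _| key }.transform_values { |kv| kv.map { |_, v| v } }
end

def reduce_phase(grouped)
  # Reducer: fold the values collected for each key
  grouped.map { |key, values| [key, values.sum] }.to_h
end

counts = reduce_phase(shuffle(map_phase(["hadoop streaming", "hadoop"])))
# counts => {"hadoop"=>2, "streaming"=>1}
```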
(as of 2009/10)
• … MapReduce … etc.
Hadoop at COOKPAD
• Runs on EC2, with data on S3
  (Hadoop ver. 0.18.3)
• Cloudera distribution & Hadoop Streaming

S3 Native FileSystem
  • Stores files as ordinary S3 objects, so tools other than Hadoop can read them directly
  • File size is limited to 5GB
  • URI scheme: s3n:// ← note the "n"



S3 Block FileSystem
  • Stores files as blocks, HDFS-style, so they can only be read through Hadoop
  • No 5GB file size limit
  • URI scheme: s3://
Down to 1/223: from 7000 to about 30
[Chart] ←Hadoop vs. ↓MySQL
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
Hadoop Streaming mappers and reducers can be tested locally with a plain pipe:

cat hoge.csv | ruby mapper.rb | ruby reducer.rb



[Diagram] Between Mapper and Reducer, output is grouped by key: every record with the same key reaches the same Reducer
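A minimal sketch of what mapper.rb and reducer.rb might look like, assuming a hypothetical CSV whose second column is a record type. Both are written as methods in one file here so the local pipe can be simulated; on the cluster they would be separate scripts reading STDIN.

```ruby
# Hypothetical mapper.rb: emit "key<TAB>1" per CSV line (column 2 = type)
def mapper(lines)
  lines.map { |line| "#{line.chomp.split(",")[1]}\t1" }
end

# Hypothetical reducer.rb: sum the values arriving for each key
def reducer(lines)
  counts = Hash.new(0)
  lines.each do |line|
    key, value = line.chomp.split("\t")
    counts[key] += value.to_i
  end
  counts.map { |key, total| "#{key}\t#{total}" }
end

# Locally this mimics: cat hoge.csv | ruby mapper.rb | sort | ruby reducer.rb
# (Hadoop sorts by key between the Map and Reduce phases, hence the sort)
sample = ["1,recipe,foo", "2,search,bar", "3,recipe,baz"]
output = reducer(mapper(sample).sort)
# output => ["recipe\t2", "search\t1"]
```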
Distributing files from the master

1) Specify them with -file
   Files passed with -file are copied from the master to every slave (like scp)

      hadoop jar xxx.jar
       -mapper hoge.rb -reducer fuga.rb
       -file hoge.rb -file fuga.rb
       -file …

2) In mapper.rb / reducer.rb, open them by filename
      File.open('…') {|f| ... }
Distributing files from S3

1) Specify them with -cacheFile
   A file on S3 is distributed to every slave

      hadoop jar xxx.jar
       -mapper hoge.rb -reducer fuga.rb
       -file hoge.rb -file fuga.rb
       -cacheFile s3n://path/to/… #…

2) In mapper.rb / reducer.rb, open it by filename
      File.open('…') {|f| ... }
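Step 2) can be sketched as below. A Tempfile stands in for the file that -file / -cacheFile drops into the task's working directory; the id values and filename are made up for illustration.

```ruby
require "tempfile"

# Stand-in for the file that -file / -cacheFile places next to the task;
# on the cluster you would File.open the distributed file's local name.
side_file = Tempfile.new("target_ids")
side_file.puts "13930"
side_file.puts "29011"
side_file.close

# Inside mapper.rb / reducer.rb: load the id list once, before reading logs
target_ids = File.open(side_file.path) { |f| f.readlines.map { |l| l.chomp.to_i } }
# target_ids => [13930, 29011]
```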
p target_ids.size # => 50000

ARGF.each do |log|
  log.chomp!
  id, type, *rest = log.split(/,/)
  next if target_ids.include?(id.to_i)
end

With 50,000 entries in target_ids, Array#include? scans the whole array for every log line, which is slow…
target_ids = [13930, 29011, 39291, ...] # 50,000 entries

Idea: split them into about 1,000 buckets keyed by the first three digits, roughly 50 ids each:

{
  '139' => [13930, 13989, 13991, ...], # ~50 ids
  '290' => [29011, 29098, 29076, ...], # ~50 ids
  '392' => [39291, 39244, 39251, ...], # ~50 ids
}

hash = Hash.new {|h, k| h[k] = [] }
target_ids.each do |id|
  hash[ id.to_s[0, 3] ] << id
end

ARGF.each do |log|
  log.chomp!
  id, type, *rest = log.split(/,/)
  next if hash[ id[0, 3] ].include?(id.to_i)
end
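The bucketing trick is runnable end to end with stand-in ids (the real list came from the distributed file). The first three digits pick a small bucket, so each lookup scans on the order of a hundred elements instead of 50,000.

```ruby
# 50,000 stand-in ids (5 digits each, so first-3-digit buckets hold ~100 ids)
target_ids = (10_000...60_000).to_a

# Bucket by the first three digits; include? then scans one tiny bucket
# instead of the whole 50,000-element array
hash = Hash.new { |h, k| h[k] = [] }
target_ids.each { |id| hash[id.to_s[0, 3]] << id }

# Hypothetical helper mirroring the filter inside the mapper loop
def target?(hash, id_str)
  hash[id_str[0, 3]].include?(id_str.to_i)
end

# target?(hash, "13930") => true
# target?(hash, "99999") => false
```

In modern Ruby, `require 'set'; targets = Set.new(target_ids)` gives constant-time `include?` without manual bucketing; the hash-of-arrays above preserves the slide's approach.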
• About Hadoop
• Using Hadoop Streaming
• Pitfalls we hit
Pitfall: with the S3 Native FileSystem we hit
java.net.SocketTimeoutException: Read timed out
(details: http://ow.ly/2bdW1)

→ Hadoop 0.21 / Amazon Elastic MapReduce