Introduction to Hadoop and Using It in the Cloud


  1. Hadoop! 2010/05/16 naoki yanai id:yanaoki
  2. Hadoop Hadoop (Elastic MapReduce)
  3. naoki yanai (id:yanaoki) Web Hadoop m m iPhone Ruby Java
  4. Hadoop
  5. Hadoop Java Apache
  6. Hadoop Google 2004 MapReduce http://labs.google.com/papers/mapreduce.html Google File System (GFS) http://labs.google.com/papers/gfs.html 2010 Google
  7. Hadoop Web →
  8.
  9. Hadoop
  10. Hadoop Yahoo Yahoo Hadoop Facebook Amazon
  11. Hadoop RDBMS Join mapreduce join SQL Hadoop
  12. Hadoop MapReduce web HDFS RDB Web Hadoop
  13. Hadoop MapReduce web HDFS RDB Web Hadoop
  14. Hadoop N Hadoop
  15. Hadoop MapReduce HDFS Hadoop MapReduce HDFS
  16. MapReduce → map → reduce → map reduce hadoop key-value Hadoop
  17. HDFS Hadoop MapReduce
  18. MapReduce (diagram): the master runs MR:JobTracker and each slave runs MR:TaskTracker; labels: (Job), (map reduce)
  19. HDFS (diagram): the master runs HDFS:NameNode and each slave runs HDFS:DataNode
  20. Hadoop MapReduce HDFS (diagram): the master runs MR:JobTracker and HDFS:NameNode; each slave runs MR:TaskTracker and HDFS:DataNode. Hadoop HDFS MapReduce map reduce JobTracker map reduce
  21. MapReduce (diagram): input AA, AB, BC → map → reduce → output A 3, B 2, C 1
  22. MapReduce Example Google
        map(String key, String value):
          // key: document name
          // value: document contents
          for each word w in value:
            EmitIntermediate(w, "1");
        reduce(String key, Iterator values):
          // key: a word
          // values: a list of counts
          int result = 0;
          for each v in values:
            result += ParseInt(v);
          Emit(AsString(result));
  23. MapReduce (diagram): input (HDFS): AA, AB, BC → map: AA → A:1 A:1; AB → A:1 B:1; BC → B:1 C:1 → shuffle (sort): A:<1,1,1>, B:<1,1>, C:<1> → reduce: A:3, B:2, C:1 → output (HDFS)
  24. MapReduce Google
  25. Hadoop Mahout Hadoop Apache CollaborativeFiltering Classifier Clustering DecisionForest
  26. Hadoop
  27. Hadoop
  28. Amazon Web Service EC2
  29. Amazon Web Service WebAPI
  30. Amazon Web Service: EC2 (Elastic Compute Cloud) root/admin; S3 (Simple Storage Service); EMR (Elastic MapReduce) Web Hadoop → MapReduce EC2 S3 +α
  31. Elastic MapReduce Hadoop Hadoop input output S3
  32. Elastic MapReduce Amazon
  33. Elastic MapReduce (diagram): client → API → master (Job) → slaves in the cloud; input/output on S3
  34. Elastic MapReduce MapReduce
  35. Elastic MapReduce Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294 Item
  36. Elastic MapReduce map/reduce map/reduce input http://www.grouplens.org/ 5
  37. Elastic MapReduce input S3 [ ID] [ ID] [ ] map/reduce output S3 [ ID] [ ID] [ ]
  38. Elastic MapReduce (diagram): a chain of map and reduce tasks
  39. Elastic MapReduce step1: input key:[] value:[ ID_ ID_ ] map ID key:[ ID] values[ ID_ ] reduce ID output ID \t ID_ | ID_ |...
  40. Elastic MapReduce step2: input key:[ ID] value:[ ID_ | ID_ |...] ID map key:[ IDx_ IDy] values[ x_ y] ID reduce output IDx_ _ IDy
  41. Elastic MapReduce step3: input IDx_ _ IDy IDx_(1- ) key map map map key: < IDx_(1- )> values < IDy> reduce 1- output IDx_ IDy_
  42. Elastic MapReduce
  43. Elastic MapReduce 1
        elastic-mapreduce --create --name "item similarity job" --alive --log-uri s3n://bucket /logs --num-instances 10 --instance-type m1.small --availability-zone us-west-1a
  44. EC2 EC2
  45. Elastic MapReduce WAITING
  46. Elastic MapReduce 2 S3 (s3cmd) input map/reduce python
        s3cmd.rb put bucket :input/input.tsv input.tsv
        s3cmd.rb put bucket :script/map.py map1.py
        s3cmd.rb put bucket :script/reduce1.py reduce1.py
        ...
  47. Elastic MapReduce 4 Job
        elastic-mapreduce --job-flow-id j-2ROU0QKL6KOV6 --json item_similarity.json
  48. Elastic MapReduce Step1 RUNNING
  49. Elastic MapReduce 5 output
        s3sync.rb -r --make-dirs bucket :output .
        elastic-mapreduce --terminate --job-flow-id j-2ROU0QKL6KOV6
  50. Hadoop x
  51. Hadoop Tom White ( ) ( ) ( ) ¥4,830
  52.
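
The counting example on slides 21 to 23 (input AA, AB, BC; output A 3, B 2, C 1) can be sketched in plain Python to show the map, shuffle and reduce phases in one place. This is only an in-memory illustration, not Hadoop code: the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are hypothetical, and treating each character of a record as one "word" is an assumption made to match the slide's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Emit (symbol, 1) for every symbol in one input record."""
    for symbol in value:
        yield symbol, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key and sort the keys,
    standing in for the framework's shuffle (sort) phase."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the list of counts collected for one key."""
    return key, sum(values)


def run_job(records: List[str]) -> Dict[str, int]:
    """Run map over every record, shuffle, then reduce every group."""
    intermediate = [
        pair
        for index, record in enumerate(records)
        for pair in map_fn(str(index), record)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["AA", "AB", "BC"]))  # {'A': 3, 'B': 2, 'C': 1}
```

In a real cluster the map calls run on different TaskTrackers and the grouping happens across the network; here everything runs in one process so the data flow is easy to follow.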

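The three chained steps on slides 39 to 41 can be approximated with the in-memory sketch below. Most slide labels did not survive, so several things here are assumptions rather than the author's method: the input fields are taken to be (user ID, item ID, rating), the similarity measure is taken to be Pearson correlation, and all function names are hypothetical. Step 3 sorts on `1 - similarity`, mirroring the `IDx_(1- )` key on slide 41, so that an ascending sort lists the most similar items first.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (user_id, item_id, rating): assumed layout


def step1_group_by_user(rows: List[Row]) -> Dict[str, List[Tuple[str, float]]]:
    """Step 1: collect every (item, rating) a user produced under that user."""
    by_user: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for user, item, rating in rows:
        by_user[user].append((item, rating))
    return dict(by_user)


def step2_pair_ratings(
    by_user: Dict[str, List[Tuple[str, float]]]
) -> Dict[Tuple[str, str], List[Tuple[float, float]]]:
    """Step 2: for each pair of items rated by the same user,
    collect the two ratings under the key (item_x, item_y)."""
    pairs: Dict[Tuple[str, str], List[Tuple[float, float]]] = defaultdict(list)
    for ratings in by_user.values():
        for (ix, rx), (iy, ry) in combinations(sorted(ratings), 2):
            pairs[(ix, iy)].append((rx, ry))
    return dict(pairs)


def pearson(xy: List[Tuple[float, float]]) -> float:
    """Pearson correlation of paired ratings (assumed similarity measure)."""
    n = len(xy)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in xy) / n
    mean_y = sum(y for _, y in xy) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in xy)
    var_x = sum((x - mean_x) ** 2 for x, _ in xy)
    var_y = sum((y - mean_y) ** 2 for _, y in xy)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def step3_rank(
    pairs: Dict[Tuple[str, str], List[Tuple[float, float]]]
) -> List[Tuple[str, str, float]]:
    """Step 3: sort on (item_x, 1 - similarity) so the best match comes first."""
    keyed = [(ix, 1.0 - pearson(xy), iy) for (ix, iy), xy in pairs.items()]
    keyed.sort()
    return [(ix, iy, 1.0 - distance) for ix, distance, iy in keyed]


if __name__ == "__main__":
    rows = [
        ("u1", "a", 5.0), ("u1", "b", 4.0), ("u1", "c", 1.0),
        ("u2", "a", 4.0), ("u2", "b", 5.0), ("u2", "c", 2.0),
        ("u3", "a", 1.0), ("u3", "b", 2.0), ("u3", "c", 5.0),
    ]
    for item_x, item_y, score in step3_rank(step2_pair_ratings(step1_group_by_user(rows))):
        print(item_x, item_y, round(score, 3))
```

Each function stands in for one map/reduce step; on Elastic MapReduce the same stages would run as separate Hadoop Streaming scripts, with S3 holding the intermediate output. The sketch emits each item pair once (item_x before item_y), which is a simplification.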