Introduction to Data Analysis Technology (Hadoop Edition)

Published in: Technology, Business

  1. ( & Hadoop) April 12, 2013, Takumi Asai
  2. (26) – H21 H23 NTT Communications IP – H23 NTT – twitter: @p_i_o4545 – blog: http://pioneerinocean.hatenablog.com/ • • R Hadoop ( ) – •
  3. ( : 4/12) Hadoop ( : ) R Ruby R
  4. / / / ⇒ Wikipedia
  5. =
  6. 21 ( ) ⇒ Google, Facebook
  7. 1000 D R
  8. VS IT RDBMS SPSS R IT
  9. VS FSP Web FSP TESCO
  10. VS WinWin
  11. Hadoop – Apache Java – Google MapReduce, Google File System (GFS) • Google
  12. Hadoop – HDFS MapReduce – HBase HDFS – Google GFS – MapReduce – Google MapReduce – Key-Value Java
  13. HDFS: Namenode, 2Namenode, Datanode 3 [Diagram: one Name Node, one Secondary Name Node, six Data Nodes]
  14. HDFS • HDFS (64MB) [Diagram: a 150 MB file (abcdefg hijklmn opqrstu vwxyz) split into Block1 (64 MB), Block2 (64 MB), Block3 (22 MB)]
  15. HDFS – – – [Diagram: Block1 (64 MB), Block2 (64 MB), Block3 (22 MB); Data Node A has 1,2; Data Node B has 2,3; Data Node C has 1,3; Data Node D has 1; Data Node E has 2,3]
  16. Namenode (NN) – Namenode – HDFS – – Datanode (DN) – – blk_xxxxxx – [Diagram: Name Node, Secondary Name Node, Data Node]
  17. Secondary Namenode (2NN) – 2NN Namenode – Namenode – • 32NN NN – Namenode – Namenode • – 2NN
  18. Namenode! – Namenode HDFS – NN 2NN – HDFS – –
  19. HDFS [Diagram: Name Node (Active), Name Node (Standby), six Data Nodes] Standby 2NN 2NN
  20. HDFS – Datanode – Datanode Namenode – Namenode – Namenode ⇔ Datanode, Datanode ⇔ Datanode • • • Linux – ls, cat – rwx • x HDFS
  21. MapReduce – – – Map/Reduce 2 – Map/Reduce, Mapper/Reducer – Map, Reduce, Shuffle
  22. MapReduce HDFS [Diagram: one Job Tracker, six Task Trackers] JobTracker TaskTracker
  23. [Diagram: Name Node and Job Tracker; Secondary Name Node; six nodes, each labeled Data Node and Task Tracker] ※ HDFS ※ MapReduce
  24. MapReduce YARN – HDFS MapReduce – YARN (MapReduce Ver2) – MapReduce – YARN – YARN
  25. MapReduce WordCount – MapReduce (Hello World)
      Input: Hello Hadoop Goodbye World Hello Goodbye World World Hadoop
      Map: <Hello,1> <Hadoop,1> <Goodbye,1> <World,1> <Hello,1> <Goodbye,1> <World,1> <World,1> <Hadoop,1>
      Shuffle: <Goodbye,[1,1]> <Hadoop,[1,1]> <Hello,[1,1]> <World,[1,1,1]>
      Reduce: <Goodbye,2> <Hadoop,2> <Hello,2> <World,3>
  26. MapReduce Mapper Reducer – – – – HDFS " " [Diagram: Map, reduce, Map, reduce, Map]
  27. MapReduce – WordCount – Map Reduce – • fizz buzz fizzbuzz fizz – Ruby Ruby – Map #{ }\t1 OK – Reduce
  28. MapReduce [Example: input hdfs Hadoop Hadoop Mapred Mapred Mapred Hadoop Mapred; output Hadoop 3, hdfs 1, Mapred 4] – OK • #{ }\t#{ } – cat test.txt | ruby map.rb | sort | ruby reduce.rb • Hadoop
  29. MapReduce: Map [Example: input hdfs Hadoop Hadoop Mapred Mapred Mapred Hadoop Mapred; output hdfs 1, Hadoop 1, Hadoop 1, Mapred 1, Mapred 1]
  30. Map
      #!/usr/bin/env ruby
      # Emit "<word>\t1" for every whitespace-separated word read from stdin.
      STDIN.each_line do |line|
        line.split.each do |word|
          puts "#{word}\t1"
        end
      end
  31. Reduce
      # Sum the counts per word from "<word>\t<count>" lines read from stdin.
      wordhash = Hash.new(0)
      STDIN.each_line do |line|
        word, count = line.strip.split
        next if word.nil?  # skip blank lines
        wordhash[word] += count.to_i
      end
      wordhash.each { |record, count| puts "#{record}\t#{count}" }
  32. Hadoop – – Java OK – •
  33. Hadoop
  34. Hadoop – • Pig • Hive – • Sqoop – • Mahout – Hadoop • Whirr etc…
  35. Hadoop – HDFS • RAID • – HDFS MapReduce • Amazon S3 – • – • – •
  36. (Hadoop) – RDB – – Hive Pig – –
  37. (Hadoop) – – HDD – MapReduce – – Hadoop
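
The block arithmetic on slides 14 and 15 can be checked with a short Ruby sketch. It is not part of the original deck: the names `split_into_blocks`, `replica_counts`, and `PLACEMENT` are hypothetical, and the code only models the numbers shown on the slides (a 150 MB file, 64 MB blocks, and the listed Data Node contents). It does not talk to a real HDFS cluster.

```ruby
# Hypothetical helper: sizes (in MB) of the blocks a file would be cut into.
# The 64 MB default mirrors the block size shown on slide 14.
def split_into_blocks(size_mb, block_size_mb = 64)
  full, rest = size_mb.divmod(block_size_mb)
  blocks = Array.new(full, block_size_mb)
  blocks << rest if rest > 0   # the last block holds only the remainder
  blocks
end

# Block placement copied from slide 15: Data Node name => block numbers held.
PLACEMENT = {
  "A" => [1, 2],
  "B" => [2, 3],
  "C" => [1, 3],
  "D" => [1],
  "E" => [2, 3]
}.freeze

# Count how many Data Nodes hold a copy of each block.
def replica_counts(placement)
  counts = Hash.new(0)
  placement.each_value { |blocks| blocks.each { |b| counts[b] += 1 } }
  counts
end

p split_into_blocks(150)     # block sizes in MB
p replica_counts(PLACEMENT)  # block number => number of copies
```

For the 150 MB file this yields two full 64 MB blocks plus a 22 MB remainder, and every block in the slide 15 layout turns out to be held by three Data Nodes.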

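The Map, Shuffle, and Reduce stages of the WordCount example on slide 25 can also be traced in memory with plain Ruby. This is a minimal sketch, not Hadoop code: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical, and splitting the slide's input into three lines is an assumption, since the transcript does not preserve line breaks.

```ruby
# Assumed line split of the slide 25 input (the transcript shows one run of words).
INPUT = [
  "Hello Hadoop Goodbye World",
  "Hello Goodbye World",
  "World Hadoop"
].freeze

# Map: emit a [word, 1] pair for every whitespace-separated word.
def map_phase(lines)
  lines.flat_map { |line| line.split.map { |word| [word, 1] } }
end

# Shuffle: collect the values emitted for each key and order the keys.
def shuffle_phase(pairs)
  pairs.group_by(&:first)
       .sort_by { |word, _| word }
       .map { |word, kvs| [word, kvs.map(&:last)] }
end

# Reduce: sum the collected values for each key.
def reduce_phase(grouped)
  grouped.map { |word, ones| [word, ones.sum] }
end

reduce_phase(shuffle_phase(map_phase(INPUT))).each do |word, count|
  puts "#{word}\t#{count}"
end
```

In the local test pipeline from slide 28, `cat test.txt | ruby map.rb | sort | ruby reduce.rb`, the `sort` command stands in for the shuffle step that `shuffle_phase` models here.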