4. History of MapReduce
• First implemented by Google
• Used in CouchDB, Hadoop, etc.
• Helps to “distill” data into a concentrated result set
Tuesday, January 8, 13
6. What is MapReduce?
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]
input.map! { |x| [x, 1] }

sum = 0
input.each do |x|
  sum += x[1]
end
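The snippet above maps each word to a [word, 1] pair and sums the counts; a full word count also needs a shuffle step that groups pairs by key before reducing. A minimal plain-Ruby sketch of all three phases (the grouping and summing code is illustrative, not from the slides):

```ruby
# Word count as map -> shuffle (group by key) -> reduce, in plain Ruby.
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: emit a [word, 1] pair for each word.
pairs = input.map { |word| [word, 1] }

# Shuffle: group the pairs by their key (the word).
grouped = pairs.group_by { |word, _count| word }

# Reduce: sum the counts for each key.
counts = grouped.map { |word, ps| [word, ps.sum { |_w, c| c }] }.to_h

puts counts.inspect  # => {"deer"=>2, "bear"=>2, "river"=>2, "car"=>3}
```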
8. History of Hadoop
• Doug Cutting @ Yahoo!
• It is a Toy Elephant
• It is also a framework for distributed computing
• It is a distributed filesystem
11. Hadoop Cluster
NameNode
• Keeps track of the DataNodes
• Uses a “heartbeat” to determine a node’s health
• The most hardware resources should be allocated here
[Cluster diagram: a JobTracker and NameNode alongside racks of TaskTracker/DataNode machines on subnets 555.555.1.*, 555.555.2.*, and 444.444.1.*; a ♥ marks a heartbeat between a DataNode and the NameNode]
12. Hadoop Cluster
DataNode
• Stores filesystem blocks
• Can be scaled: nodes can be spun up or down
• Replicates blocks based on a set replication factor
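To make the replication factor concrete, here is a toy sketch of placing copies of a block on distinct DataNodes (the node list, block id, and placement rule are invented for illustration; real HDFS placement is rack-aware):

```ruby
# Illustrative only: choose `factor` distinct DataNodes to hold a block.
REPLICATION_FACTOR = 3
DATA_NODES = ["555.555.1.2", "555.555.1.3", "555.555.2.2", "444.444.1.2"]

def place_block(block_id, nodes, factor)
  # Rotate through the node list so replicas spread across the cluster.
  nodes.rotate(block_id % nodes.length).first(factor)
end

replicas = place_block(7, DATA_NODES, REPLICATION_FACTOR)
puts replicas.inspect  # three distinct node addresses for block 7
```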
13. Hadoop Cluster
JobTracker
• Decides which TaskTrackers should handle a MapReduce job
• Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source data lives
[Same cluster diagram; here the ♥ marks a heartbeat between the JobTracker and a TaskTracker]
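That locality preference can be sketched as choosing, among available TaskTrackers, the one whose address shares the longest prefix with the DataNode holding the data (a toy heuristic invented for this example; Hadoop actually compares rack topology, not IP strings):

```ruby
# Toy data-locality heuristic: prefer the TaskTracker "closest" to the data.
def shared_prefix_length(a, b)
  a.chars.zip(b.chars).take_while { |x, y| x == y }.length
end

def closest_tracker(trackers, data_node)
  trackers.max_by { |t| shared_prefix_length(t, data_node) }
end

trackers  = ["555.555.1.4", "555.555.2.9", "444.444.1.7"]
data_node = "555.555.2.3"
puts closest_tracker(trackers, data_node)  # => 555.555.2.9 (same subnet)
```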
14. Hadoop Cluster
TaskTracker
• Worker for MapReduce jobs
• The closer to the DataNode with the data, the better
19. Hadoop Streaming
Hadoop Ecosystem:
• Pig: Pig Latin
• Hive: SQL-ish
• Wukong: Ruby!
20. Wukong
• Infochimps
• Currently going through heavy development
• Use the 3.0.0.pre3 gem
  https://github.com/infochimps-labs/wukong/tree/3.0.0
• Model your jobs with wukong-hadoop
  https://github.com/infochimps-labs/wukong-hadoop
21. Wukong
Wukong
• Write mappers and reducers using Ruby
• As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks

wukong-hadoop
• A CLI to use with Hadoop
• Created around building tasks with Wukong
• Better than piping in the shell (you can see this with --dry_run)
22. Wukong Processors
• Fields are accessible through switches in the shell
• Local hand-off is made at STDOUT to STDIN

Wukong.processor(:mapper) do
  field :min_length, Integer,  :default => 1
  field :max_length, Integer,  :default => 256
  field :split_on,   Regexp,   :default => /\s+/
  field :remove,     Regexp,   :default => /[^a-zA-Z0-9']+/
  field :fold_case,  :boolean, :default => false

  def process string
    tokenize(string).each do |token|
      yield token if acceptable?(token)
    end
  end

  private

  def tokenize string
    string.split(split_on).map do |token|
      stripped = token.gsub(remove, '')
      fold_case ? stripped.downcase : stripped
    end
  end

  def acceptable? token
    (min_length..max_length).include?(token.length)
  end
end
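The mapper's tokenize-and-filter logic can be rehearsed without the Wukong gem by inlining the field defaults as constants (a plain-Ruby rewrite for illustration; fold_case is switched on here to show its effect):

```ruby
# Plain-Ruby rehearsal of the mapper's logic, without the Wukong gem.
SPLIT_ON = /\s+/
REMOVE   = /[^a-zA-Z0-9']+/
MIN_LEN  = 1
MAX_LEN  = 256

def tokenize(string, fold_case: true)
  string.split(SPLIT_ON).map do |token|
    stripped = token.gsub(REMOVE, '')
    fold_case ? stripped.downcase : stripped
  end
end

tokens = tokenize("D'oh! Why you little...").select do |t|
  (MIN_LEN..MAX_LEN).include?(t.length)
end
puts tokens.inspect  # => ["d'oh", "why", "you", "little"]
```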
23. Wukong Processors
Wukong.processor(:reducer, Wukong::Processor::Accumulator) do
  attr_accessor :count

  def start record
    self.count = 0
  end

  def accumulate record
    self.count += 1
  end

  def finalize
    yield [key, count].join("\t")
  end
end
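Hadoop Streaming hands the reducer its input sorted by key, so an Accumulator like the one above just counts until the key changes and then emits. A plain-Ruby sketch of that start/accumulate/finalize cycle (the sample tokens are made up):

```ruby
# Simulate the accumulator over key-sorted mapper output.
sorted_tokens = ["bear", "bear", "car", "car", "car", "deer", "deer"]

lines = sorted_tokens.chunk_while { |a, b| a == b }.map do |run|
  # start: count = 0; accumulate: +1 per record; finalize: emit key<TAB>count
  [run.first, run.length].join("\t")
end

puts lines  # one "word<TAB>count" line per distinct key
```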
24. Wukong Processors
wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
  --mode=local \
  --input="/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub"
Simpsons - Ep 8
do 7
Doctor 1
Does 2
doesn't 1
dog 2
D'oh 1
doif 1
doing 2
done 1
doneYou 1
don't 10
Don't 1
25. The End
Thank you!
@tomeara
ted@tedomeara.com