Introduction to Hadoop

  1. What is MapReduce?
     • Originated at Google (OSDI '04: Operating Systems Design and Implementation)
     • A simple programming model for data processing
     • For processing large datasets
  2. MapReduce features
     • Parallel
     • Runs on commodity hardware
     • Fault tolerant
  3. Three phases of MapReduce
     • Map
     • Shuffle
     • Reduce
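The three phases can be imitated in a single process. The sketch below is only an illustration, not Hadoop code: `run_mapreduce`, `word_map`, and `count_reduce` are hypothetical names, and the word-count job is an assumed example that does not appear in the slides.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce over (key, value) records in one process."""
    # Map phase: every input pair may produce zero or more intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key and its list of values is reduced independently.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_map(_offset, line):
    # Hypothetical mapper: emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)


def count_reduce(word, counts):
    # Hypothetical reducer: add up the counts collected for one word.
    yield (word, sum(counts))


result = run_mapreduce([(0, "a b a"), (1, "b c")], word_map, count_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the grouping step is the part this sketch keeps.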
  4. Map example
     • Let map(k, v) =
         foreach char c in v:
           emit(k, c)
     • ("A", "cats") -> ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
  5. Second map example
     • Let map(k, v) =
         emit(k.toUpper(), v.toUpper())
     • ("foo", "bar") -> ("FOO", "BAR")
     • ("Foo", "other") -> ("FOO", "OTHER")
  6. Third map example
     • Let map(k, v) =
         if (isPrime(v)) then emit(k, v)
     • ("foo", 7) -> ("foo", 7)
     • ("test", 10) -> (nothing)
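The three map examples translate almost line for line into Python generators, where `yield` plays the role of `emit`. This is a minimal sketch; the function names and the trial-division `is_prime` helper are assumptions made for illustration.

```python
def map_chars(k, v):
    # One-to-many: emit one pair per character of the value.
    for c in v:
        yield (k, c)


def map_upper(k, v):
    # One-to-one: transform both key and value.
    yield (k.upper(), v.upper())


def is_prime(n):
    # Simple trial division, sufficient for small illustrative inputs.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def map_primes(k, v):
    # One-to-zero-or-one: act as a filter, emitting nothing for non-primes.
    if is_prime(v):
        yield (k, v)


print(list(map_chars("A", "cats")))   # [('A', 'c'), ('A', 'a'), ('A', 't'), ('A', 's')]
print(list(map_upper("Foo", "other")))  # [('FOO', 'OTHER')]
print(list(map_primes("test", 10)))   # []
```

Together the three cases show that a mapper may emit many pairs, exactly one pair, or none for a given input.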
  7. Reduce example
     • Let reduce(k, vals) =
         sum = 0
         foreach int v in vals:
           sum += v
         emit(k, sum)
     • ("A", [42, 100, 312]) -> ("A", 454)
     • ("B", [12, 6, -2]) -> ("B", 16)
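The summing reducer can be written the same way in Python. In this sketch `reduce_sum` is a hypothetical name, and `vals` stands for the list of values that the shuffle has already grouped under one key.

```python
def reduce_sum(k, vals):
    # Fold all values for one key into a single total and emit it once.
    total = 0
    for v in vals:
        total += v
    yield (k, total)


print(list(reduce_sum("A", [42, 100, 312])))  # [('A', 454)]
print(list(reduce_sum("B", [12, 6, -2])))     # [('B', 16)]
```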
  8. The InputFormat interface
     • Two methods:
       • getSplits: how to split the input data
       • getRecordReader: how to read the input data
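The division of labour between the two methods can be shown with a toy Python analogue. This is not the Hadoop Java interface: `ListInputFormat`, `get_splits`, and `get_record_reader` are hypothetical names, and the input is assumed to be an in-memory list of lines rather than files.

```python
class ListInputFormat:
    """Toy analogue of an input format over an in-memory list of lines."""

    def __init__(self, lines):
        self.lines = lines

    def get_splits(self, num_splits):
        """Decide how to divide the input: return (start, end) index ranges."""
        n = len(self.lines)
        size = max(1, -(-n // num_splits))  # ceiling division, at least 1
        return [(start, min(start + size, n)) for start in range(0, n, size)]

    def get_record_reader(self, split):
        """Decide how to read one split: yield (line_index, line) records."""
        start, end = split
        for i in range(start, end):
            yield (i, self.lines[i])


fmt = ListInputFormat(["a", "b", "c", "d", "e"])
splits = fmt.get_splits(2)
print(splits)                                   # [(0, 3), (3, 5)]
print(list(fmt.get_record_reader(splits[1])))   # [(3, 'd'), (4, 'e')]
```

Each split would be handed to one map task, which then pulls its (key, value) records from the reader.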
  9. Calculating the number of map tasks
     • goalSize = totalSize / mapred.map.tasks
     • mapred.map.tasks is defined in the job configuration and is only a hint
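A worked calculation helps show why the configured value is only a hint. The sketch below assumes the split-size rule commonly described for the classic file-based input formats, splitSize = max(minSize, min(goalSize, blockSize)); the block size and minimum split size are assumptions that the slide does not mention, and `estimate_map_tasks` is a hypothetical helper.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(total_size, requested_maps, block_size, min_split_size=1):
    """Return (split_size, estimated_map_tasks) under the assumed rule."""
    # goalSize from the slide: total input divided by the requested map count.
    goal_size = total_size // max(requested_maps, 1)
    # Assumed clamp: a split never exceeds one block, nor drops below the minimum.
    split_size = max(min_split_size, min(goal_size, block_size))
    return split_size, math.ceil(total_size / split_size)


# Requesting 2 maps for 1 GiB with 64 MiB blocks still yields 16 splits.
print(estimate_map_tasks(1024 * MB, 2, 64 * MB))   # (67108864, 16)
# Requesting 32 maps shrinks the split below the block size.
print(estimate_map_tasks(1024 * MB, 32, 64 * MB))  # (33554432, 32)
```

Under this assumption, asking for fewer maps than there are blocks has no effect, while asking for more maps makes the splits smaller.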
  10. Number of reduces
     • 0.95 or 1.75? Both are multipliers applied to (number of nodes × maximum reduce tasks per node).
     • At 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.
     • At 1.75, the faster nodes finish their first round of reduces and launch a second round, which balances the load much better.
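The two factors can be turned into a quick calculation. This sketch assumes the factors multiply the cluster's total reduce capacity (nodes × reduce slots per node); `suggested_reduces` is a hypothetical helper, and rounding to the nearest integer is an illustrative choice.

```python
def suggested_reduces(nodes, reduce_slots_per_node, factor):
    """Scale the cluster's total reduce capacity by the chosen factor."""
    capacity = nodes * reduce_slots_per_node
    return round(factor * capacity)


# A hypothetical cluster of 10 nodes with 2 reduce slots each.
print(suggested_reduces(10, 2, 0.95))  # 19: fits in a single wave of 20 slots
print(suggested_reduces(10, 2, 1.75))  # 35: needs a second wave on faster nodes
```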
  11. What is HDFS?
     • Again originated at Google (SOSP '03: Symposium on Operating Systems Principles)
     • Redundant storage of massive amounts of data on cheap, unreliable computers
  12. HDFS features
     • Files are stored as blocks
     • Reliability through replication
     • A single master, the NameNode (NN), coordinates access and metadata
     • No data caching
     • Familiar interface
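Block storage and replication can be pictured with a small model. This is a toy sketch under loud assumptions: the round-robin placement is invented for illustration and is not the placement policy HDFS uses, and the datanode names, sizes, and function names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for b in range(num_blocks):
        # Consecutive offsets modulo the node count never repeat a node.
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement


blocks = split_into_blocks(200, 64)
print(blocks)  # [64, 64, 64, 8]
print(place_replicas(2, ["dn1", "dn2", "dn3", "dn4"], 3))
# {0: ['dn1', 'dn2', 'dn3'], 1: ['dn2', 'dn3', 'dn4']}
```

Because every block lives on several nodes, losing one node leaves at least one other copy of each block it held.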
  13. NN SPOF and failure resistance
     • Store metadata in several places (local disk / shared storage)
     • Secondary NN
       • Merges the edit log with the fsimage
       • Reduces recovery time
     • NN HA
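The merge performed by the Secondary NN can be modelled as replaying a log onto a snapshot. This is a minimal sketch, assuming the fsimage is a plain dictionary and the edit log is a list of (operation, path, value) tuples; the real on-disk formats are different, and `checkpoint` is a hypothetical name.

```python
def checkpoint(fsimage, edit_log):
    """Apply logged edits to a copy of the image; return (new_image, empty_log)."""
    new_image = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown op: {op}")
    # After the merge the log can start empty, so a restart replays fewer edits.
    return new_image, []


image = {"/a": 1}
log = [("create", "/b", 2), ("delete", "/a", None)]
new_image, new_log = checkpoint(image, log)
print(new_image)  # {'/b': 2}
print(new_log)    # []
```

A shorter edit log after each checkpoint is what reduces recovery time: on restart there is less to replay on top of the saved image.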
  14. Resources and events
     • http://class10e.com/Cloudera/
     • http://blog.cloudera.com/blog/
     • Hadoop Summit: http://hadoopsummit.org/
     • Hadoop World: http://www.hadoopworld.com/
