Introduction to MapReduce and Hadoop
What to expect!
● History.
● What is Hadoop.
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of a MapReduce program.
Hadoop Series Resources
How was Hadoop born?
Doug Cutting
Challenges of Distributed Processing of Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
Hadoop tackles these challenges!
So, what’s Hadoop?
What is Hadoop?
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple
Hadoop vs SQL
● Structured and Unstructured data.
● Datastore and Data Analysis.
● Scale-out and Scale-up.
● Offline batch processing and Online transactions.
Hadoop Uses
MapReduce
What is MapReduce?...
● Parallel programming model for clusters of
commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
MapReduce … Map then Reduce
Keys and Values
● Key/Value pairs.
● Keys divide Reduce Space.

         Input            Output
Map      <k1, v1>         list(<k2, v2>)
Reduce   <k2, list(v2)>   list(<k3, v3>)

In WordCount, for example: k1 = byte offset in the file, v1 = the line of text, k2 = a word, v2 = 1, and k3/v3 = the word and its total count.
WordCount in Action
Input:
foo.txt: “This is the foo file”
bar.txt: “And this is the bar one”

Map output: <this, 1>, <is, 1>, <the, 1>, <foo, 1>, <file, 1>, <and, 1>, <this, 1>, <is, 1>, <the, 1>, <bar, 1>, <one, 1>

Reduce#1: Input: this, [1, 1]   Output: this, 2
Reduce#2: Input: is, [1, 1]     Output: is, 2
Reduce#3: Input: foo, [1]       Output: foo, 1
...

Final output: this 2, is 2, the 2, foo 1, file 1, and 1, bar 1, one 1
WordCount with MapReduce
map(String filename, String document) {
  List<String> T = tokenize(document);
  for each token in T {
    emit((String)token, (Integer)1);
  }
}

reduce(String token, List<Integer> values) {
  Integer sum = 0;
  for each value in values {
    sum = sum + value;
  }
  emit((String)token, (Integer)sum);
}
Hadoop Building Blocks
How does Hadoop work?...
Hadoop Building Blocks
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
HDFS: NameNode and DataNodes
JobTracker and TaskTracker
Typical Hadoop Cluster
Running Hadoop
Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode (“cluster of one”).
3. Fully distributed mode.
An Action
Running Hadoop on Local Machine
Actions ...
1. Installing Hadoop.
2. Configuring Hadoop (Pseudo-distributed mode).
3. Running WordCount example.
4. Web-based cluster UI.
HDFS
1. HDFS is a filesystem designed for large-scale distributed data processing.
2. HDFS isn’t a native Unix filesystem.
Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs -ls
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -copyFromLocal
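The same basic file operations are available from Java. A minimal sketch using Hadoop’s FileSystem API (the paths and class name are made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name / fs.defaultFS from the Hadoop config files
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/chuck"));                        // hadoop fs -mkdir /user/chuck
        fs.copyFromLocalFile(new Path("example.txt"),              // hadoop fs -copyFromLocal
                             new Path("/user/chuck/example.txt"));

        for (FileStatus status : fs.listStatus(new Path("/user/chuck"))) {  // hadoop fs -ls
            System.out.println(status.getPath());
        }
    }
}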
Anatomy of a MapReduce program
MapReduce and beyond
Hadoop
1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
a. InputFormat
b. OutputFormat
Anatomy of a MapReduce program
Hadoop Data Types
● Certain defined way of serializing key/value pairs.
● Values should implement the Writable interface.
● Keys should implement the WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable.
o ...
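As a sketch of what implementing these interfaces involves, here is a hypothetical key type (the class name and field are invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: wraps a single int so it can be used as a MapReduce key
public class WordLengthKey implements WritableComparable<WordLengthKey> {
    private int length;

    public void set(int length) { this.length = length; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(length);              // serialize the key
    }

    public void readFields(DataInput in) throws IOException {
        length = in.readInt();             // deserialize in the same field order
    }

    public int compareTo(WordLengthKey other) {
        return Integer.compare(length, other.length);  // keys are sorted before reduce
    }
}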
Mapper
Mapper
1. Mapper<K1,V1,K2,V2>
2. Override method:
void map(K1 key, V1 value, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
WordCount Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
Predefined Mappers
Reducer
Reducer
1. Extends Reducer<K1,V1,K2,V2>
2. Overrides method:
void reduce(K2, Iterable<V2>, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
WordCount Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
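The slides don’t show the driver that wires the Mapper and Reducer into a job. A minimal sketch, assuming the Map and Reduce classes above are nested in a WordCount class, and using the newer Job.getInstance API (older Hadoop 1.x code would call new Job(conf, name) instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.Map.class);       // the Mapper shown above
        job.setReducerClass(WordCount.Reduce.class);   // the Reducer shown above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}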
Predefined Reducers
Partitioner
Partitioner
The partitioner decides which key goes where.
class WordSizePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text word, IntWritable count, int numOfPartitions) {
    return 0; // placeholder: sends every key to partition 0
  }
}
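The slide leaves the getPartition body as a stub. To actually partition by word size, as the class name suggests, one possible implementation (an assumption, not from the slides) is:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        // Words of the same length land in the same partition, i.e. the same reducer
        return word.toString().length() % numOfPartitions;
    }
}

A custom partitioner is registered on the job with job.setPartitionerClass(WordSizePartitioner.class).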
Combiner
Combiner
It’s a local Reduce task at the Mapper.
WordCount Mapper Output:
1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, …
2. With Combiner: <the, 2>, <file, 2>, ...
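Enabling a combiner is one line in the driver sketch above. WordCount’s own Reduce class can serve as the combiner, because summing is associative and commutative:

job.setCombinerClass(WordCount.Reduce.class);  // run a local reduce on each mapper’s output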
Reading and Writing
Reading and Writing
1. Input data usually resides in large files.
2. MapReduce’s processing power is the splitting of the input data into chunks (InputSplit).
3. Hadoop’s FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
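A minimal sketch of that random read access (the path is a made-up example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() returns an FSDataInputStream, which supports seek()
        FSDataInputStream in = fs.open(new Path("/user/chuck/example.txt"));
        in.seek(128);                        // jump to an arbitrary byte offset
        byte[] buffer = new byte[64];
        int n = in.read(buffer);             // read up to 64 bytes from that offset
        System.out.println("Read " + n + " bytes");
        in.close();
    }
}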
InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o key<TAB>value => <key, value>
● NLineInputFormat
o <offset, nLines>
You can define your own InputFormat class ...
OutputFormat
1. The output has no splits.
2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer.
Predefined OutputFormat classes:
> TextOutputFormat: <k, v> => k<TAB>v
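Both sides are configured on the job. A sketch of the driver lines (the class choices are just examples; KeyValueTextInputFormat lives under org.apache.hadoop.mapreduce.lib.input in newer Hadoop versions):

// In the driver: pick how input is split and parsed, and how output is written
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);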
Recap
END OF SESSION #1
Q
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation.

This session was given in Arabic, and I may provide a video for the session soon.
Speaker notes:
  • https://sites.google.com/site/hadoopintroduction/home/what-is-hadoop
  • Lucene is a full-featured text indexing and searching library.
    Nutch was trying to build a complete web search engine with Lucene; it has a web crawler, an HTML parser, and so on.
    Problem: There are billions of web pages out there! What can poor Nutch do?
    > Google announced GFS and MapReduce in 2004; they said they were using these techniques in their search engine … really? :/ <
    Doug and his team used these techniques for Nutch, and then Hadoop was born.

    Doug Cutting
  • Challenges in processing Large Data in a distributed way.
  • Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).

    Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

    Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

    Simple—Hadoop allows users to quickly write efficient parallel code.


    Hadoop in Action section 1.2
  • REF:
    https://sites.google.com/site/hadoopintroduction/home/comparing-sql-databases-and-hadoop
  • REF:
    https://developer.yahoo.com/hadoop/tutorial/module4.html
  • Table from “Hadoop In Action”
    Images source:
    https://developer.yahoo.com/hadoop/tutorial/module4.html
  • Pseudo-code for map and reduce functions for word counting
    Source: Hadoop In Action
  • We now know a general overview about mapreduce, let’s see how hadoop works
  • Hadoop In Action Figure 2.1
  • Local (standalone) mode.
    No HDFS.
    No Hadoop Daemons.
    Debugging and testing the logic of MapReduce program.
    Pseudo-distributed mode.
    All daemons running on a single machine.
    Debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
    Fully distributed mode.
  • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  • This slide is initially left blank.
  • https://developer.yahoo.com/hadoop/tutorial/module4.html
  • This slide is initially left blank.
  • When the reducer task receives the output from the various mappers, it sorts the
    incoming data on the key of the (key/value) pair and groups together all values of
    the same key.