Hadoop Installation &
MapReduce Programming
CS267 - Data Mining & Machine Learning
-Kuldeep Dhole
WHW
Why: To be able to deal with Big Data Mining.
How: By learning Hadoop & MR programming
What: Hadoop Installation, HDFS basics, & MR
programming for Hadoop
Hadoop Installation
Amazon EC2 cloud - Cloudera's Hadoop Installation:
https://www.dropbox.com/s/s8zc3iwlq936hak/Amazon_Cloudera_Hadoop.pdf
Hadoop Components
- HDFS (Hadoop Distributed File System)
- MapReduce Model
HDFS Shell
CLUSTER / LOCAL MACHINE
File system of the local OS (Linux, Windows, etc.), e.g. /home/user1:
> ls -l
> mv f1 f2
> cp f1 f2
HDFS (e.g. /tmp) has its own shell commands:
> hadoop fs -ls
> hadoop fs -mv hdfs_f1 hdfs_f2
> hadoop fs -cp hdfs_f1 hdfs_f2
- You need to transfer data: LOCAL FS <-> HDFS
- The same concept applies to all machines in the cluster, and the Hadoop realm on all machines is kept in sync.
MapReduce Concept
- Programming Model for Distributed Parallel
Computing.
- Used on scalable commodity hardware cluster.
- Can process Big Data (100s of GBs, TBs)
- Based on Key-Value structure.
- Parallel MAP tasks, which emit <K, V> data
- Parallel REDUCE tasks, which process <K, V[ ]> data
MapReduce Model
[Diagram: Map tasks M1-M4 each take <K, V> input and emit <K, V> pairs; the Sort, Merge & Shuffle phase groups them into <K1, V[ ]> ... <K4, V[ ]> lists; Reduce tasks R1-R4 consume these lists and emit the final <K, V> output.]
MapReduce Model In Brief
(K1, V1) -> MAP -> List(K2, V2)
(K2, List(V2)) -> REDUCE -> List(K3, V3)
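The two steps above can be simulated outside Hadoop. Here is a minimal pure-Python sketch of the map -> sort/shuffle -> reduce flow, using word count as the example (the function names are illustrative, not part of the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # (K1, V1) -> List(K2, V2): emit (word, 1) for every word in the line
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # (K2, List(V2)) -> List(K3, V3): sum the counts for one word
    return [(key, sum(values))]

def run_job(records):
    # MAP phase: apply map_fn to every input record
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # SORT, MERGE & SHUFFLE: group intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

print(run_job([("line1", "a b a"), ("line2", "b c")]))
# -> [('a', 2), ('b', 2), ('c', 1)]
```

In real Hadoop the map and reduce tasks run in parallel on different machines; this sketch only shows the data flow.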
Hadoop MapReduce Application
- Implemented in Java
- Components:
- Mapper
- Reducer
- Job Configuration
- Can also be written in other languages (Python, Perl, Shell, etc.) using Hadoop Streaming.
Complete Application
public class YourApp {
Mapper {
}
Reducer {
}
Job Configuration {
}
}
Mapper Class & Function
public static class YourMap extends Mapper<K1, V1, K2, V2> {
    public void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, value
        // K2 NewKey
        // V2 NewValue
        context.write(NewKey, NewValue);
    }
}
Reducer Class & Function
public static class YourReduce extends Reducer<K2, V2, K3, V3> {
    public void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, values
        // K3 NewKey
        // V3 NewValue
        context.write(NewKey, NewValue);
    }
}
What are I/O Formats?
Job Configuration
public static void main(String[] args) throws Exception {
    //Create Configuration
    Configuration conf = new Configuration();
    //Create Job
    Job job = new Job(conf, "YourApp");
    //Specify Input Directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    //Specify Output Directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    //Specify Input Split Format By Which Mapper Reads <K, V>
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    //Specify Output Format By Which Mapper Emits <K, V>
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    //Specify Output <K, V> Types Emitted By Reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setJarByClass(org.myorg.YourApp.class);
    job.waitForCompletion(true);
}
Reverse Indexing Application
Input File: /hdfs/f1.dat
f1 w1 w2 w3 w4
f2 w2 w3 w4 w5
f3 w3 w4 w5 w6
Output File: /hdfs_op/o1.dat
w1 f1
w2 f1, f2
w3 f1, f2, f3
w4 f1, f2, f3
w5 f2, f3
w6 f3
[Diagram: the Job (CONF + MAP + REDUCE) is submitted to the Hadoop System.]
Mapper & Reducer Algo
Mapper:
read line in K<filename>, V<rest contents>
tokenize V
for every token t:
emit K<t>, V<filename>
Reducer:
receive K<token>, V[ ] <filenames>
make unique list of V [ ]
form a comma separated string of filenames in V [ ] as str
emit K<token>, V<str>
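As a quick sanity check of the algorithm above, here is a minimal pure-Python sketch (the helper names are illustrative, not Hadoop API) that runs the mapper and reducer over the example input and reproduces the expected output:

```python
def mapper(filename, contents):
    # emit (token, filename) for every token in the line
    return [(token, filename) for token in contents.split()]

def reducer(token, filenames):
    # make a unique list of filenames and join it into a comma-separated string
    unique = sorted(set(filenames))
    return (token, ", ".join(unique))

def reverse_index(records):
    # shuffle: group the mapper output by token
    grouped = {}
    for filename, contents in records:
        for token, fname in mapper(filename, contents):
            grouped.setdefault(token, []).append(fname)
    return [reducer(token, fnames) for token, fnames in sorted(grouped.items())]

records = [("f1", "w1 w2 w3 w4"), ("f2", "w2 w3 w4 w5"), ("f3", "w3 w4 w5 w6")]
for token, files in reverse_index(records):
    print(token, files)
# w1 f1
# w2 f1, f2
# w3 f1, f2, f3
# w4 f1, f2, f3
# w5 f2, f3
# w6 f3
```

The dict-based grouping here plays the role of Hadoop's sort/merge/shuffle phase.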
Actual Java Program: Mapper
public static class Map extends Mapper<Text, Text, Text, Text> {
    private Text word = new Text();
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String temp = tokenizer.nextToken();
            // Strip the last character if it is not a letter (e.g. trailing punctuation)
            if (!temp.matches(".*[a-zA-Z]$")) {
                word.set(temp.substring(0, temp.length() - 1));
            } else {
                word.set(temp);
            }
            context.write(word, key);
        }
    }
}
Actual Java Program: Reducer
public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String doc_list = "";
        // Deduplicate filenames by using them as HashMap keys
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (Text val : values) {
            map.put(val.toString(), 1);
        }
        // Build a comma-separated string of the unique filenames
        Iterator<String> keySetIterator = map.keySet().iterator();
        while (keySetIterator.hasNext()) {
            String k = keySetIterator.next();
            doc_list += k + ",";
        }
        // Drop the trailing comma
        if (doc_list.length() > 0 && doc_list.charAt(doc_list.length() - 1) == ',') {
            doc_list = doc_list.substring(0, doc_list.length() - 1);
        }
        context.write(key, new Text(doc_list));
    }
}
Actual Java Program: Main()
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "reverse-index");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Map.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setJarByClass(org.myorg.ReverseIndex.class);
job.waitForCompletion(true);
}
Actual Java Program: Complete App
package org.myorg;
//IMPORT RELEVANT API Libraries
public class AppName {
Mapper() {
}
Reducer() {
}
Main() {
}
}
How to Execute?
- Compile:
/usr/java/jdk1.7.0_25/bin/javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d classes ip1/ReverseIndex.java
- Make a JAR:
/usr/java/jdk1.7.0_25/bin/jar -cvf jar/reverse_index.jar -C classes/ .
- Submit the JAR as a JOB:
hadoop jar jar/reverse_index.jar org.myorg.ReverseIndex ip op
DEMO
Important Links
A few examples at my github:
https://github.com/dkuldeep11/hadoop
Clear Basics: https://www.udacity.com/course/ud617
Hadoop MR Concept:
http://developer.yahoo.com/hadoop/tutorial/module4.html#basics
MR Coding Basics:
http://hadoop.apache.org/docs/stable1/mapred_tutorial.html
In Depth: http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-mapreduce-programming/
Thank You!
Q/A
