Hadoop Installation &
MapReduce Programming
CS267 - Data Mining & Machine Learning
-Kuldeep Dhole
WHW
Why: To be able to deal with Big Data Mining.
How: By learning Hadoop & MR programming
What: Hadoop Installation, HDFS basics, & MR
programming for Hadoop
Hadoop Installation
Amazon EC2 cloud - Cloudera's Hadoop Installation:
https://www.dropbox.com/s/s8zc3iwlq936hak/Amazon_Cloudera_Hadoop.pdf
Hadoop Components
- HDFS (Hadoop Distributed File System)
- MapReduce Model
HDFS Shell
CLUSTER / LOCAL MACHINE
File system of the local OS (Linux, Windows, etc.), e.g. /home/user1:
> ls -l
> mv f1 f2
> cp f1 f2
HDFS (e.g. /tmp) has its own shell commands:
> hadoop fs -ls
> hadoop fs -mv hdfs_f1 hdfs_f2
> hadoop fs -cp hdfs_f1 hdfs_f2
- You need to transfer data: LOCAL FS <-> HDFS
- The same concept applies to all machines in the cluster, and the Hadoop realm on all machines is kept in sync.
MapReduce Concept
- Programming Model for Distributed Parallel
Computing.
- Used on scalable commodity hardware cluster.
- Can process Big Data (100s of GBs, TBs)
- Based on Key-Value structure.
- Parallel MAP tasks, which emit <K, V> data
- Parallel REDUCE tasks, which process <K, V[ ]> data
MapReduce Model
[Diagram: Map tasks M1-M4 each take <K, V> input and emit <K, V> pairs; the Sort, Merge & Shuffle phase groups them into <K1, V[ ]> ... <K4, V[ ]> lists; Reduce tasks R1-R4 consume these lists and emit the final <K, V> output.]
MapReduce Model In Brief
(K1, V1) -> MAP -> List(K2, V2)
(K2, List(V2)) -> REDUCE -> List(K3, V3)
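The two steps above can be simulated outside Hadoop. Here is a minimal pure-Python sketch of the map -> sort/shuffle -> reduce flow, using word count as the example (the function names are illustrative, not part of the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # (K1, V1) -> List(K2, V2): emit (word, 1) for every word in the line
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # (K2, List(V2)) -> List(K3, V3): sum the counts for one word
    return [(key, sum(values))]

def run_job(records):
    # MAP phase: apply map_fn to every input record
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # SORT, MERGE & SHUFFLE: group intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

print(run_job([("line1", "a b a"), ("line2", "b c")]))
# -> [('a', 2), ('b', 2), ('c', 1)]
```

In real Hadoop the map and reduce tasks run in parallel on different machines; this sketch only shows the data flow.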
Hadoop MapReduce Application
- Implemented in Java
- Components:
- Mapper
- Reducer
- Job Configuration
- Can also be written in other languages (Python, Perl, Shell, etc.) using Hadoop Streaming.
Complete Application
public class YourApp {
Mapper {
}
Reducer {
}
Job Configuration {
}
}
Mapper Class & Function
public static class YourMap extends Mapper<K1, V1, K2, V2> {
    public void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, value
        // K2 NewKey
        // V2 NewValue
        context.write(NewKey, NewValue);
    }
}
Reducer Class & Function
public static class YourReduce extends Reducer<K2, V2, K3, V3> {
    public void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, values
        // K3 NewKey
        // V3 NewValue
        context.write(NewKey, NewValue);
    }
}
What are I/O Formats?
Job Configuration
public static void main(String[] args) throws Exception {
    //Create Configuration
    Configuration conf = new Configuration();
    //Create Job
    Job job = new Job(conf, "YourApp");
    //Specify Input Directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    //Specify Output Directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    //Specify Input Split Format By Which Mapper Reads <K, V>
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    //Specify Output Format By Which Mapper Emits <K, V>
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    //Specify Output <K, V> Types Emitted By Reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setJarByClass(org.myorg.YourApp.class);
    job.waitForCompletion(true);
}
Reverse Indexing Application
Input File: /hdfs/f1.dat
f1 w1 w2 w3 w4
f2 w2 w3 w4 w5
f3 w3 w4 w5 w6
Output File: /hdfs_op/o1.dat
w1 f1
w2 f1, f2
w3 f1, f2, f3
w4 f1, f2, f3
w5 f2, f3
w6 f3
[Diagram: the Job (CONF + MAP + REDUCE) is submitted to the Hadoop System.]
Mapper & Reducer Algo
Mapper:
read line in K<filename>, V<rest contents>
tokenize V
for every token t:
emit K<t>, V<filename>
Reducer:
receive K<token>, V[ ] <filenames>
make unique list of V [ ]
form a comma separated string of filenames in V [ ] as str
emit K<token>, V<str>
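As a quick sanity check of the algorithm above, here is a minimal pure-Python sketch (the helper names are illustrative, not Hadoop API) that runs the mapper and reducer over the example input and reproduces the expected output:

```python
def mapper(filename, contents):
    # emit (token, filename) for every token in the line
    return [(token, filename) for token in contents.split()]

def reducer(token, filenames):
    # make a unique list of filenames and join it into a comma-separated string
    unique = sorted(set(filenames))
    return (token, ", ".join(unique))

def reverse_index(records):
    # shuffle: group the mapper output by token
    grouped = {}
    for filename, contents in records:
        for token, fname in mapper(filename, contents):
            grouped.setdefault(token, []).append(fname)
    return [reducer(token, fnames) for token, fnames in sorted(grouped.items())]

records = [("f1", "w1 w2 w3 w4"), ("f2", "w2 w3 w4 w5"), ("f3", "w3 w4 w5 w6")]
for token, files in reverse_index(records):
    print(token, files)
# w1 f1
# w2 f1, f2
# w3 f1, f2, f3
# w4 f1, f2, f3
# w5 f2, f3
# w6 f3
```

The dict-based grouping here plays the role of Hadoop's sort/merge/shuffle phase.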
Actual Java Program: Mapper
public static class Map extends Mapper<Text, Text, Text, Text> {
    private Text word = new Text();
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String temp = tokenizer.nextToken();
            // Strip the last character if it is not a letter (e.g. trailing punctuation)
            if (!temp.matches(".*[a-zA-Z]$")) {
                word.set(temp.substring(0, temp.length() - 1));
            } else {
                word.set(temp);
            }
            context.write(word, key);
        }
    }
}
Actual Java Program: Reducer
public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String doc_list = "";
        // Deduplicate filenames by using them as HashMap keys
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (Text val : values) {
            map.put(val.toString(), 1);
        }
        // Build a comma-separated string of the unique filenames
        Iterator<String> keySetIterator = map.keySet().iterator();
        while (keySetIterator.hasNext()) {
            String k = keySetIterator.next();
            doc_list += k + ",";
        }
        // Drop the trailing comma
        if (doc_list.length() > 0 && doc_list.charAt(doc_list.length() - 1) == ',') {
            doc_list = doc_list.substring(0, doc_list.length() - 1);
        }
        context.write(key, new Text(doc_list));
    }
}
Actual Java Program: Main()
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "reverse-index");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Map.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setJarByClass(org.myorg.ReverseIndex.class);
job.waitForCompletion(true);
}
Actual Java Program: Complete App
package org.myorg;
//IMPORT RELEVANT API Libraries
public class AppName {
Mapper() {
}
Reducer() {
}
Main() {
}
}
How to Execute?
- Compile:
/usr/java/jdk1.7.0_25/bin/javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d classes ip1/ReverseIndex.java
- Make a JAR:
/usr/java/jdk1.7.0_25/bin/jar -cvf jar/reverse_index.jar -C classes/ .
- Submit the JAR as a JOB:
hadoop jar jar/reverse_index.jar org.myorg.ReverseIndex ip op
DEMO
Important Links
A few examples at my github:
https://github.com/dkuldeep11/hadoop
Clear Basics: https://www.udacity.com/course/ud617
Hadoop MR Concept:
http://developer.yahoo.com/hadoop/tutorial/module4.html#basics
MR Coding Basics:
http://hadoop.apache.org/docs/stable1/mapred_tutorial.html
In Depth: http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-mapreduce-programming/
Thank You!
Q/A
